查看原文
其他

GATK4的CNV流程-hg38

jimmy 生信技能树 2022-06-07


至少gatk-4.0.2.1.zip无法走CNV流程,我重新下载了目前最新版的才能顺利运行:

wget https://github.com/broadinstitute/gatk/releases/download/4.0.3.0/gatk-4.0.3.0.zip

首先制作外显子坐标记录文件

##follow pdf from workshop
## /home/jianmingzeng/biosoft/GATK/resources/bundle/hg38
##bed to intervals_list
cat ~/annotation/CCDS/human/exon_probe.hg38.gene.bed|awk '{print "chr"$0}' >hg38.chr.bed
java -jar ~/biosoft/picardtools/2.9.2/picard.jar  BedToIntervalList \
I=hg38.chr.bed \
O=list.interval_list \
SD=/home/jianmingzeng/biosoft/GATK/resources/bundle/hg38/Homo_sapiens_assembly38.dict

#
#Preprocess Intervals

/home/jianmingzeng/biosoft/GATK/gatk-4.0.2.1/gatk  PreprocessIntervals \
-L list.interval_list \
--sequence-dictionary /home/jianmingzeng/biosoft/GATK/resources/bundle/hg38/Homo_sapiens_assembly38.dict \
--reference /home/jianmingzeng/biosoft/GATK/resources/bundle/hg38/Homo_sapiens_assembly38.fasta  \
--padding 250 \
--bin-length 0 \
--interval-merging-rule OVERLAPPING_ONLY \
--output targets.preprocessed.interval.list

input文件是 exon_probe.hg38.gene.bed,见我前面的教程(数据处理过程中有的是意外),制作得到 targets.preprocessed.interval.list 这个文件后面需要用,如下:

tail /home/jianmingzeng/biosoft/GATK/resources/bundle/hg38/cnv/targets.preprocessed.interval.list
chrY    24868969    24869539    +   .
chrY    24895438    24896008    +   .
chrY    24896255    24896788    +   .
chrY    24900862    24901424    +   .
chrY    25037848    25038365    +   .
chrY    25038559    25039163    +   .
chrY    25041519    25042135    +   .
chrY    25043696    25044272    +   .
chrY    25622193    25624259    +   .
chrY    25624260    25624776    +   .

然后把bam文件转为外显子reads数

module load java/1.8.0_91
## wget https://github.com/broadinstitute/gatk/releases/download/4.0.3.0/gatk-4.0.3.0.zip
GENOME=/home/jianmingzeng/biosoft/GATK/resources/bundle/hg38/Homo_sapiens_assembly38.fasta
dict=/home/jianmingzeng/biosoft/GATK/resources/bundle/hg38/Homo_sapiens_assembly38.dict
INDEX=/home/jianmingzeng/biosoft/GATK/resources/bundle/hg38/bwa_index/gatk_hg38
GATK=/home/jianmingzeng/biosoft/GATK/gatk-4.0.2.1/gatk
GATK=/home/jianmingzeng/biosoft/GATK/gatk-4.0.3.0/gatk
DBSNP=/home/jianmingzeng/biosoft/GATK/resources/bundle/hg38/dbsnp_146.hg38.vcf.gz
kgSNP=/home/jianmingzeng/biosoft/GATK/resources/bundle/hg38/1000G_phase1.snps.high_confidence.hg38.vcf.gz
kgINDEL=/home/jianmingzeng/biosoft/GATK/resources/bundle/hg38/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
tl=/home/jianmingzeng/biosoft/GATK/resources/bundle/hg38/cnv/targets.preprocessed.interval.list

for i in  /home/jianmingzeng/data/public/oscc/WES/gvcf/*.bam
do
j=$(basename "$i" _recal.bam)
echo $j
## step1 : CollectReadCounts
time $GATK  --java-options "-Xmx10G -Djava.io.tmpdir=./" CollectReadCounts \
-I $i \
-L $tl \
-R $GENOME \
--format HDF5  \
--interval-merging-rule OVERLAPPING_ONLY \
--output $j.clean_counts.hdf5
## step2 : CollectAllelicCounts
time $GATK  --java-options "-Xmx10G -Djava.io.tmpdir=./" CollectAllelicCounts \
-I $i \
-L $tl \
-R $GENOME \
-O $j.allelicCounts.tsv
done
### 注意这个CollectAllelicCounts步骤非常耗时,而且占空间

mkdir allelicCounts
mv *.allelicCounts.tsv ./allelicCounts
mkdir counts
mv *.clean_counts.hdf5  ./counts

接着合并所有的normal样本的数据创建 cnvponM.pon.hdf5

GATK=/home/jianmingzeng/biosoft/GATK/gatk-4.0.3.0/gatk
$GATK  --java-options "-Xmx20g" CreateReadCountPanelOfNormals \
--minimum-interval-median-percentile 5.0 \
--output cnvponM.pon.hdf5 \
--input counts/OSCC_01_N.clean_counts.hdf5 \
--input counts/OSCC_04_N.clean_counts.hdf5

值得注意的是这个cnvponM.pon.hdf5文件, h5py文件是存放两类对象的容器,数据集(dataset)和组(group),dataset类似数组类的数据集合,和numpy的数组差不多。group是像文件夹一样的容器,它好比python中的字典,有键(key)和值(value)。group中可以存放dataset或者其他的group。”键”就是组成员的名称,”值”就是组成员对象本身(组或者数据集)。

可能会报错,如下:

HDF5-DIAG: Error detected in HDF5 (1.8.14) thread 0:
 #000: /mnt/scr1/abyrne/HDFJava-platypus-2.11/native/HDF5-prefix/src/HDF5/src/H5F.c line 604 in H5Fopen(): unable to open file
   major: File accessibilty
   minor: Unable to open file
 #001: /mnt/scr1/abyrne/HDFJava-platypus-2.11/native/HDF5-prefix/src/HDF5/src/H5Fint.c line 1085 in H5F_open(): unable to read superblock
   major: File accessibilty
   minor: Read failed
 #002: /mnt/scr1/abyrne/HDFJava-platypus-2.11/native/HDF5-prefix/src/HDF5/src/H5Fsuper.c line 277 in H5F_super_read(): file signature not found
   major: File accessibilty
   minor: Not an HDF5 file

因为我前面步骤选择了 --format TSV 后来修改成了 --format HDF5

最后走真正的CNV流程

GATK=/home/jianmingzeng/biosoft/GATK/gatk-4.0.3.0/gatk
for i in ./counts/*
do
j=$(basename "$i" .clean_counts.hdf5)
$GATK  --java-options "-Xmx20g" DenoiseReadCounts \
-I $i \
--count-panel-of-normals q \
--standardized-copy-ratios $j.clean.standardizedCR.tsv \
--denoised-copy-ratios $j.clean.denoisedCR.tsv
done

mkdir denoisedCR standardizedCR segments cnv_plots
mv *denoisedCR.tsv ./denoisedCR
mv *standardizedCR.tsv ./standardizedCR

for i in denoisedCR/*
do
j=$(basename "$i" .clean.denoisedCR.tsv)
## ModelSegments的时候有两个策略,是否利用CollectAllelicCounts的结果
$GATK   --java-options "-Xmx20g" ModelSegments \
--denoised-copy-ratios $i \
--output segments \
--output-prefix $j
## 如果要利用CollectAllelicCounts的结果就需要增加两个参数,这里就不讲解了。

$
GATK   --java-options "-Xmx20g" CallCopyRatioSegments \
-I segments/$j.cr.seg \
-O segments/$j.clean.called.seg

done

这里面有两个绘图函数,PlotDenoisedCopyRatios 和 PlotModeledSegments ,可以选择性运行。

$GATK   --java-options "-Xmx20g" PlotDenoisedCopyRatios \
--standardized-copy-ratios ./standardizedCR/$j.clean.standardizedCR.tsv \
--denoised-copy-ratios $i \
--sequence-dictionary $dict \
--output cnv_plots \
--output-prefix $j

$
GATK   --java-options "-Xmx20g" PlotModeledSegments \
--denoised-copy-ratios $i \
--segments segments/$j.modelFinal.seg \
--sequence-dictionary $dict \
--output cnv_plots \
--output-prefix $j.clean

更多小功能,可以参见GATK培训PPT咯。

绘图必须的R包有

install.packages('optparse')
data.table
ggplot2

生信技能树GATK4系列教程

GATK4的gvcf流程

你以为的可能不是你以为的

新鲜出炉的GATK4培训教材全套PPT,赶快下载学习吧

曾老湿最新私已:GATK4实战教程

我不想赚你的钱,不行吗? (推荐阅读)

值得一提的是,对肿瘤外显子来分析CNV, 我测试过很多工具了,这个GATK的值得一试!

WES的CNV探究-conifer软件使用

单个样本NGS数据如何做拷贝数变异分析呢

肿瘤配对样本用varscan 做cnv分析

使用cnvkit来对大批量wes样本找cnv

您可能也对以下帖子感兴趣

文章有问题?点此查看未经处理的缓存