其他
GATK4的CNV流程-hg38
至少gatk-4.0.2.1.zip
无法走CNV流程,我重新下载了目前最新版的才能顺利运行:
wget https://github.com/broadinstitute/gatk/releases/download/4.0.3.0/gatk-4.0.3.0.zip
首先制作外显子坐标记录文件
#follow pdf from workshop
# /home/jianmingzeng/biosoft/GATK/resources/bundle/hg38
#bed to intervals_list
cat ~/annotation/CCDS/human/exon_probe.hg38.gene.bed|awk '{print "chr"$0}' >hg38.chr.bed
java -jar ~/biosoft/picardtools/2.9.2/picard.jar BedToIntervalList \
I=hg38.chr.bed \
O=list.interval_list \
SD=/home/jianmingzeng/biosoft/GATK/resources/bundle/hg38/Homo_sapiens_assembly38.dict
#Preprocess Intervals
/home/jianmingzeng/biosoft/GATK/gatk-4.0.2.1/gatk PreprocessIntervals \
-L list.interval_list \
--sequence-dictionary /home/jianmingzeng/biosoft/GATK/resources/bundle/hg38/Homo_sapiens_assembly38.dict \
--reference /home/jianmingzeng/biosoft/GATK/resources/bundle/hg38/Homo_sapiens_assembly38.fasta \
--padding 250 \
--bin-length 0 \
--interval-merging-rule OVERLAPPING_ONLY \
--output targets.preprocessed.interval.list
input文件是 exon_probe.hg38.gene.bed
,见我前面的教程(数据处理过程中有的是意外),制作得到 targets.preprocessed.interval.list
这个文件后面需要用,如下:
tail /home/jianmingzeng/biosoft/GATK/resources/bundle/hg38/cnv/targets.preprocessed.interval.list
chrY 24868969 24869539 + .
chrY 24895438 24896008 + .
chrY 24896255 24896788 + .
chrY 24900862 24901424 + .
chrY 25037848 25038365 + .
chrY 25038559 25039163 + .
chrY 25041519 25042135 + .
chrY 25043696 25044272 + .
chrY 25622193 25624259 + .
chrY 25624260 25624776 + .
然后把bam文件转为外显子reads数
module load java/1.8.0_91
# wget https://github.com/broadinstitute/gatk/releases/download/4.0.3.0/gatk-4.0.3.0.zip
GENOME=/home/jianmingzeng/biosoft/GATK/resources/bundle/hg38/Homo_sapiens_assembly38.fasta
dict=/home/jianmingzeng/biosoft/GATK/resources/bundle/hg38/Homo_sapiens_assembly38.dict
INDEX=/home/jianmingzeng/biosoft/GATK/resources/bundle/hg38/bwa_index/gatk_hg38
GATK=/home/jianmingzeng/biosoft/GATK/gatk-4.0.2.1/gatk
GATK=/home/jianmingzeng/biosoft/GATK/gatk-4.0.3.0/gatk
DBSNP=/home/jianmingzeng/biosoft/GATK/resources/bundle/hg38/dbsnp_146.hg38.vcf.gz
kgSNP=/home/jianmingzeng/biosoft/GATK/resources/bundle/hg38/1000G_phase1.snps.high_confidence.hg38.vcf.gz
kgINDEL=/home/jianmingzeng/biosoft/GATK/resources/bundle/hg38/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
tl=/home/jianmingzeng/biosoft/GATK/resources/bundle/hg38/cnv/targets.preprocessed.interval.list
for i in /home/jianmingzeng/data/public/oscc/WES/gvcf/*.bam
do
j=$(basename "$i" _recal.bam)
echo $j
# step1 : CollectReadCounts
time $GATK --java-options "-Xmx10G -Djava.io.tmpdir=./" CollectReadCounts \
-I $i \
-L $tl \
-R $GENOME \
--format HDF5 \
--interval-merging-rule OVERLAPPING_ONLY \
--output $j.clean_counts.hdf5
# step2 : CollectAllelicCounts
time $GATK --java-options "-Xmx10G -Djava.io.tmpdir=./" CollectAllelicCounts \
-I $i \
-L $tl \
-R $GENOME \
-O $j.allelicCounts.tsv
done
### 注意这个CollectAllelicCounts步骤非常耗时,而且占空间
mkdir allelicCounts
mv *.allelicCounts.tsv ./allelicCounts
mkdir counts
mv *.clean_counts.hdf5 ./counts
接着合并所有的normal样本的数据创建 cnvponM.pon.hdf5
GATK=/home/jianmingzeng/biosoft/GATK/gatk-4.0.3.0/gatk
GATK --java-options "-Xmx20g" CreateReadCountPanelOfNormals \
--minimum-interval-median-percentile 5.0 \
--output cnvponM.pon.hdf5 \
--input counts/OSCC_01_N.clean_counts.hdf5 \
--input counts/OSCC_04_N.clean_counts.hdf5
值得注意的是这个cnvponM.pon.hdf5文件, h5py文件是存放两类对象的容器,数据集(dataset)和组(group),dataset类似数组类的数据集合,和numpy的数组差不多。group是像文件夹一样的容器,它好比python中的字典,有键(key)和值(value)。group中可以存放dataset或者其他的group。”键”就是组成员的名称,”值”就是组成员对象本身(组或者数据集)。
可能会报错,如下:
HDF5-DIAG: Error detected in HDF5 (1.8.14) thread 0:
#000: /mnt/scr1/abyrne/HDFJava-platypus-2.11/native/HDF5-prefix/src/HDF5/src/H5F.c line 604 in H5Fopen(): unable to open file
major: File accessibilty
minor: Unable to open file
#001: /mnt/scr1/abyrne/HDFJava-platypus-2.11/native/HDF5-prefix/src/HDF5/src/H5Fint.c line 1085 in H5F_open(): unable to read superblock
major: File accessibilty
minor: Read failed
#002: /mnt/scr1/abyrne/HDFJava-platypus-2.11/native/HDF5-prefix/src/HDF5/src/H5Fsuper.c line 277 in H5F_super_read(): file signature not found
major: File accessibilty
minor: Not an HDF5 file
因为我前面步骤选择了 --format TSV
后来修改成了 --format HDF5
最后走真正的CNV流程
GATK=/home/jianmingzeng/biosoft/GATK/gatk-4.0.3.0/gatk
for i in ./counts/*
do
j=$(basename "$i" .clean_counts.hdf5)
GATK --java-options "-Xmx20g" DenoiseReadCounts \
-I $i \
--count-panel-of-normals q \
--standardized-copy-ratios $j.clean.standardizedCR.tsv \
--denoised-copy-ratios $j.clean.denoisedCR.tsv
done
mkdir denoisedCR standardizedCR segments cnv_plots
mv *denoisedCR.tsv ./denoisedCR
mv *standardizedCR.tsv ./standardizedCR
for i in denoisedCR/*
do
j=$(basename "$i" .clean.denoisedCR.tsv)
# ModelSegments的时候有两个策略,是否利用CollectAllelicCounts的结果
GATK --java-options "-Xmx20g" ModelSegments \
--denoised-copy-ratios $i \
--output segments \
--output-prefix $j
# 如果要利用CollectAllelicCounts的结果就需要增加两个参数,这里就不讲解了。
GATK --java-options "-Xmx20g" CallCopyRatioSegments \
-I segments/$j.cr.seg \
-O segments/$j.clean.called.seg
done
这里面有两个绘图函数,PlotDenoisedCopyRatios
和 PlotModeledSegments
,可以选择性运行。
GATK --java-options "-Xmx20g" PlotDenoisedCopyRatios \
--standardized-copy-ratios ./standardizedCR/$j.clean.standardizedCR.tsv \
--denoised-copy-ratios $i \
--sequence-dictionary $dict \
--output cnv_plots \
--output-prefix $j
GATK --java-options "-Xmx20g" PlotModeledSegments \
--denoised-copy-ratios $i \
--segments segments/$j.modelFinal.seg \
--sequence-dictionary $dict \
--output cnv_plots \
--output-prefix $j.clean
更多小功能,可以参见GATK培训PPT咯。
绘图必须的R包有
install.packages('optparse')
data.table
ggplot2
生信技能树GATK4系列教程
我不想赚你的钱,不行吗? (推荐阅读)
值得一提的是,对肿瘤外显子来分析CNV, 我测试过很多工具了,这个GATK的值得一试!