肿瘤外显子数据处理系列教程(七)maftools可视化
大家好,上一周我们用王凯老师的ANNOVAR软件给vcf文件做了注释,接下来我们通过强大的可视化R包maftools来做一个可视化,看看我们的样本的突变情况。
上节补充
上一节我们在用ANNOVAR做联合注释的脚本应该适当做一下修改,增加refGene的注释,即改为:
cat annovar.sh
cat config|while read id
do
~/biosoft/ANNOVAR/annovar/table_annovar.pl ${id}.vcf ~/biosoft/ANNOVAR/annovar/humandb/ \
-buildver hg38 \
-out ${id} \
-remove \
-protocol refGene,knownGene,clinvar_20170905 \
-operation g,g,f \
-nastring . \
-vcfinput
done
这样修改的原因是我们后面要使用maftools
中一个函数annovarToMaf
将ANNOVAR注释的结果转为maf,而这个函数annovarToMaf
要求要有refGene或者ensGene的注释。
数据处理
当我们用ANNOVAR注释好了之后,会得到3个文件,其中一个是以txt为后缀,如:case1_biorep_A_techrep_filter.hg38_multianno.txt
,我们看一下这个文件的前两行,即列名和第一条变异注释的详细信息:
head -2 case1_biorep_A_techrep_filter.hg38_multianno.txt
Chr Start End Ref Alt Func.refGene Gene.refGene GeneDetail.refGene ExonicFunc.refGene AAChange.refGene Func.knownGene Gene.knownGene GeneDetail.knownGene ExonicFunc.knownGene AAChange.knownGene CLINSIG CLNDBN CLNACC CLNDSDB CLNDSDBID Otherinfo
chr1 6146376 6146376 G T exonic CHD5 . nonsynonymous SNV CHD5:NM_015557:exon11:c.C1638A:p.N546K exonic CHD5 . nonsynonymous SNV CHD5:uc001amb.3:exon11:c.C1638A:p.N546K,CHD5:uc057btb.1:exon11:c.C1638A:p.N546K . . 0.25 . 75 chr1 6146376 . G T . PASS DP=184;ECNT=1;NLOD=22.56;N_ART_LOD=-1.888e+00;POP_AF=1.000e-05;P_CONTAM=0.00;P_GERMLINE=-4.514e+01;TLOD=5.41 GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:SA_MAP_AF:SA_POST_PROB 0/1:97,3:0.058:48,1:49,2:37,38:154,210:60:20:0.020,0.030,0.030:0.018,3.223e-03,0.979 0/0:75,0:0.020:45,0:30,0:36,0:154,0:0:0
接下来我们需要做一个处理,就是在这个txt文件的每一条变异信息中增加一列,为样本名,而最后的一列Otherinfo
我们暂时丢弃掉,处理的方法如下:
cat config |head -1|while read id;do (grep -v 'Chr' ${id}_filter.hg38_multianno.txt |cut -f 1-20|awk '{print $0"\t""'${id}'"}' >${id}.maf );done
同样config中是每一个样本的样本名:
cat config
case1_biorep_A_techrep
case2_biorep_A
case3_biorep_A
......
case4_techrep_2
case5_techrep_2
case6_techrep_2
这样每个样本就得到一个新的文件,暂且以maf为后缀,里面的每一条记录的最后都增加了样本名,如:
chr1 6146376 6146376 G T exonic CHD5 . nonsynonymous SNV CHD5:NM_015557:exon11:c.C1638A:p.N546K exonic CHD5 . nonsynonymous SNV CHD5:uc001amb.3:exon11:c.C1638A:p.N546K,CHD5:uc057btb.1:exon11:c.C1638A:p.N546K . . . . . case1_biorep_A_techrep
然后还需要增加一行列名,可以从原先的文件中获取第一行即列名,把最后的Otherinfo
替换成Tumor_Sample_Barcode
,如下:
head -1 case1_biorep_A_techrep_filter.hg38_multianno.txt|sed 's/Otherinfo/Tumor_Sample_Barcode/' >header
最后我们做一下合并:
cat header *maf >all_sample.maf
查看一下最后的文件前两行:
Chr Start End Ref Alt Func.refGene Gene.refGene GeneDetail.refGene ExonicFunc.refGene AAChange.refGene Func.knownGene Gene.knownGene GeneDetail.knownGene ExonicFunc.knownGene AAChange.knownGene CLINSIG CLNDBN CLNACC CLNDSDB CLNDSDBID Tumor_Sample_Barcode
chr1 6146376 6146376 G T exonic CHD5 . nonsynonymous SNV CHD5:NM_015557:exon11:c.C1638A:p.N546K exonic CHD5 . nonsynonymous SNV CHD5:uc001amb.3:exon11:c.C1638A:p.N546K,CHD5:uc057btb.1:exon11:c.C1638A:p.N546K . . . . . case1_biorep_A_techrep
maftools可视化
maftools是一个非常强大的可视化软件,我已经多次向大家推荐学习:maftools,在这里我们用到了其中一个函数annovarToMaf
## 清空环境
rm(list = ls())
options(stringsAsFactors = F)
## 安装或载入maftools包
if (!require("maftools"))
BiocManager::install("maftools")
library('maftools')
## 读取数据
var.annovar.maf <- annovarToMaf(annovar = "all_sample.maf",
refBuild = 'hg38',
tsbCol = 'Tumor_Sample_Barcode',
table = 'refGene',
MAFobj = T)
laml = var.annovar.maf
## 检查数据
unique(laml@data$Tumor_Sample_Barcode)
getSampleSummary(laml)
getGeneSummary(laml)
getFields(laml)
save(laml, file = 'input_merge.Rdata')
rm(list = ls())
require(maftools)
options(stringsAsFactors = F)
load(file = 'input_merge.Rdata')
project = 'merge'
## 突变类型分布
png(paste0('plotmafSummary_', project, '.png'),res = 150,width = 1080,height = 1080)
plotmafSummary(maf = laml,
rmOutlier = TRUE,
showBarcodes = T,
textSize = 0.4,
addStat = 'median',
dashboard = TRUE,
titvRaw = FALSE)
dev.off()
## top30基因突变全景图
png(paste0('oncoplot_top30_', project, '.png'),res = 150,width = 1080,height = 1080)
oncoplot(maf = laml,
top = 30,
fontSize = 0.5,
showTumorSampleBarcodes = T)
dev.off()
## 突变可能导致TP53的氨基酸序列改变
png(paste0('TP53', '.png'),res = 150,width = 1080,height = 1080)
maftools::lollipopPlot(laml,
gene = 'TP53',
AACol = 'AAChange.refGene',
labelPos = 'all')
dev.off()
当然大家可能根据自己的情况,来进一步探索自己的数据,这里我们就展示到这里,今天我们就先到这里,下一周我们将对比一下几种注释ANNOVAR,VEP,GATK,SnpEff软件的注释之间的异同点,我们下周再见,祝大家国庆节快乐。