ANNOVAR的使用
这是一位非常热心的生信媛小伙伴写的一篇ANOOVAR软件使用说明。学习的最好的方式就是以教为学,通过分享自己习得的知识,既帮助了那些有需要的人,也可以检验自己的学习成果。
生信媛是一个热爱生信的小伙伴互相分享自己学习经验的地方,希望有更多的小伙伴加入生信媛中来。
什么是ANNOVAR
是一款可以利用数据库注释从各种基因组中call出来的variants的软件(包括human genome常见版本hg18, hg19, hg38,鼠类,线虫,果蝇,酵母等)
ANNOVAR的安装
下载地址为:. download需注册,使用教育机构后缀( .edu/.ac/.gov)的邮箱名即可。该程序是用perl语言写的,所以可以作为独立程序运行于各个已经安装Perl的系统。解压直接用即可。
使用案例
注: 下面的示例皆在linux系统中完成
ANNOVAR的安装包里自带了一些常用的数据库,在humandb/目录下
如果要进行其他注释,需要使用-downdb
命令下载数据库到 ‘humandb/’ 目录里:
#下载1000g2015Aug数据库
$perl annotate_variation.pl -buildver hg19 -downdb -webfrom annovar 1000g2015aug humandb/
在ANNOVAR的主页面有用于注释的各种数据库的描述,使用者可以根据自行去下载:
ANNOVAR的程序模块
ANNOVAR程序有以下几个模块:
ANNOVAR程序结构
│ annotate_variation.pl #主程序,功能包括下载数据库,三种不同的注释
│ coding_change.pl #可用来推断蛋白质序列
│ convert2annovar.pl #将多种格式转为.avinput的程序
│ retrieve_seq_from_fasta.pl #用于自行建立其他物种的转录本
│ table_annovar.pl #注释程序,可一次性完成三种类型的注释
│ variants_reduction.pl #可用来更灵活地定制过滤注释流程
│
├─example #存放示例文件
│
└─humandb #人类注释数据库
ANNOVAR的输入文件
ANNOVAR使用.avinput格式,如以上代码所示,该格式每列以tab分割,需要有以下几个信息:
染色体位置
起始位点
终止位点
参考基因组碱基
突变碱基
……
文件示例:
chrM 302 302 - C chrM 302 . A AC 93.73 PASS AC=4;AF=0.500;AN=8;ClippingRankSum=0.000;DP=121;ExcessHet=3.0103;MLEAC=1;MLEAF=0.500;set=Intersection GT:AD:DP:GQ:PL 0/1:16,9:25:99:183,0,384
chrM 963 963 T - chrM 962 . CT C 258.73 PASS AC=3;AF=0.500;AN=6;ClippingRankSum=0.000;DP=75;ExcessHet=3.0103;MLEAC=1;MLEAF=0.500;set=variant2-variant3-variant4 GT:AD:DP:GQ:PL 0/1:11,11:22:99:296,0,247
chr1 2178293 2178294 GG - chr1 2178292 . TGG T 128 PASS AC=6;AF=1.00;AN=6;DP=31;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;set=variant-variant2-variant3 GT:AD:DP:GQ:PL 1/1:0,4:4:12:165,12,0
chr1 2248382 2248382 - C chr1 2248382 . G GC 98.25 PASS AC=4;AF=1.00;AN=4;DP=5;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;set=variant2-variant3 GT:AD:DP:GQ:PL 1/1:0,3:3:9:135,9,0
chr1 3278899 3278899 - C chr1 3278899 . A AC 30.71 PASS AC=2;AF=1.00;AN=2;DP=2;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=37.00;QD=15.35;SOR=0.693;set=variant2 GT:AD:DP:GQ:PL 1/1:0,2:2:6:67,6,0
chr1 3817092 3817092 - T chr1 3817092 . C CT 22.73 PASS AC=2;AF=1.00;AN=2;DP=2;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=37.00;QD=11.36;SOR=0.693;set=variant2 GT:AD:DP:GQ:PL 1/1:0,2:2:6:59,6,0
chr1 6067261 6067261 - G chr1 6067261 . T TG 21.73 PASS AC=2;AF=1.00;AN=2;DP=2;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=37.00;QD=10.87;SOR=0.693;set=variant2 GT:AD:DP:GQ:PL 1/1:0,2:2:6:58,6,0
chr1 6211850 6211850 - C chr1 6211850 . A AC 30.71 PASS AC=2;AF=1.00;AN=2;DP=2;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=37.00;QD=15.35;SOR=0.693;set=variant2 GT:AD:DP:GQ:PL 1/1:0,2:2:6:67,6,0
chr1 7538772 7538772 - TTTA chr1 7538772 . C CTTTA 53.70 PASS AC=2;AF=1.00;AN=2;DP=2;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=37.00;QD=26.85;SOR=2.303;set=variant2 GT:AD:DP:GQ:PL 1/1:0,2:2:6:90,6,0
chr1 7709649 7709649 - T chr1 7709649 . C CT 62.74 PASS AC=2;AF=0.500;AN=4;ClippingRankSum=0.000;DP=15;ExcessHet=3.0103;MLEAC=1;MLEAF=0.500;MQ=37.00;MQRankSum=0.000;set=variant2-variant3 GT:AD:DP:GQ:PL 0/1:1,3:4:25:100,0,25
ANNOVAR输入文件的格式转换
$ convert2annovar.pl -format vcf4 example/ex2.vcf > ex2.avinput
# -format vcf4 指定格式为vcf
ANNOVAR主要使用convert2annovar.pl程序进行转换,转换后文件是精简过的,主要包含前面提到的5列内容,如果要将原格式的文件的所有内容都包含在转换后的.avinput文件中,可以使用-includeinfo参数;如果需要分开每个sample输出单一的.avinput文件,可以使用-allsample参数,等等。
ANNOVAR还主要支持以下格式转换:
SAMtools pileup format
Complete Genomics format
GFF3-SOLiD calling format
SOAPsnp calling format
MAQ calling format
CASAVA calling format
ANNOVAR注释功能
table_annovar.pl(可一次完成三种类型的注释)
使用ANNOVAR最简单的方法就是使用table_annovar.pl进行注释,它的输入文件可以是多种格式包括VCF,输出文件已Tab分隔,每一列代表着一种注释。
注释命令示例:
$~/biosoft/ANNOVAR/annovar/table_annovar.pl 15_indel_pre.avinput.hg19.variant2.avinput \
~/biosoft/ANNOVAR/annovar/humandb/ \
-buildver hg19 -out myanno -remove \
-protocol refGene,cytoBand,genomicSuperDups,esp6500siv2_all,1000g2015aug_all,1000g2015aug_eur,exac03,avsnp147,dbnsfp30a \
-operation g,r,r,f,f,f,f,f,f -nastring . -csvout
# -buildver hg19 表示使用hg19版本
# -out myanno 表示输出文件的前缀为myanno
# -remove 表示删除注释过程中的临时文件
# -protocol 表示注释使用的数据库,用逗号隔开,且要注意顺序
# -operation 表示对应顺序的数据库的类型(g代表gene-based、r代表region-based、f代表filter-based),用逗号隔开,注意顺序
# -nastring . 表示用点号替代缺省的值
# -csvout 表示最后输出.csv文件
输出的csv文件将包含输入的5列主要信息以及各个数据库里的注释,此外,table_annoval.pl可以直接对vcf文件进行注释(不需要转换格式),注释的内容将会放在vcf文件的“INFO”那一栏。
注释结果示例:
annotate_variation.pl
annotate_variation.pl的注释方式分为三种:
Gene-based annotation
Region-based annotation
Filter-based annotation
#三种命令示例,使用package自带数据进行注释,分别对应三种注释方式
annotate_variation.pl -geneanno -buildver hg19 example/ex1.avinput humandb/
annotate_variation.pl -regionanno -dbtype cytoBand -buildver hg19 example/ex1.avinput humandb/
annotate_variation.pl -filter -dbtype 1000g2014oct_all -buildver hg19 example/ex1.avinput humandb/
Gene-based annotation
顾名思义,Gene-based annotation是根据SNPs以及CNVs的位置信息来确定是否会造成编码序列以及开放阅读框的改变从而影响氨基酸的改变,使用者可以自主选择RefSeq genes, 包括UCSC genes, ENSEMBL genes, GENCODE genes, AceView genes等来进行注释。
命令示例:
$ annotate_variation.pl -geneanno -dbtype refGene -out ex1 -build hg19 example/ex1.avinput humandb/
# -geneanno 表示使用基于基因的注释
# -dbtype refGene 表示使用"refGene"数据库
# -out ex1 表示输出文件以ex1为前缀
因为annotate_variation.pl默认使用gene-based注释类型以及refGene数据库,所以上面的命令可以缺省-geneanno -dbtype refGene
运行结果会产生两个文件:如ex1.variant_function和ex1.exonic_variant_function
第一个文件(ex1.variant_function)包括对于所有突变的注释,通过在文件最前面加入两列,以tab分割
第一列为变异所在基因位置的类型,如外显子,内含子,UTR5,UTR3,基因间等
#ex1.variant_function
# 为方便了解每一列的含义,只输出第一行,然后把制表符改成换行符
[kaiwang@biocluster ~/]$ cat ex1.variant_function | head -n 1 | tr '\t' '\n'
UTR5
ISG15(NM_005101:c.-33T>C)
1
948921
948921
T
C
comments: rs15842, a SNP in 5' UTR of ISG15
第二列为对第一列的描述信息,详情见下
值 | 默认优先级 | 解释 | 序列本体论 |
---|---|---|---|
exonic | 1 | variant overlaps a coding | exon_variant (SO:0001791) |
splicing | 1 | variant is within 2-bp of a splicing junction (use -splicing_threshold to change this) | splicing_variant (SO:0001568) |
ncRNA | 2 | variant overlaps a transcript without coding annotation in the gene definition (see Notes below for more explanation) | non_coding_transcript_variant (SO:0001619) |
UTR5 | 3 | variant overlaps a 5’ untranslated region | 5_prime_UTR_variant (SO:0001623) |
UTR3 | 3 | variant overlaps a 3’ untranslated region | 3_prime_UTR_variant (SO:0001624) |
intronic | 4 | variant overlaps an intron | intron_variant (SO:0001627) |
upstream | 5 | variant overlaps 1-kb region upstream of transcription start site | upstream_gene_variant (SO:0001631) |
downstream | 5 | variant overlaps 1-kb region downtream of transcription end site (use -neargene to change this) | downstream_gene_variant (SO:0001632) |
intergenic | 6 | variant is in intergenic region | intergenic_variant (SO:0001628) |
第二个输出文件以.exonic_variant_function结尾,只列出外显子(氨基酸会改变)的变异
第一列为第一个文件中该变异所在的行号;
第二列为该变异的功能性后果,如外现在改变导致的氨基酸变化,阅读框移码等,详情见下
第三列为基因名称,转录识别标志和相应的转录本的序列变化
第四列为原输入文件内容
# ex1.exonic_variant_function
# 为方便了解每一列的含义,只输出第一行,然后把制表符改成换行符
[kaiwang@biocluster ~/]$ cat ex1.exonic_variant_function | head -n 1 | tr '\t' '\n'
line9
nonsynonymous SNV
IL23R:NM_144701:exon9:c.G1142A:p.R381Q,
1
67705958
67705958
G
A
comments: rs11209026 (R381Q), a SNP in IL23R associated with Crohn's disease
注释 | 优先级 | 解释 | 序列本体论 |
---|---|---|---|
frameshift insertion | 1 | an insertion of one or more nucleotides that cause frameshift changes in protein coding sequence | frameshift_elongation (SO:0001909) |
frameshift deletion | 2 | a deletion of one or more nucleotides that cause frameshift changes in protein coding sequence | frameshift_truncation (SO:0001910) |
frameshift block substitution | 3 | a block substitution of one or more nucleotides that cause frameshift changes in protein coding sequence | frameshift_variant (SO:0001589) |
stopgain | 4 | a nonsynonymous SNV, frameshift insertion/deletion, nonframeshift insertion/deletion or block substitution that lead to the immediate creation of stop codon at the variant site. For frameshift mutations, the creation of stop codon downstream of the variant will not be counted as “stopgain”! | stop_gained (SO:0001587) |
stoploss | 5 | a nonsynonymous SNV, frameshift insertion/deletion, nonframeshift insertion/deletion or block substitution that lead to the immediate elimination of stop codon at the variant site | stop_lost (SO:0001578) |
nonframeshift insertion | 6 | an insertion of 3 or multiples of 3 nucleotides that do not cause frameshift changes in protein coding sequence | inframe_insertion (SO:0001821) |
nonframeshift deletion | 7 | a deletion of 3 or mutliples of 3 nucleotides that do not cause frameshift changes in protein coding sequence | inframe_deletion (SO:0001822) |
nonframeshift block substitution | 8 | a block substitution of one or more nucleotides that do not cause frameshift changes in protein coding sequence | inframe_variant (SO:0001650) |
nonsynonymous SNV | 9 | a single nucleotide change that cause an amino acid change | missense_variant (SO:0001583) |
synonymous SNV | 10 | a single nucleotide change that does not cause an amino acid change | synonymous_variant (SO:0001819) |
unknown | 11 | unknown function (due to various errors in the gene structure definition in the database file) | sequence_variant (SO:0001060) |
Region-based annotation
其与Gene-based annotation作用相反,它是用来确认在特定区域的突变造成的影响。比如在44个物种的保守基因区域,预测的转录因子结合区域,基因重复区域,GWAS分析区域,基因突变数据库,表观组学位点等。区别于filter-based annotation
此处以Conserved genomic elements annotation为例介绍region-based annotation的使用:
命令示例:
#数据库下载
[kaiwang@biocluster ~/]$ annotate_variation.pl -build hg19 -downdb phastConsElements46way humandb/
NOTICE: Web-based checking to see whether ANNOVAR new version is available ... Done
NOTICE: Downloading annotation database http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/phastConsElements46way.txt.gz ... OK
NOTICE: Uncompressing downloaded files
NOTICE: Finished downloading annotation files for hg19 build version, with files saved at the 'humandb' directory
#使用下载数据库进行注释
[kaiwang@biocluster ~/]$ annotate_variation.pl -regionanno -build hg19 -out ex1 -dbtype phastConsElements46way example/ex1.avinput humandb/
NOTICE: Reading annotation database humandb/hg19_phastConsElements46way.txt ... Done with 5163775 regions
NOTICE: Finished region-based annotation on 12 genetic variants in ex1.hg19.avinput
NOTICE: Output files were written to ex1.hg19_phastConsElements46way
# -regionanno 表示使用基于区域的注释
# -dbtype phastConsElements46way 表示使用"phastConsElements46way"数据库,注意需要使用Region-based的数据库
#输出文件
[kaiwang@biocluster ~/]$ cat ex1.hg19_phastConsElements46way
phastConsElements46way Score=387;Name=lod=50 1 67705958 67705958 G A comments: rs11209026 (R381Q), a SNP in IL23R associated with Crohn's disease
phastConsElements46way Score=420;Name=lod=68 16 50756540 50756540 G C comments: rs2066845 (G908R), a non-synonymous SNP in NOD2
phastConsElements46way Score=385;Name=lod=49 16 50763778 50763778 - C comments: rs2066847 (c.3016_3017insC), a frameshift SNP in NOD2
phastConsElements46way Score=395;Name=lod=54 13 20763686 20763686 G - comments: rs1801002 (del35G), a frameshift mutation in GJB2, associated with hearing loss
phastConsElements46way Score=545;Name=lod=218 13 20797176 21105944 0 - comments: a 342kb deletion encompassing GJB6, associated with hearing loss
输出的注释文件第1列为“phastConsElements46way”,对应注释的类型,这里的phastCons 46-way alignments属于保守的基因组区域的注释;
第二列包含评分和名称,评分来自UCSC,可以使用—score_threshold和—normscore_threshold来过滤评分低的变异,“Name=lod=x”名称表示该区域的名称;
剩余的部分为输入文件的内容。
Filter-based annotation
Filter-based annotation是用以确认已记录在特定数据库里的突变。例如想要知道突变是否为novel variation就需要知道该突变是否存在于dbSNP库里,它在1000 genome project里面等位基因频率怎样,以及计算一系列突变项目得分并加以过滤。它区别于region-based annotation就在于它针对突变碱基进行工作,而region-based annotation 针对染色体位置。举例来说就是region-based比对chr1:1000-1000而filter-based比对chr1:1000-1000上的A->G。
它拥有多种数据库,包括针对全基因组测序的突变频率,针对全外显子数据测序的突变频率,在孤立或者小类群人群中的突变频率,全基因组数据突变的功能预测,全外显子组突变的功能预测,剪切变异体的功能预测,疾病相关突变,突变确认等,如下:
下面给大家介绍常用的两种过滤注释
第一种: 1000 Genomes Project annotations
命令示例:
[kaiwang@biocluster ~/]$ annotate_variation.pl -filter -dbtype 1000g2012apr_eur -buildver hg19 -out ex1 example/ex1.avinput humandb/
NOTICE: Variants matching filtering criteria are written to ex1.hg19_EUR.sites.2012_04_dropped, other variants are written to ex1.hg19_EUR.sites.2012_04_filtered
NOTICE: Processing next batch with 15 unique variants in 15 input lines
NOTICE: Database index loaded. Total number of bins is 2766067 and the number of bins to be scanned is 12
NOTICE: Scanning filter database humandb/hg19_EUR.sites.2012_04.txt...Done
[kaiwang@biocluster ~/]$ cat ex1.hg19_EUR.sites.2012_04_dropped
1000g2012apr_eur 0.04 1 1404001 1404001 G T comments: rs149123833, a SNP in 3' UTR of ATAD3C
1000g2012apr_eur 0.87 1 162736463 162736463 C T comments: rs1000050, a SNP in Illumina SNP arrays
1000g2012apr_eur 0.81 1 5935162 5935162 A T comments: rs1287637, a splice site variant in NPHP4
1000g2012apr_eur 0.06 1 67705958 67705958 G A comments: rs11209026 (R381Q), a SNP in IL23R associated with Crohn's disease
1000g2012apr_eur 0.54 1 84875173 84875173 C T comments: rs6576700 or SNP_A-1780419, a SNP in Affymetrix SNP arrays
1000g2012apr_eur 0.96 1 948921 948921 T C comments: rs15842, a SNP in 5' UTR of ISG15
1000g2012apr_eur 0.05 16 50745926 50745926 C T comments: rs2066844 (R702W), a non-synonymous SNP in NOD2
1000g2012apr_eur 0.01 16 50756540 50756540 G C comments: rs2066845 (G908R), a non-synonymous SNP in NOD2
1000g2012apr_eur 0.01 16 50763778 50763778 - C comments: rs2066847 (c.3016_3017insC), a frameshift SNP in NOD2
1000g2012apr_eur 0.53 2 234183368 234183368 A G comments: rs2241880 (T300A), a SNP in the ATG16L1 associated with Crohn's disease
$ annotate_variation.pl -filter -dbtype 1000g2012apr_eur -buildver hg19 -out ex1 example/ex1.avinput humandb/
# -filter 使用基于过滤的注释
# -dbtype 1000g2012apr_eur 使用"1000g2012apr_eur"数据库
该注释使用2012年4月欧洲发布1000基因组计划数据库,输出文件会有两个,drfiltered file
#dropped file
[kaiwang@biocluster ~/]$ cat ex1.hg19_EUR.sites.2012_04_dropped
1000g2012apr_eur 0.04 1 1404001 1404001 G T comments: rs149123833, a SNP in 3' UTR of ATAD3C
1000g2012apr_eur 0.87 1 162736463 162736463 C T comments: rs1000050, a SNP in Illumina SNP arrays
1000g2012apr_eur 0.81 1 5935162 5935162 A T comments: rs1287637, a splice site variant in NPHP4
1000g2012apr_eur 0.06 1 67705958 67705958 G A comments: rs11209026 (R381Q), a SNP in IL23R associated with Crohn's disease
1000g2012apr_eur 0.54 1 84875173 84875173 C T comments: rs6576700 or SNP_A-1780419, a SNP in Affymetrix SNP arrays
1000g2012apr_eur 0.96 1 948921 948921 T C comments: rs15842, a SNP in 5' UTR of ISG15
1000g2012apr_eur 0.05 16 50745926 50745926 C T comments: rs2066844 (R702W), a non-synonymous SNP in NOD2
1000g2012apr_eur 0.01 16 50756540 50756540 G C comments: rs2066845 (G908R), a non-synonymous SNP in NOD2
1000g2012apr_eur 0.01 16 50763778 50763778 - C comments: rs2066847 (c.3016_3017insC), a frameshift SNP in NOD2
1000g2012apr_eur 0.53 2 234183368 234183368 A G comments: rs2241880 (T300A), a SNP in the ATG16L1 associated with Crohn's disease
#*dropped文件
第一列如region-based注释的结果一样以数据库命名;
第二列为等位基因频率,我们可以用-maf 0.05参数来过滤掉低于0.05的变异;
第三列开始同样是输入文件的内容。
#需要注意的是,我们也可以使用-maf 0.05 -reverse过滤掉高于0.05的变异;但是过滤ALT等位基因的频率,我们更提倡使用-score_threshold参数。
第二种: dbSNP annotations
通过dbsnp annotation, annovar可以确认已经出现在dbSNP数据库里面的突变并且注释SNP identifiers
命令如下:
#下载dbsnpp138数据库
[kaiwang@biocluster ~/]$ annotate_variation.pl -downdb -buildver hg19 -webfrom annovar snp138 humandb
NOTICE: Web-based checking to see whether ANNOVAR new version is available ... Done
NOTICE: Downloading annotation database http://www.openbioinformatics.org/annovar/download/hg19_snp138.txt.gz ... OK
NOTICE: Downloading annotation database http://www.openbioinformatics.org/annovar/download/hg19_snp138.txt.idx.gz ... OK
NOTICE: Uncompressing downloaded files NOTICE: Finished downloading annotation files for hg18 build version, with files saved at the 'humandb' directory
#使用dbsnp138注释
[kaiwang@biocluster ~/]$ annotate_variation.pl -filter -out ex1 -build hg19 -dbtype snp138 example/ex1.avinput humandb/
NOTICE: Variants matching filtering criteria are written to ex1.hg19_snp138_dropped, other variants are written to ex1.hg19_snp138_filtered
NOTICE: Processing next batch with 15 unique variants in 15 input lines
NOTICE: Database index loaded. Total number of bins is 2858459 and the number of bins to be scanned is 12
NOTICE: Scanning filter database humandb/hg19_snp138.txt...Done
#输入dropped file
[kaiwang@biocluster ~/]$ cat ex1.hg19_snp138_dropped
snp138 rs35561142 1 11403596 11403596 - AT comments: rs35561142, a 2-bp insertion
snp138 rs149123833 1 1404001 1404001 G T comments: rs149123833, a SNP in 3' UTR of ATAD3C
snp138 rs1000050 1 162736463 162736463 C T comments: rs1000050, a SNP in Illumina SNP arrays
snp138 rs1287637 1 5935162 5935162 A T comments: rs1287637, a splice site variant in NPHP4
snp138 rs11209026 1 67705958 67705958 G A comments: rs11209026 (R381Q), a SNP in IL23R associated with Crohn's disease
snp138 rs6576700 1 84875173 84875173 C T comments: rs6576700 or SNP_A-1780419, a SNP in Affymetrix SNP arrays
snp138 rs15842 1 948921 948921 T C comments: rs15842, a SNP in 5' UTR of ISG15
snp138 rs80338939 13 20763686 20763686 G - comments: rs1801002 (del35G), a frameshift mutation in GJB2, associated with hearing loss
snp138 rs2066844 16 50745926 50745926 C T comments: rs2066844 (R702W), a non-synonymous SNP in NOD2
snp138 rs2066845 16 50756540 50756540 G C comments: rs2066845 (G908R), a non-synonymous SNP in NOD2
snp138 rs2066847 16 50763778 50763778 - C comments: rs2066847 (c.3016_3017insC), a frameshift SNP in NOD2
snp138 rs2241880 2 234183368 234183368 A G comments: rs2241880 (T300A), a SNP in the ATG16L1 associated with Crohn's disease
#*dropped文件
第一列如region-based注释的结果一样以数据库命名;
第二列为已经在数据库的突变的indentifier号;
第三列开始同样是输入文件的内容。
该注释使用2012年4月欧洲发布1000基因组计划数据库,输出文件会有两个, filtered file和droppend file
filtered file里面包含不在过滤数据库内的突变
当然ANNOVAR还有很多功能,我在这里只赘述我自己用到的一些方面,有兴趣的也可以自己去探究
参考文献:
ANNOVAR 注释软件:
ANNOVAR:
徐洲更的学习笔记
上面部分生信媛小伙伴Chevy所写的ANNOVAR使用教程。为了证明这个教程是多么靠谱,我看完之后,立刻拿出我的variant calling结果练习。
前期准备
构建拟南芥注释数据库
当我看完之后,我发现了一个巨大的问题,怎么都是人类,果蝇,小鼠,我可爱的拟南芥哪里去了!!!在官方文档的FAQ(How to handle E. coli, Arabidopsis thaliana and other genomes not in UCSC?)中提供了gene-based annotation构建注释数据库的方式。
手动构建数据库需要两个文件:
refGene: 很多物种可以直接从UCSC下载,但是我可爱的拟南芥还是没有,所以我需要用UCSC提供的软件转换GFF3格式
wget http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/gff3ToGenePred chomod 770 gff3ToGenePred mv gff3ToGenePred ~/biosoft cd ~biosoft/annovar/ mkdir AthalianaDB ~/biosoft/gff3ToGenePred /path/to/Arabidopsis_thaliana.TAIR10.35.gff3 AthalianaDB/AT_refGene.txt
FASTA: 所有转录本的序列。使用retrieve_seq_from_fasta.pl根据refGene从基因组中提取转录本序列
perl retrieve_seq_from_fasta.pl --format refGene --seqfile $work/genome/A.thalina/TAIR10_chr_all.fa AthalianaDB/AT_refGene.txt AT_refGeneMrna.fa
转换vcf格式
由于要求输入文件格式为aviput,所以要用convert2annovar.pl
进行格式转化
~/biosoft/annovar/convert2annovar.pl -format vcf4 BC_bg_reads_bwa_raw_variants.vcf > BC_bg_reads_bwa_raw_variants.avinput
注释
前期工作已经,完成最后用### annotate_variation.pl
华丽的完成任务吧。
~/biosoft/annovar/annotate_variation.pl -buildver AT -geneanno BC_bg_reads_bwa_raw_variants.avinput ~/biosoft/annovar/AthalianaDB/
入门已经完成,后面就是不断重复,然后查询官方文档,熟练技术了。