宏基因组序列物种分类之kraken 1/2和Bracken的使用
The following article is from 基因的生物信息学分析 Author 小飞侠fly
细菌基因组测序完,想看看样本有没有被其他的菌污染? 人的转录组测序完,想快速看看人、微生物的序列的比例? 元/宏基因组测序完,想快速获得样本中物种的丰度信息?
REFERENCE
Wood DE, Salzberg SL: Kraken: ultrafast metagenomic sequence classification using exact alignments.Genome Biology 2014, 15:R46.
Kraken 1
Kraken 1是2014年Wood DE在Genome Biology发表的宏基因组序列分类软件,能够快速对宏基因样品中的reads进行分类。
Kraken在序列比对环节基于精确k-mer匹配和精简数据库的方法,采取精确匹配,其核心是Kraken有一种特殊数据库,用以预先计算序列中包含的特殊的Kmer序列。
下面是来自kraken官网关于各分类器的测评结果:
Kraken速度很快,精度较低,适用于做微生物检测的预处理。通过一些实际的数据测试发现:与Metaphlan2相比,Kraken速度较快,获得的物种数目较多,相对应的假阳性率也较高。
Usage: kraken [options] <filename(s)>
Options:
--db NAME Name for Kraken DB
(default: none)
--threads NUM Number of threads (default: 1)
--fasta-input Input is FASTA format
--fastq-input Input is FASTQ format
--fastq-output Output in FASTQ format
--gzip-compressed Input is gzip compressed
--bzip2-compressed Input is bzip2 compressed
--quick Quick operation (use first hit or hits)
--min-hits NUM In quick op., number of hits req'd for classification
NOTE: this is ignored if --quick is not specified
--unclassified-out FILENAME
Print unclassified sequences to filename
--classified-out FILENAME
Print classified sequences to filename
--out-fmt FORMAT Format for [un]classified sequence output. supported
options are: {legacy, paired, interleaved}
--output FILENAME Print output to filename (default: stdout); "-" will
suppress normal output
--only-classified-output
Print no Kraken output for unclassified sequences
--preload Loads DB into memory before classification
--paired The two filenames provided are paired-end reads
--check-names Ensure each pair of reads have names that agree
with each other; ignored if --paired is not specified
--help Print this message
--version Print version information
If none of the *-input or *-compressed flags are specified, and the
file is a regular file, automatic format detection is attempted.
$ kraken --threads 40 --db minikraken_20171013_4GB --preload --
paired --fastq-input --gzip-compressed ${B}_1.fastq.gz ${B}_2.fastq.gz | kraken-report --db minikraken_20171013_4GB > "$B"_kraken.tab
$ less -S "$B"_kraken.tab
50.30 27375076 27375076 U 0 unclassified
49.70 27047187 13633 - 1 root
49.35 26857009 348 - 131567 cellular organisms
49.35 26855582 105532 D 2 Bacteria
33.57 18270135 0 - 1783270 FCB group
33.57 18269977 5 - 68336 Bacteroidetes/Chlorobi group
33.57 18269761 107058 P 976 Bacteroidetes
33.34 18144208 69 C 200643 Bacteroidia
33.34 18144105 961382 O 171549 Bacteroidales
30.47 16580771 0 F 815 Bacteroidaceae
30.47 16580771 2380340 G 816 Bacteroides
22.89 12454849 9623614 S 821 Bacteroides vulgatus
5.20 2831235 2831235 - 435590 Bacteroides vulgatus ATCC 8482
0.73 396849 0 S 357276 Bacteroides dorei
0.73 396849 396849 - 997877 Bacteroides dorei CL03T12C01
0.60 326487 326471 S 28116 Bacteroides ovatus
0.00 16 16 - 1379690 Bacteroides ovatus V975
0.42 227354 141795 S 818 Bacteroides thetaiotaomicron
0.16 85559 85559 - 226186 Bacteroides thetaiotaomicron VPI-5482
0.38 206113 206113 S 1796613 Bacteroides caecimuris
0.33 180286 180286 S 47678 Bacteroides caccae
0.29 156976 68554 S 817 Bacteroides fragilis
0.10 55847 55847 - 862962 Bacteroides fragilis 638R
0.03 17859 17859 - 295405 Bacteroides fragilis YCH46
0.03 14716 14716 - 272559 Bacteroides fragilis NCTC 9343
如上是Kraken的结果,可以看出它没有估算出物种的丰度。
Bracken
这时可以使用另一款软件Bracken (Bayesian Reestimation of Abundance with KrakEN),它是一种从宏基因组数据中高度准确的计算物种丰度的统计方法。
$ bracken -h
Usage: bracken -d MY_DB -i INPUT -o OUTPUT -r READ_LEN -l LEVEL -t THRESHOLD
MY_DB location of Kraken database
INPUT Kraken REPORT file to use for abundance estimation
OUTPUT file name for Bracken default output
READ_LEN read length to get all classifications for (default: 100)
LEVEL level to estimate abundance at [options: D,P,C,O,F,G,S] (default: S)
THRESHOLD number of reads required PRIOR to abundance estimation to perform reestimation (default: 0)
$ bracken -d minikraken_20171013_4GB/ -i $B\_kraken.tab -t 10 -o $B.out
#获得如下结果:
name taxonomy_id taxonomy_lvl kraken_assigned_reads added_reads new_est_reads fraction_total_reads
Uncultured phage WW-nAnB strain 2 1449896 S 388 113 501 0.00007
Uncultured phage WW-nAnB strain 3 1449897 S 349 131 480 0.00007
Aureimonas sp. AU20 1349819 S 108 39 147 0.00002
Phaeobacter piscinae 1580596 S 21 4 25 0.00000
Sinorhizobium sp. CCBAU 05631 794846 S 25 12 37 0.00001
Mucilaginibacter sp. BJC16-A31 1234841 S 31 0 31 0.00000
Arcanobacterium phocae 131112 S 34 0 34 0.00000
Kineococcus radiotolerans 131568 S 82 3 85 0.00001
Actinomyces radingae 131110 S 314 5 319 0.00005
Sediminicola sp. YIK13 1453352 S 19 0 19 0.00000
Methylotenera mobilis 359408 S 42 1 43 0.00001
Stenotrophomonas rhizophila 216778 S 76 52 128 0.00002
Acholeplasma oculi 35623 S 12 0 12 0.00000
Dictyoglomus turgidum 513050 S 12 0 12 0.00000
Chelatococcus sp.
NOTE: Kraken 2 is the newest version of Kraken (See Kraken 2's Webpage for details). Kraken 1 will continue to be available via the Kraken 1 Github page, but it is no longer being supported.
Kraken 2
与Kraken 1相比,Kraken 2有了很大的改进:
更快速的构建数据库
数据库的占用存储空间更少
更快的分类速度
还能支持 16S Databases包括Greengenes, SILVA,和 RDP。
$ kraken2
Need to specify input filenames!
Usage: kraken2 [options] <filename(s)>
Options:
--db NAME Name for Kraken 2 DB
(default: none)
--threads NUM Number of threads (default: 1)
--quick Quick operation (use first hit or hits)
--unclassified-out FILENAME
Print unclassified sequences to filename
--classified-out FILENAME
Print classified sequences to filename
--output FILENAME Print output to filename (default: stdout); "-" will
suppress normal output
--confidence FLOAT Confidence score threshold (default: 0.0); must be
in [0, 1].
--minimum-base-quality NUM
Minimum base quality used in classification (def: 0,
only effective with FASTQ input).
--report FILENAME Print a report with aggregrate counts/clade to file
--use-mpa-style With --report, format report output like Kraken 1's
kraken-mpa-report
--report-zero-counts With --report, report counts for ALL taxa, even if
counts are zero
--memory-mapping Avoids loading database into RAM
--paired The filenames provided have paired-end reads
--use-names Print scientific names instead of just taxids
--gzip-compressed Input files are compressed with gzip
--bzip2-compressed Input files are compressed with bzip2
--help Print this message
--version Print version information
If none of the *-compressed flags are specified, and the filename provided
is a regular file, automatic format detection is attempted.
$ kraken2\
--db minikraken2_v2_8GB_201904_UPDATE/ \
--threads 20 \
--report report \
--gzip-compressed --paired \
${B}_1.fastq.gz ${B}_2.fastq.gz
bracken也支持kraken2的结果。
相关阅读
猜你喜欢
10000+:菌群分析 宝宝与猫狗 梅毒狂想曲 提DNA发Nature Cell专刊 肠道指挥大脑
文献阅读 热心肠 SemanticScholar Geenmedical
16S功能预测 PICRUSt FAPROTAX Bugbase Tax4Fun
生物科普: 肠道细菌 人体上的生命 生命大跃进 细胞暗战 人体奥秘
写在后面
为鼓励读者交流、快速解决科研困难,我们建立了“宏基因组”专业讨论群,目前己有国内外5000+ 一线科研人员加入。参与讨论,获得专业解答,欢迎分享此文至朋友圈,并扫码加主编好友带你入群,务必备注“姓名-单位-研究方向-职称/年级”。PI请明示身份,另有海内外微生物相关PI群供大佬合作交流。技术问题寻求帮助,首先阅读《如何优雅的提问》学习解决问题思路,仍未解决群内讨论,问题不私聊,帮助同行。
学习16S扩增子、宏基因组科研思路和分析实战,关注“宏基因组”