宏基因组序列物种分类之kraken 1/2和Bracken的使用

宏基因组 2022-05-08

The following article is from 基因的生物信息学分析 Author 小飞侠fly

细菌基因组测序完，想看看样本有没有被其他的菌污染? 人的转录组测序完，想快速看看人、微生物的序列的比例? 元/宏基因组测序完，想快速获得样本中物种的丰度信息？

REFERENCE

Wood DE, Salzberg SL: Kraken: ultrafast metagenomic sequence classification using exact alignments.Genome Biology 2014, 15:R46.

Kraken 1

Kraken 1是2014年Wood DE在Genome Biology发表的宏基因组序列分类软件，能够快速对宏基因样品中的reads进行分类。

Kraken在序列比对环节基于精确k-mer匹配和精简数据库的方法，采取精确匹配，其核心是Kraken有一种特殊数据库，用以预先计算序列中包含的特殊的Kmer序列。

下面是来自kraken官网关于各分类器的测评结果：

Kraken速度很快，精度较低，适用于做微生物检测的预处理。通过一些实际的数据测试发现：与Metaphlan2相比，Kraken速度较快，获得的物种数目较多，相对应的假阳性率也较高。

Usage: kraken [options] <filename(s)>
Options:
--db NAME Name for Kraken DB
(default: none)
--threads NUM Number of threads (default: 1)
--fasta-input Input is FASTA format
--fastq-input Input is FASTQ format
--fastq-output Output in FASTQ format
--gzip-compressed Input is gzip compressed
--bzip2-compressed Input is bzip2 compressed
--quick Quick operation (use first hit or hits)
--min-hits NUM In quick op., number of hits req'd for classification
NOTE: this is ignored if --quick is not specified
--unclassified-out FILENAME
Print unclassified sequences to filename
--classified-out FILENAME
Print classified sequences to filename
--out-fmt FORMAT Format for [un]classified sequence output. supported
options are: {legacy, paired, interleaved}
--output FILENAME Print output to filename (default: stdout); "-" will
suppress normal output
--only-classified-output
Print no Kraken output for unclassified sequences
--preload Loads DB into memory before classification
--paired The two filenames provided are paired-end reads
--check-names Ensure each pair of reads have names that agree
with each other; ignored if --paired is not specified
--help Print this message
--version Print version information
If none of the *-input or *-compressed flags are specified, and the
file is a regular file, automatic format detection is attempted.

$ kraken --threads 40 --db minikraken_20171013_4GB --preload --
paired --fastq-input --gzip-compressed ${B}_1.fastq.gz ${B}_2.fastq.gz | kraken-report --db minikraken_20171013_4GB > "$B"_kraken.tab

$ less -S "$B"_kraken.tab
50.30 27375076 27375076 U 0 unclassified
49.70 27047187 13633 - 1 root
49.35 26857009 348 - 131567 cellular organisms
49.35 26855582 105532 D 2 Bacteria
33.57 18270135 0 - 1783270 FCB group
33.57 18269977 5 - 68336 Bacteroidetes/Chlorobi group
33.57 18269761 107058 P 976 Bacteroidetes
33.34 18144208 69 C 200643 Bacteroidia
33.34 18144105 961382 O 171549 Bacteroidales
30.47 16580771 0 F 815 Bacteroidaceae
30.47 16580771 2380340 G 816 Bacteroides
22.89 12454849 9623614 S 821 Bacteroides vulgatus
5.20 2831235 2831235 - 435590 Bacteroides vulgatus ATCC 8482
0.73 396849 0 S 357276 Bacteroides dorei
0.73 396849 396849 - 997877 Bacteroides dorei CL03T12C01
0.60 326487 326471 S 28116 Bacteroides ovatus
0.00 16 16 - 1379690 Bacteroides ovatus V975
0.42 227354 141795 S 818 Bacteroides thetaiotaomicron
0.16 85559 85559 - 226186 Bacteroides thetaiotaomicron VPI-5482
0.38 206113 206113 S 1796613 Bacteroides caecimuris
0.33 180286 180286 S 47678 Bacteroides caccae
0.29 156976 68554 S 817 Bacteroides fragilis
0.10 55847 55847 - 862962 Bacteroides fragilis 638R
0.03 17859 17859 - 295405 Bacteroides fragilis YCH46
0.03 14716 14716 - 272559 Bacteroides fragilis NCTC 9343

如上是Kraken的结果，可以看出它没有估算出物种的丰度。

Bracken

这时可以使用另一款软件Bracken (Bayesian Reestimation of Abundance with KrakEN)，它是一种从宏基因组数据中高度准确的计算物种丰度的统计方法。

$ bracken -h
Usage: bracken -d MY_DB -i INPUT -o OUTPUT -r READ_LEN -l LEVEL -t THRESHOLD
MY_DB location of Kraken database
INPUT Kraken REPORT file to use for abundance estimation
OUTPUT file name for Bracken default output
READ_LEN read length to get all classifications for (default: 100)
LEVEL level to estimate abundance at [options: D,P,C,O,F,G,S] (default: S)
THRESHOLD number of reads required PRIOR to abundance estimation to perform reestimation (default: 0)
$ bracken -d minikraken_20171013_4GB/ -i $B\_kraken.tab -t 10 -o $B.out
#获得如下结果：
name taxonomy_id taxonomy_lvl kraken_assigned_reads added_reads new_est_reads fraction_total_reads
Uncultured phage WW-nAnB strain 2 1449896 S 388 113 501 0.00007
Uncultured phage WW-nAnB strain 3 1449897 S 349 131 480 0.00007
Aureimonas sp. AU20 1349819 S 108 39 147 0.00002
Phaeobacter piscinae 1580596 S 21 4 25 0.00000
Sinorhizobium sp. CCBAU 05631 794846 S 25 12 37 0.00001
Mucilaginibacter sp. BJC16-A31 1234841 S 31 0 31 0.00000
Arcanobacterium phocae 131112 S 34 0 34 0.00000
Kineococcus radiotolerans 131568 S 82 3 85 0.00001
Actinomyces radingae 131110 S 314 5 319 0.00005
Sediminicola sp. YIK13 1453352 S 19 0 19 0.00000
Methylotenera mobilis 359408 S 42 1 43 0.00001
Stenotrophomonas rhizophila 216778 S 76 52 128 0.00002
Acholeplasma oculi 35623 S 12 0 12 0.00000
Dictyoglomus turgidum 513050 S 12 0 12 0.00000
Chelatococcus sp.

NOTE: Kraken 2 is the newest version of Kraken (See Kraken 2's Webpage for details). Kraken 1 will continue to be available via the Kraken 1 Github page, but it is no longer being supported.

Kraken 2

与Kraken 1相比，Kraken 2有了很大的改进：

更快速的构建数据库
数据库的占用存储空间更少
更快的分类速度

还能支持 16S Databases包括Greengenes, SILVA,和 RDP。

$ kraken2
Need to specify input filenames!
Usage: kraken2 [options] <filename(s)>
Options:
--db NAME Name for Kraken 2 DB
(default: none)
--threads NUM Number of threads (default: 1)
--quick Quick operation (use first hit or hits)
--unclassified-out FILENAME
Print unclassified sequences to filename
--classified-out FILENAME
Print classified sequences to filename
--output FILENAME Print output to filename (default: stdout); "-" will
suppress normal output
--confidence FLOAT Confidence score threshold (default: 0.0); must be
in [0, 1].
--minimum-base-quality NUM
Minimum base quality used in classification (def: 0,
only effective with FASTQ input).
--report FILENAME Print a report with aggregrate counts/clade to file
--use-mpa-style With --report, format report output like Kraken 1's
kraken-mpa-report
--report-zero-counts With --report, report counts for ALL taxa, even if
counts are zero
--memory-mapping Avoids loading database into RAM
--paired The filenames provided have paired-end reads
--use-names Print scientific names instead of just taxids
--gzip-compressed Input files are compressed with gzip
--bzip2-compressed Input files are compressed with bzip2
--help Print this message
--version Print version information
If none of the *-compressed flags are specified, and the filename provided
is a regular file, automatic format detection is attempted.

$ kraken2\
--db minikraken2_v2_8GB_201904_UPDATE/ \
--threads 20 \
--report report \
--gzip-compressed --paired \
${B}_1.fastq.gz ${B}_2.fastq.gz

bracken也支持kraken2的结果。

系列教程：微生物组入门 Biostar 微生物组宏基因组

专业技能：学术图表高分文章生信宝典不可或缺的人

一文读懂：宏基因组寄生虫益处进化树

必备技能：提问搜索 Endnote

文献阅读热心肠 SemanticScholar Geenmedical

扩增子分析：图表解读分析流程统计绘图

16S功能预测 PICRUSt FAPROTAX Bugbase Tax4Fun

在线工具：16S预测培养基生信绘图

科研经验：云笔记云协作公众号

编程模板: Shell R Perl

生物科普: 肠道细菌人体上的生命生命大跃进细胞暗战人体奥秘

写在后面

为鼓励读者交流、快速解决科研困难，我们建立了“宏基因组”专业讨论群，目前己有国内外5000+ 一线科研人员加入。参与讨论，获得专业解答，欢迎分享此文至朋友圈，并扫码加主编好友带你入群，务必备注“姓名-单位-研究方向-职称/年级”。PI请明示身份，另有海内外微生物相关PI群供大佬合作交流。技术问题寻求帮助，首先阅读《如何优雅的提问》学习解决问题思路，仍未解决群内讨论，问题不私聊，帮助同行。

学习16S扩增子、宏基因组科研思路和分析实战，关注“宏基因组”

点击阅读原文，跳转最新文章目录阅读

观察｜官方通报陕西蒲城一职校学生坠亡：事发前与舍友发生口角和肢体冲突认定该生系高空坠落死亡

桐城一派｜倒在“跨年夜”的龚书记，13个字换来免职调查冤不冤？

比佟丽娅还恋爱脑，怀孕7次流产4次，目睹丈夫背叛却选择原谅

市管干部“龚书记”免职迷局

讣告！又一知名女星在家中去世，终年54岁，曾是无数人白月光…

宏基因组序列物种分类之kraken 1/2和Bracken的使用

REFERENCE

Kraken 1

Bracken

Kraken 2

猜你喜欢

写在后面

您可能也对以下帖子感兴趣

观察｜官方通报陕西蒲城一职校学生坠亡：事发前与舍友发生口角和肢体冲突 认定该生系高空坠落死亡

桐城一派｜倒在“跨年夜”的龚书记，13个字换来免职调查冤不冤？

比佟丽娅还恋爱脑，怀孕7次流产4次，目睹丈夫背叛却选择原谅

市管干部“龚书记”免职迷局

讣告！又一知名女星在家中去世，终年54岁，曾是无数人白月光…

生成图片，分享到微信朋友圈

宏基因组序列物种分类之kraken 1/2和Bracken的使用

REFERENCE

Kraken 1

Bracken

Kraken 2

猜你喜欢

写在后面

您可能也对以下帖子感兴趣

观察｜官方通报陕西蒲城一职校学生坠亡：事发前与舍友发生口角和肢体冲突认定该生系高空坠落死亡