查看原文
其他

宏基因组序列物种分类之kraken 1/2和Bracken的使用

宏基因组 2022-05-08

The following article is from 基因的生物信息学分析 Author 小飞侠fly

细菌基因组测序完,想看看样本有没有被其他的菌污染? 人的转录组测序完,想快速看看人、微生物的序列的比例? 元/宏基因组测序完,想快速获得样本中物种的丰度信息?

REFERENCE

Wood DE, Salzberg SL: Kraken: ultrafast metagenomic sequence classification using exact alignments.Genome Biology 2014, 15:R46.

Kraken 1

Kraken 1是2014年Wood DE在Genome Biology发表的宏基因组序列分类软件,能够快速对宏基因样品中的reads进行分类。

Kraken在序列比对环节基于精确k-mer匹配和精简数据库的方法,采取精确匹配,其核心是Kraken有一种特殊数据库,用以预先计算序列中包含的特殊的Kmer序列。

下面是来自kraken官网关于各分类器的测评结果:

Kraken速度很快,精度较低,适用于做微生物检测的预处理。通过一些实际的数据测试发现:与Metaphlan2相比,Kraken速度较快,获得的物种数目较多,相对应的假阳性率也较高。

  1. Usage: kraken [options] <filename(s)>


  2. Options:

  3. --db NAME Name for Kraken DB

  4. (default: none)

  5. --threads NUM Number of threads (default: 1)

  6. --fasta-input Input is FASTA format

  7. --fastq-input Input is FASTQ format

  8. --fastq-output Output in FASTQ format

  9. --gzip-compressed Input is gzip compressed

  10. --bzip2-compressed Input is bzip2 compressed

  11. --quick Quick operation (use first hit or hits)

  12. --min-hits NUM In quick op., number of hits req'd for classification

  13. NOTE: this is ignored if --quick is not specified

  14. --unclassified-out FILENAME

  15. Print unclassified sequences to filename

  16. --classified-out FILENAME

  17. Print classified sequences to filename

  18. --out-fmt FORMAT Format for [un]classified sequence output. supported

  19. options are: {legacy, paired, interleaved}

  20. --output FILENAME Print output to filename (default: stdout); "-" will

  21. suppress normal output

  22. --only-classified-output

  23. Print no Kraken output for unclassified sequences

  24. --preload Loads DB into memory before classification

  25. --paired The two filenames provided are paired-end reads

  26. --check-names Ensure each pair of reads have names that agree

  27. with each other; ignored if --paired is not specified

  28. --help Print this message

  29. --version Print version information


  30. If none of the *-input or *-compressed flags are specified, and the

  31. file is a regular file, automatic format detection is attempted.

  1. $ kraken --threads 40 --db minikraken_20171013_4GB --preload --

  2. paired --fastq-input --gzip-compressed ${B}_1.fastq.gz ${B}_2.fastq.gz | kraken-report --db minikraken_20171013_4GB > "$B"_kraken.tab

  1. $ less -S "$B"_kraken.tab

  2. 50.30 27375076 27375076 U 0 unclassified

  3. 49.70 27047187 13633 - 1 root

  4. 49.35 26857009 348 - 131567 cellular organisms

  5. 49.35 26855582 105532 D 2 Bacteria

  6. 33.57 18270135 0 - 1783270 FCB group

  7. 33.57 18269977 5 - 68336 Bacteroidetes/Chlorobi group

  8. 33.57 18269761 107058 P 976 Bacteroidetes

  9. 33.34 18144208 69 C 200643 Bacteroidia

  10. 33.34 18144105 961382 O 171549 Bacteroidales

  11. 30.47 16580771 0 F 815 Bacteroidaceae

  12. 30.47 16580771 2380340 G 816 Bacteroides

  13. 22.89 12454849 9623614 S 821 Bacteroides vulgatus

  14. 5.20 2831235 2831235 - 435590 Bacteroides vulgatus ATCC 8482

  15. 0.73 396849 0 S 357276 Bacteroides dorei

  16. 0.73 396849 396849 - 997877 Bacteroides dorei CL03T12C01

  17. 0.60 326487 326471 S 28116 Bacteroides ovatus

  18. 0.00 16 16 - 1379690 Bacteroides ovatus V975

  19. 0.42 227354 141795 S 818 Bacteroides thetaiotaomicron

  20. 0.16 85559 85559 - 226186 Bacteroides thetaiotaomicron VPI-5482

  21. 0.38 206113 206113 S 1796613 Bacteroides caecimuris

  22. 0.33 180286 180286 S 47678 Bacteroides caccae

  23. 0.29 156976 68554 S 817 Bacteroides fragilis

  24. 0.10 55847 55847 - 862962 Bacteroides fragilis 638R

  25. 0.03 17859 17859 - 295405 Bacteroides fragilis YCH46

  26. 0.03 14716 14716 - 272559 Bacteroides fragilis NCTC 9343

如上是Kraken的结果,可以看出它没有估算出物种的丰度。 

Bracken

这时可以使用另一款软件Bracken (Bayesian Reestimation of Abundance with KrakEN),它是一种从宏基因组数据中高度准确的计算物种丰度的统计方法。

  1. $ bracken -h

  2. Usage: bracken -d MY_DB -i INPUT -o OUTPUT -r READ_LEN -l LEVEL -t THRESHOLD

  3. MY_DB location of Kraken database

  4. INPUT Kraken REPORT file to use for abundance estimation

  5. OUTPUT file name for Bracken default output

  6. READ_LEN read length to get all classifications for (default: 100)

  7. LEVEL level to estimate abundance at [options: D,P,C,O,F,G,S] (default: S)

  8. THRESHOLD number of reads required PRIOR to abundance estimation to perform reestimation (default: 0)

  9. $ bracken -d minikraken_20171013_4GB/ -i $B\_kraken.tab -t 10 -o $B.out

  10. #获得如下结果:

  11. name taxonomy_id taxonomy_lvl kraken_assigned_reads added_reads new_est_reads fraction_total_reads

  12. Uncultured phage WW-nAnB strain 2 1449896 S 388 113 501 0.00007

  13. Uncultured phage WW-nAnB strain 3 1449897 S 349 131 480 0.00007

  14. Aureimonas sp. AU20 1349819 S 108 39 147 0.00002

  15. Phaeobacter piscinae 1580596 S 21 4 25 0.00000

  16. Sinorhizobium sp. CCBAU 05631 794846 S 25 12 37 0.00001

  17. Mucilaginibacter sp. BJC16-A31 1234841 S 31 0 31 0.00000

  18. Arcanobacterium phocae 131112 S 34 0 34 0.00000

  19. Kineococcus radiotolerans 131568 S 82 3 85 0.00001

  20. Actinomyces radingae 131110 S 314 5 319 0.00005

  21. Sediminicola sp. YIK13 1453352 S 19 0 19 0.00000

  22. Methylotenera mobilis 359408 S 42 1 43 0.00001

  23. Stenotrophomonas rhizophila 216778 S 76 52 128 0.00002

  24. Acholeplasma oculi 35623 S 12 0 12 0.00000

  25. Dictyoglomus turgidum 513050 S 12 0 12 0.00000

  26. Chelatococcus sp.

NOTE: Kraken 2 is the newest version of Kraken (See Kraken 2's Webpage for details). Kraken 1 will continue to be available via the Kraken 1 Github page, but it is no longer being supported.

Kraken 2

与Kraken 1相比,Kraken 2有了很大的改进:

  1. 更快速的构建数据库

  2. 数据库的占用存储空间更少

  3. 更快的分类速度

还能支持 16S Databases包括Greengenes, SILVA,和 RDP。

  1. $ kraken2

  2. Need to specify input filenames!

  3. Usage: kraken2 [options] <filename(s)>


  4. Options:

  5. --db NAME Name for Kraken 2 DB

  6. (default: none)

  7. --threads NUM Number of threads (default: 1)

  8. --quick Quick operation (use first hit or hits)

  9. --unclassified-out FILENAME

  10. Print unclassified sequences to filename

  11. --classified-out FILENAME

  12. Print classified sequences to filename

  13. --output FILENAME Print output to filename (default: stdout); "-" will

  14. suppress normal output

  15. --confidence FLOAT Confidence score threshold (default: 0.0); must be

  16. in [0, 1].

  17. --minimum-base-quality NUM

  18. Minimum base quality used in classification (def: 0,

  19. only effective with FASTQ input).

  20. --report FILENAME Print a report with aggregrate counts/clade to file

  21. --use-mpa-style With --report, format report output like Kraken 1's

  22. kraken-mpa-report

  23. --report-zero-counts With --report, report counts for ALL taxa, even if

  24. counts are zero

  25. --memory-mapping Avoids loading database into RAM

  26. --paired The filenames provided have paired-end reads

  27. --use-names Print scientific names instead of just taxids

  28. --gzip-compressed Input files are compressed with gzip

  29. --bzip2-compressed Input files are compressed with bzip2

  30. --help Print this message

  31. --version Print version information


  32. If none of the *-compressed flags are specified, and the filename provided

  33. is a regular file, automatic format detection is attempted.

  1. $ kraken2\

  2. --db minikraken2_v2_8GB_201904_UPDATE/ \

  3. --threads 20 \

  4. --report report \

  5. --gzip-compressed --paired \

  6. ${B}_1.fastq.gz ${B}_2.fastq.gz

bracken也支持kraken2的结果。



关阅读

肠道菌群:16S测序分析流程解读

肠道菌群:宏基因组测序分析流程解读

肠道菌群:宏转录组测序分析流程解读

肠道菌群:宏病毒组测序分析流程解读

猜你喜欢

10000+:菌群分析 宝宝与猫狗 梅毒狂想曲 提DNA发Nature Cell专刊 肠道指挥大脑

系列教程:微生物组入门 Biostar 微生物组  宏基因组

专业技能:学术图表 高分文章 生信宝典 不可或缺的人

一文读懂:宏基因组 寄生虫益处 进化树

必备技能:提问 搜索  Endnote

文献阅读 热心肠 SemanticScholar Geenmedical

扩增子分析:图表解读 分析流程 统计绘图

16S功能预测   PICRUSt  FAPROTAX  Bugbase Tax4Fun

在线工具:16S预测培养基 生信绘图

科研经验:云笔记  云协作 公众号

编程模板: Shell  R Perl

生物科普:  肠道细菌 人体上的生命 生命大跃进  细胞暗战 人体奥秘  

写在后面

为鼓励读者交流、快速解决科研困难,我们建立了“宏基因组”专业讨论群,目前己有国内外5000+ 一线科研人员加入。参与讨论,获得专业解答,欢迎分享此文至朋友圈,并扫码加主编好友带你入群,务必备注“姓名-单位-研究方向-职称/年级”。PI请明示身份,另有海内外微生物相关PI群供大佬合作交流。技术问题寻求帮助,首先阅读《如何优雅的提问》学习解决问题思路,仍未解决群内讨论,问题不私聊,帮助同行。

学习16S扩增子、宏基因组科研思路和分析实战,关注“宏基因组”

点击阅读原文,跳转最新文章目录阅读

您可能也对以下帖子感兴趣

文章有问题?点此查看未经处理的缓存