
9 modules + 40-plus tools + a veteran's blunt commentary | A roundup of software and databases for 16S analysis pipelines

Sonia 生信者言 2022-03-28


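16S sequencing, the most common form of amplicon sequencing, is quick, simple and inexpensive, which has made it arguably the most popular type of high-throughput sequencing among researchers today.


Because the data volumes are small, more and more people without access to an HPC cluster can run the analysis themselves on a modest server or even a good laptop.


As a result, amplicon software keeps multiplying: from integrated, push-button pipelines to small tools that each solve one specific problem, there are easily more than a hundred. Here is a roundup of the mainstream software and databases with brief commentary. Additions and corrections are welcome.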


01

Integrated pipelines

1. QIIME

QIIME is an open-source bioinformatics pipeline for performing microbiome analysis from raw DNA sequencing data. QIIME includes demultiplexing and quality filtering, OTU picking, taxonomic assignment, phylogenetic reconstruction, and diversity analyses and visualizations.


Latest version: QIIME2 (QIIME1 is no longer supported or updated after January 1, 2018)

Reference: PMID:20383131

Download: https://docs.qiime2.org/2017.8/install/

Website:

QIIME2: https://docs.qiime2.org/2017.8/

QIIME1: http://qiime.org/

Example workflow: https://docs.qiime2.org/2017.8/tutorials/moving-pictures/

2. Mothur

Mothur is currently the most cited bioinformatics tool for analyzing 16S rRNA gene sequences. Step inside the wiki and user forum and learn how you can use mothur to process data generated by Sanger, PacBio, IonTorrent, 454, and Illumina (MiSeq/HiSeq).


Latest version: Version 1.39.5

Reference: PMID:19801464

Download: https://github.com/mothur/mothur/releases/tag/v1.39.5

Website: https://www.mothur.org/

Example workflow: https://www.mothur.org/wiki/MiSeq_SOP

3. Usearch

USEARCH is a unique sequence analysis tool with thousands of users world-wide, which combines many different algorithms into a single package with outstanding documentation and support.


Latest version: Version 10

Reference: PMID:20709691

Download: http://drive5.com/usearch/download.html

Website: http://drive5.com/usearch/

4. FunGene

Functional Gene Pipeline Scripts contains a set of python scripts that allow you to run one or more individual tools offered by the RDP FunGene Pipeline. These tools are offered in a modular fashion, allowing researchers to choose the appropriate subset based on their needs.


Latest version: Version 9.3

Reference: PMID:24101916

Website: http://fungene.cme.msu.edu/

Example workflow: http://fungene.cme.msu.edu/FunGenePipeline/

5. SILVAngs

SILVAngs is a data analysis service for ribosomal RNA gene (rDNA) amplicon reads from high-throughput sequencing approaches based on an automatic software pipeline. It uses the SILVA rDNA databases, taxonomies, and alignments as a reference. It facilitates the classification of rDNA reads and provides a wealth of results (tables, graphs and sequence files) for download.


Latest version: Version 9.3

Reference: PMID:23193283

Website: https://www.arb-silva.de/ngs/

Example workflow: https://www.arb-silva.de/ngs/#demo

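🚗  🚗  Veteran's take: Amplicon analysis is relatively mature and the tools are numerous, well over a hundred by my count. Installing them one by one wastes both resources and time, so pipeline packages that bundle many tools are popular. The best known are QIIME and Mothur, which bundle most of the analyses you are likely to need. The veteran clustering tool usearch has kept pace, bundling data preprocessing, OTU clustering, taxonomic annotation and diversity analysis; it offers fewer options than qiime, but covers the basics. The one drawback is that the 64-bit version requires a paid license. Databases such as RDP and SILVA have followed suit, with the SILVAngs online analysis platform, the FunGene functional-gene pipeline, RDP's own rdpipeline (http://pyro.cme.msu.edu/) and others that are not all listed here.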


02

Data quality control

1. FastQC

A quality control tool for high throughput sequence data.


Latest version: Version 0.11.5

Reference: PMID:22312429

Download: https://www.bioinformatics.babraham.ac.uk/projects/download.html#fastqc

Website: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/

2. Trimmomatic

A flexible trimmer for Illumina Sequence Data


Latest version: Version 0.36

Reference: PMID:24695404

Download: http://www.usadellab.org/cms/?page=trimmomatic

3. QIIME split_libraries_fastq.py


Software: http://qiime.org/

Command usage: http://qiime.org/scripts/split_libraries_fastq.html

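🚗  🚗  Veteran's take: Quality control comes up at several points in amplicon analysis. Raw reads off the sequencer first go through QC: adapters, barcodes and primers are trimmed, quality is assessed and filtered, and paired-end reads are merged based on their overlap. The merged sequences then go through QC again to remove low-quality sequences, sequences containing N and sequences that are too short, and only then are they used for clustering and annotation. All of these QC steps are grouped together here. FastQC, introduced in "A Brief History of NGS Data Formats" (《NGS数据格式演化简史》), is essentially standard for raw-data QC. Trimmomatic is a sliding-window filtering and trimming tool, useful for Illumina reads, whose quality drops markedly toward the end of the read. For filtering merged sequences, QIIME provides its own script that can be called directly.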
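To make the sliding-window idea concrete, here is a minimal Python sketch. It is not Trimmomatic's code and does not reproduce its exact behavior: the function names, the window size of 4 and the quality cutoff of 20 are all assumptions chosen for illustration. It scans a read from the start and cuts it where the mean quality of a window first drops below the cutoff.

```python
def phred33(qual_string):
    """Decode a FASTQ quality string (Phred+33 encoding) into integer scores."""
    return [ord(c) - 33 for c in qual_string]


def sliding_window_trim(seq, quals, window=4, min_avg_q=20):
    """Cut the read at the start of the first window whose mean quality
    is below min_avg_q. Hypothetical helper, for illustration only."""
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lists must have the same length")
    for start in range(0, len(seq) - window + 1):
        win = quals[start:start + window]
        if sum(win) / window < min_avg_q:
            # Keep everything before the failing window.
            return seq[:start], quals[:start]
    # No window failed: keep the whole read.
    return seq, quals


if __name__ == "__main__":
    read = "ACGTACGTAC"
    quality = [30] * 6 + [10] * 4  # quality collapses near the end of the read
    print(sliding_window_trim(read, quality))
```

In practice you would follow this with a minimum-length filter, since a read trimmed down to a few bases is of no use for merging.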


03

Read merging

1. FLASH

A very fast and accurate software tool to merge paired-end reads from NGS experiments.


Latest version: Version 1.2.11

Reference: PMID:21903629

Download: https://sourceforge.net/projects/flashpage/files/

Website: https://ccb.jhu.edu/software/FLASH/

2. PEAR

An ultrafast, memory-efficient and highly accurate pair-end read merger. It is fully parallelized and can run with as low as just a few kilobytes of memory.


Latest version: Version 0.9.8

Reference: PMID:24142950

Download: https://sco.h-its.org/exelixis/web/software/pear/downloads.html

Website: https://sco.h-its.org/exelixis/web/software/pear/

3. PANDAseq

PANDAseq assembles paired-end reads rapidly and with the correction of most errors. Uncertain error corrections come from reads with many low-quality bases identified by upstream processing.


Latest version: Version 2.11

Reference: PMID:22333067

Download: https://github.com/neufeld/pandaseq/releases/tag/v2.11

Website: http://neufeldserver.uwaterloo.ca/~apmasell/pandaseq_man1.html

4. fastq-join

Command-line tools for processing biological sequencing data


Reference: Command-line tools for processing biological sequencing data

Website: https://github.com/ExpressionAnalysis/ea-utils/blob/wiki/FastqJoin.md

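🚗  🚗  Veteran's take: FLASH remains the most widely used merging tool, but when the amplicon is too long or too short its results can be disappointing; in those cases PEAR or PANDAseq may do better.

fastq-join is the merger bundled with QIIME: join_paired_ends.py calls fastq-join by default, and other tools such as SeqPrep (https://github.com/jstjohn/SeqPrep) can be selected instead. SeqPrep still appears rarely in the literature. Both tools run reasonably fast.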
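For readers who have not looked inside a merger, the core step can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the algorithm of FLASH, PEAR, PANDAseq or fastq-join: the function names and the thresholds (minimum overlap of 10 bases, at most 10% mismatches) are assumptions, the longest acceptable overlap is taken rather than the best-scoring one, and base qualities are ignored, so the forward read wins wherever the two reads disagree.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T, N only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def merge_pair(fwd, rev, min_overlap=10, max_mismatch_frac=0.1):
    """Merge a read pair at the longest overlap whose mismatch fraction is
    acceptable. Returns the merged sequence, or None if no overlap qualifies.
    Hypothetical helper, for illustration only."""
    rev_rc = reverse_complement(rev)
    max_overlap = min(len(fwd), len(rev_rc))
    for overlap in range(max_overlap, min_overlap - 1, -1):
        tail = fwd[-overlap:]      # 3' end of the forward read
        head = rev_rc[:overlap]    # 5' end of the reverse-complemented mate
        mismatches = sum(a != b for a, b in zip(tail, head))
        if mismatches / overlap <= max_mismatch_frac:
            # Keep the forward read in full, then append the non-overlapping rest.
            return fwd + rev_rc[overlap:]
    return None


if __name__ == "__main__":
    amplicon = "ACGTTGCAAGGCTTAACCGGTTAGCTAGGA"
    forward = amplicon[:20]
    reverse = reverse_complement(amplicon[10:])
    print(merge_pair(forward, reverse) == amplicon)
```

The sketch also hints at why amplicon length matters: if the fragment is so long that the two reads overlap by fewer than `min_overlap` bases, nothing can be merged.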


04

Chimera removal

1. DECIPHER

DECIPHER is a software toolset that can be used for deciphering and managing biological sequences efficiently using the R programming language. DECIPHER's Find Chimeras web tool can be used to uncover chimeras hidden in 16S rRNA sequences.


Latest version: Version 2.2.0

Reference: PMID:22101057

Download: http://decipher.cee.wisc.edu/Download.html

Website: http://decipher.cee.wisc.edu/index.html

2. ChimeraSlayer

ChimeraSlayer uses BLAST to identify potential chimera parents and computes the optimal branching alignment of the query against two parents. An input with the pynast aligned representative sequences is suggested.


Latest version: Version 2.2.0

Reference: PMID:21212162

Download: https://sourceforge.net/projects/microbiomeutil/files/

Website: http://microbiomeutil.sourceforge.net/#A_CS

3. VSEARCH

VSEARCH supports de novo and reference based chimera detection, clustering, full-length and prefix dereplication, rereplication, reverse complementation, masking, all-vs-all pairwise global alignment, exact and global alignment searching, shuffling, subsampling and sorting. It also supports FASTQ file analysis, filtering, conversion and merging of paired-end reads.


Latest version: Version 2.4.4

Reference: PMID:27781170

Download: https://github.com/torognes/vsearch/releases

Website: https://github.com/torognes/vsearch

4. UCHIME2

UCHIME2 and UCHIME are algorithms for detecting chimeric sequences.


Latest version: Version 4.2

Reference: doi: https://doi.org/10.1101/074252

Download: http://drive5.com/uchime/uchime_download.html

Website: http://drive5.com/usearch/manual/uchime_algo.html

5. usearch61

usearch61 performs both de novo (abundance based) chimera and reference based detection. With usearch61, unclustered sequences should be used as input rather than a representative sequence set, as these sequences need to be clustered to get abundance data.


Reference: PMID:20709691

Download: http://drive5.com/usearch/download.html

Website: http://drive5.com/usearch/usearch_docs.html

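🚗  🚗  Veteran's take: Chimera removal relies on two approaches, de novo and reference-based. usearch61, which combines both, is bundled in QIIME (identify_chimeric_seqs.py) and is one of the mainstream methods. Note, though, that as mentioned above the 64-bit version of usearch is paid! A few years ago standalone uchime was also widely used for chimera removal, but the official site now advises against installing uchime separately and recommends downloading usearch instead. VSEARCH was released as an open-source alternative to usearch with comparable speed, and it is the method mothur recommends for chimera removal and clustering; it is worth a try. ChimeraSlayer is slow, and DECIPHER is soundly beaten in the benchmarks on the uchime site, so neither is recommended here.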
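The reference-based idea can be illustrated with a toy Python sketch. It is not UCHIME, VSEARCH or ChimeraSlayer: it assumes the query and all references are already aligned to the same length, it counts plain mismatches instead of computing a proper score, and the function names and the `min_gain` threshold of 3 are made up for illustration. A query is flagged when a "left part from one reference, right part from another" model explains it clearly better than any single reference does.

```python
def count_mismatches(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def looks_chimeric(query, references, min_gain=3):
    """Flag the query if a two-parent model beats the best single reference
    by at least min_gain mismatches. Assumes pre-aligned, equal-length
    sequences. Hypothetical helper, for illustration only."""
    best_single = min(count_mismatches(query, ref) for ref in references)
    best_model = best_single
    for breakpoint in range(1, len(query)):
        # Best reference for each side of the candidate breakpoint.
        left = min(count_mismatches(query[:breakpoint], ref[:breakpoint])
                   for ref in references)
        right = min(count_mismatches(query[breakpoint:], ref[breakpoint:])
                    for ref in references)
        best_model = min(best_model, left + right)
    return best_single - best_model >= min_gain


if __name__ == "__main__":
    refs = ["A" * 20, "C" * 20]
    print(looks_chimeric("A" * 10 + "C" * 10, refs))  # half of each parent
    print(looks_chimeric("A" * 19 + "C", refs))       # a single sequencing error
```

The `min_gain` margin is what keeps an ordinary sequencing error near the end of a read from being called a chimera. De novo methods apply the same logic but take the more abundant sequences in the sample itself as candidate parents.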


05

OTU clustering

1. UCLUST

UCLUST creates “seeds” of sequences which generate clusters based on percent identity. Uclust_ref works like uclust but takes a reference database to use as seeds. New clusters can be toggled on or off.


Reference: PMID:20709691

Download: http://www.drive5.com/uclust/downloads1_2_22q.html

Website: https://www.drive5.com/usearch/manual/uclust_algo.html

2. Uparse

UPARSE is a method for generating clusters (OTUs) from next-generation sequencing reads of marker genes such as 16S rRNA, the fungal ITS region and the COI gene.


Reference: PMID:23955772

Download: http://www.drive5.com/usearch/manual/cmd_cluster_otus.html

Website: https://www.drive5.com/uparse/

3. CD-HIT

CD-HIT is a very widely used clustering program, which applies a “longest-sequence-first list removal algorithm” to cluster sequences.


Latest version: Version 4.6.8

Reference: PMID:23060610

Download: https://github.com/weizhongli/cdhit/releases

Website: http://weizhongli-lab.org/cd-hit/

4. Mothur

For the Mothur method, the clustering algorithm may be specified as nearest-neighbor, furthest-neighbor, or average-neighbor. The default algorithm is furthest-neighbor.


See Section 01 for details.

5. Oclust

A pipeline for clustering long 16S rRNA sequencing reads, or any sequences, into OTUs.


Reference: PMID:26434730

Download: https://github.com/oscar-franzen/oclust/

Website: https://omictools.com/oclust-tool

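🚗  🚗  Veteran's take: There are many OTU clustering methods, falling mainly into heuristic (greedy) algorithms and hierarchical clustering algorithms. The former include uparse, uclust and CD-HIT; the latter include mothur and oclust. In practice, the mainstream clustering tools are still uparse, uclust and mothur. Most of the tools above are bundled in QIIME, where the default clustering method is uclust (pick_otus.py). Oclust, listed last, targets clustering of long third-generation PacBio reads; with second-generation sequencing still dominant, it sees little use so far.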
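The heuristic family is easy to picture with a small Python sketch of greedy centroid clustering. This is only an illustration of the general idea, not the uclust, uparse or CD-HIT implementation: the function names are invented, sequences are processed in order of decreasing abundance (an assumption), and identity is computed by naive position-by-position comparison rather than by a real alignment.

```python
def identity(a, b):
    """Fraction of matching positions, relative to the longer sequence.
    A crude stand-in for alignment-based percent identity."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def greedy_cluster(abundance, threshold=0.97):
    """Greedy centroid clustering. `abundance` maps sequence -> read count.
    Returns a list of (centroid, members) tuples.
    Hypothetical helper, for illustration only."""
    # Most abundant sequences are considered first and become centroids.
    ordered = sorted(abundance, key=lambda s: (-abundance[s], s))
    clusters = []
    for seq in ordered:
        for centroid, members in clusters:
            if identity(seq, centroid) >= threshold:
                members.append(seq)  # join the first centroid that is close enough
                break
        else:
            clusters.append((seq, [seq]))  # no match: seq seeds a new OTU
    return clusters


if __name__ == "__main__":
    reads = {"A" * 100: 50, "A" * 98 + "CC": 5, "C" * 100: 10}
    for centroid, members in greedy_cluster(reads):
        print(centroid[:10] + "...", len(members))
```

Each sequence is compared only against existing centroids, never against every other sequence, which is why this family scales well. Hierarchical methods instead start from a full pairwise distance matrix, which is more expensive but does not depend on input order.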


06

Taxonomic annotation

1. Greengenes

A 16S rRNA gene database addresses limitations of public repositories by providing chimera screening, standard alignment, and taxonomic classification using multiple published taxonomies.


Latest version: Version 13.5

Reference: PMID:16820507

Download: http://greengenes.secondgenome.com/downloads/database/13_5

Website: http://greengenes.secondgenome.com/

2. Silva

SILVA provides comprehensive, quality checked and regularly updated datasets of aligned small (16S/18S, SSU) and large subunit (23S/28S, LSU) ribosomal RNA (rRNA) sequences for all three domains of life (Bacteria, Archaea and Eukarya).

Latest version: SILVA 128

Reference: PMID:23193283

Download: https://www.arb-silva.de/documentation/release-128/

Website: https://www.arb-silva.de/

3. RDP

RDP provides quality-controlled, aligned and annotated Bacterial and Archaeal 16S rRNA sequences, and Fungal 28S rRNA sequences, and a suite of analysis tools to the scientific community.


Latest version: Version 11.5

Reference: PMID:24288368

Download: http://rdp.cme.msu.edu/misc/rel10info.jsp

Website: http://rdp.cme.msu.edu/index.jsp

4. Unite

UNITE is a user-friendly Nordic ITS Ectomycorrhiza Database designed to provide a stable and reliable platform for sequence-borne identification of ectomycorrhizal asco- and basidiomycetes, including only high-quality sequences of well identified fungi.


Latest version: Version 7.2

Reference: PMID:15869663

Download: https://unite.ut.ee/repository.php

Website: https://unite.ut.ee/

5. FunGene

Functional Gene Pipeline Scripts contains a set of python scripts that allow you to run one or more individual tools offered by the RDP FunGene Pipeline. These tools are offered in a modular fashion, allowing researchers to choose the appropriate subset based on their needs.


Latest version: Version 9.3

Reference: PMID:24101916

Website: http://fungene.cme.msu.edu/

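🚗  🚗  Veteran's take: In amplicon analysis, 16S annotation relies mainly on Greengenes, Silva and RDP. Greengenes was the most used early on, largely because it ships with QIIME, but it has not been updated since May 2013, so many analysts have switched to Silva, which is still updated roughly every year. Interestingly, of the two well-known functional prediction tools covered later, PICRUSt must be used with Greengenes, while Tax4Fun is recommended for use with Silva. For fungal ITS annotation, the Unite database remains the main choice. For functional genes, annotation against the NT database used to give dismal results; FunGene has improved steadily in recent years and is now essentially the go-to choice for taxonomic annotation of functional-gene amplicons.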


07

Sequence alignment

1. PyNAST

PyNAST is a reimplementation of the NAST sequence aligner, which has become a popular tool for adding new 16S rRNA sequences to existing 16S rRNA alignments.


Latest version: PyNAST 1.0

Reference: PMID:19914921

Download: http://biocore.github.io/pynast/install.html

Website: http://biocore.github.io/pynast/

2. Muscle

MUSCLE is an alignment method which stands for MUltiple Sequence Comparison by Log-Expectation. On average, MUSCLE is cited by ten new papers every day.


Latest version: Version 3.8.31

Reference: PMID:15034147

Download: http://www.drive5.com/muscle/downloads.htm

Website: http://www.drive5.com/muscle/

3. Mafft

MAFFT is a multiple sequence alignment program for unix-like operating systems. It offers a range of multiple alignment methods, L-INS-i (accurate; for alignment of <∼200 sequences), FFT-NS-2 (fast; for alignment of <∼30,000 sequences), etc.


Latest version: Version 7.310

Reference: PMID:12136088

Download: http://mafft.cbrc.jp/alignment/software/#Download%20and%20Installation

Website: http://mafft.cbrc.jp/alignment/software/

4. Infernal

Infernal ("INFERence of RNA ALignment") is for searching DNA sequence databases for RNA structure and sequence similarities.


Latest version: Version 1.1.2

Reference: PMID:24008419

Download: http://eddylab.org/infernal/#Downloads

Website: http://eddylab.org/infernal/

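🚗  🚗  Veteran's take: All of these aligners are bundled in QIIME and can be called from there. PyNAST and Infernal are similar in that both align against a reference, but Infernal is much slower and far less used. Muscle and Mafft are both global aligners that do not depend on a reference. Muscle claims to be cited by ten new papers a day; that figure is not limited to microbiome work, but it is certainly widely used. Mafft is similar: some benchmarks show it to be more accurate, though it has no speed advantage. Both are used when no good reference is available (for functional genes, for example).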


08

Functional prediction

1. PICRUSt

PICRUSt (pronounced “pie crust”) is a bioinformatics software package designed to predict metagenome functional content from marker gene (e.g., 16S rRNA) surveys and full genomes.


Latest version: PICRUSt 1.1.2

Reference: PMID:23975157

Download: https://github.com/picrust/picrust

Website: http://picrust.github.io/picrust/

2. Tax4Fun

Tax4Fun is an open-source R package that predicts the functional capabilities of microbial communities based on 16S datasets. Tax4Fun is applicable to output as obtained from the SILVAngs web server or the application of QIIME against the SILVA database.


Reference: PMID:25957349

Download: http://tax4fun.gobics.de/#Download

Website: http://tax4fun.gobics.de/

3. FAPROTAX

FAPROTAX is a database that maps prokaryotic clades (e.g. genera or species) to established metabolic or other ecologically relevant functions, using the current literature on cultured strains.


Latest version: FAPROTAX 1.1

Reference: PMID:28812567

Download: http://www.zoology.ubc.ca/louca/FAPROTAX/lib/php/index.php?section=Download

Website: http://www.zoology.ubc.ca/louca/FAPROTAX/lib/php/index.php

4. FUNGuild

An open annotation tool for parsing fungal community datasets by ecological guild.


Reference: https://doi.org/10.1016/j.funeco.2015.06.006

Download: https://github.com/UMNFuN/FUNGuild.git

Website: http://www.stbates.org/guilds/app.php

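🚗  🚗  Veteran's take: Amplicon sequencing is by nature a taxon-level analysis, so being able to predict function as well broadens the scientific questions it can address. PICRUSt is still the most widely used functional prediction tool, but as interest grows in archaea, fungi and other non-bacterial groups, and as annotation databases change, other tools are seeing more use. For example, as noted above, Tax4Fun has gained ground with the shift in annotation databases; FAPROTAX focuses on biogeochemical cycling in environmental samples; and FUNGuild predicts fungal function.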


09

Common plotting and statistics software

1. Basic plotting

R ggplot2: https://cran.r-project.org/web/packages/ggplot2/

Perl SVG: https://metacpan.org/pod/SVG

Python matplotlib: https://matplotlib.org/

QIIME: http://qiime.org/

2. Taxonomic statistics and visualization

STAMP: kiwi.cs.dal.ca/Software/STAMP

LEfSe: http://huttenhower.sph.harvard.edu/galaxy/

Metastats: http://clovr.org/docs/metastats/

QIIME: http://qiime.org/

3. Diversity analysis

QIIME: http://qiime.org/

Mothur: https://www.mothur.org/

Usearch: http://drive5.com/usearch/

4. Phylogenetic tree visualization

GraPhlAn: http://huttenhower.org/galaxy/

iTOL: https://itol.embl.de/

5. Environmental factor analysis

R vegan: https://cran.r-project.org/web/packages/vegan/

Canoco5: http://www.canoco5.com/

6. Interaction network analysis

Cytoscape: http://www.cytoscape.org/

Gephi: https://gephi.org/

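🚗  🚗  Veteran's take: This section lists some commonly used tools. Generally, once you have the taxonomically annotated otu_table and the phylogenetic tree rep_phylo.tre built from the aligned sequences, the basic analysis is done. Downstream analysis mainly covers taxonomic statistics and visualization, between-group comparison (diversity: alpha_div; community structure: beta_div, etc.) and association analysis (interaction networks, environmental factors, etc.), plus functional prediction where needed, combined with validation experiments to explain the scientific questions associated with changes in microbial diversity.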
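As a small illustration of what alpha and beta diversity mean numerically, the Python sketch below computes the Shannon index for one sample and the Bray-Curtis dissimilarity between two samples from OTU counts. It is a teaching aid only: the function names and the toy count vectors are assumptions, and real analyses would normally use QIIME, mothur or R vegan, which also handle rarefaction and phylogeny-aware metrics.

```python
import math


def shannon(counts):
    """Shannon diversity index (natural log) of one sample's OTU counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two samples whose counts are
    listed in the same OTU order. 0 = identical, 1 = no shared OTUs."""
    numerator = sum(abs(x - y) for x, y in zip(a, b))
    denominator = sum(a) + sum(b)
    return numerator / denominator if denominator else 0.0


if __name__ == "__main__":
    # Toy OTU table: each list holds the counts of the same three OTUs.
    sample_a = [10, 10, 0]
    sample_b = [0, 10, 10]
    print("Shannon A:", shannon(sample_a))
    print("Bray-Curtis A vs B:", bray_curtis(sample_a, sample_b))
```

A pairwise Bray-Curtis matrix computed this way is the usual input to the ordination and clustering plots used to compare community structure between groups.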




/End.

