Cell: A New Method, PopCOGenT, Identifies Gene Flow Between Microbial Genomes

宏基因组 (Metagenome) 2023-08-18


A Reverse Ecology Approach Based on a Biological Definition of Microbial Populations

Cell (impact factor: 36.216)

2019-08-08  Article

DOI: https://doi.org/10.1016/j.cell.2019.06.033

Full text is open access: https://www.sciencedirect.com/science/article/pii/S0092867419307366

First authors: Philip Arevalo1,4,5, David VanInsberghe1,5

Corresponding author: Martin Polz3,6,*

Other authors: Joseph Elsherbini, Jeff Gore

Affiliations:

1 Microbiology Graduate Program, Massachusetts Institute of Technology, Cambridge, MA 02139, USA

2 Physics of Living Systems, Department of Physics, Massachusetts Institute of Technology, Cambridge, MA 02139, USA

3 Department of Civil and Environmental Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA

Daily digest

https://www.mr-gut.cn/papers/read/1051478162

Cell: New method PopCOGenT identifies gene flow between microbial genomes

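  1. A new method, PopCOGenT, estimates recent gene flow between co-existing microbial genomes;

  2. Networks of recent gene flow contain discrete clusters that correspond to species;

  3. Gene flow discontinuities delineate adaptively optimized populations, and mapping populations onto samples can improve association studies;

  4. Applied to the human commensal bacterium Ruminococcus gnavus, the method shows that it splits into sharply delineated populations with different associations with health and disease;

  5. Defining populations by recent gene flow will facilitate the analysis of bacterial and archaeal genomes using ecological and evolutionary theory developed for plants and animals, allowing unifying principles to be tested across all of biology.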

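Editor's comment: After the human genome was published in 2001, research entered the post-genomic era. Mining the human genome from many angles went hand in hand with the development of new methods and software, which made more research questions tractable. In recent years microbiome research has also accumulated a large amount of data, and with the data released by the Human Microbiome Project (HMP) as a foundation, many earlier ideas can now be put into practice. Cell has recently published several papers based on mining HMP data, including two from the same issue this week: "Cell: A head-to-head comparison of 20 metagenomic taxonomic classification tools" (http://www.mr-gut.cn/papers/read/1033790381 ) and "Cell: Mining previously unknown small proteins in the human microbiota" (http://www.mr-gut.cn/papers/read/1056603186 ). This paper is another Cell article that mines human microbiome data and develops a method. The approach is worth learning from, and the software is worth following and using.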

Abstract

Delineating ecologically meaningful populations among microbes is important for identifying their roles in environmental and host-associated microbiomes. Here, we introduce a metric of recent gene flow, which when applied to co-existing microbes, identifies congruent genetic and ecological units separated by strong gene flow discontinuities from their next of kin. We then develop a pipeline to identify genome regions within these units that show differential adaptation and allow mapping of populations onto environmental variables or host associations. Using this reverse ecology approach, we show that the human commensal bacterium Ruminococcus gnavus breaks up into sharply delineated populations that show different associations with health and disease. Defining populations by recent gene flow in this way will facilitate the analysis of bacterial and archaeal genomes using ecological and evolutionary theory developed for plants and animals, thus allowing for testing unifying principles across all biology.

Main results

Figure 1. Recombinogenic Microbial Genomes Share Longer and More Frequent Regions of Identity than Non-recombinogenic Microbial Genomes

(A) Schematic of PopCOGenT method of measuring recent gene transfer. Distribution of SNPs (left) and the expected distribution of identical regions (right) within non-recombinogenic and recombinogenic genomes.

(B) Examples of the distributions of identical genome regions of pairs of genomes from seven groups of microbes representing non-recombinogenic and recombinogenic bacteria and archaea. Distributions measure the fraction of a pairwise genome alignment that occurs in an identical region of an arbitrary minimum size normalized by the divergence of the two genomes (x axis); expected distribution based on null model of neutral mutational accumulation shown as a gray line.

(C) Length bias of identical genome region distributions increases linearly with genome size in non-recombinogenic strains (circles) while in recombinogenic strains (squares) it exceeds this prediction as indicated by the median length bias measured for each indicated clade. Error bars represent the interquartile range of length bias. Solid gray line, best fit linear regression of length bias against genome size for non-recombinogenic strains; dotted gray line, upper bound of the 90% prediction interval of the linear regression.

(D) The fraction of ribosomal proteins in long identical regions increases with genome size in non-recombinogenic microbes (STAR Methods, “Ribosomal protein enrichment”). Each point represents the fraction of ribosomal proteins in such regions for each indicated microbial group. The box plots show the interquartile range of the fraction of ribosomal proteins in long identical regions and the whiskers indicate 1.5 times the interquartile range.

Figure 2. Length Bias Measures Recent Gene Transfer Events

The mean length bias (red dashed line), ratio of homoplasies to mutational polymorphisms (h/m, [A] green dashed line) and ratio of recombination to mutation (r/θ, [B] purple dashed line) were measured over the course of evolution for 1 megabase (MB) genomes simulated according to the phylogenetic trees below the plots (STAR Methods, “Simulation of divergence after transfer”). The minimum and maximum values of mean length bias, h/m, and r/θ over 50 simulations are shown as the shaded regions. The blue dashed line indicates when transfers between genomes were simulated.

Figure 3. Clusters in Networks of Recent Gene Flow Correspond to Previously Identified Ecological Populations for Vibrio, Sulfolobus islandicus, and Prochlorococcus

(A–C) The ribosomal reference tree (left panel) and complete gene flow network (right panel) for Vibrionaceae (A), Sulfolobus islandicus (B), and Prochlorococcus (C). Nodes represent genomes and edges represent the inferred amount of gene flow between them as measured by the length bias of the observed distribution of identical genome regions corrected by a purifying selection threshold (see text). Black and gray edge color denotes gene flow within and between populations, where populations were identified by application of a standard clustering algorithm (Infomap; Rosvall et al., 2009) to the raw gene flow network. Edge thickness corresponds to the amount of gene flow, and node size to size of clonal clusters, i.e., strains collapsed into a single group if they were too closely related to have transfer accurately evaluated (<0.035% divergence) (STAR Methods, “Network construction and clustering”). Colored nodes and leaves represent strains assigned to populations in previous studies, gray nodes and leaves represent strains with unknown population assignment. Population assignments in (A) from Hunt et al. (2008) and Preheim et al. (2011), (B) from Cadillo-Quiroz et al. (2012), and (C) from Kashtan et al. (2014). Dashed oval in (A) indicates populations of Vibrio cyclitrophicus.
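The caption mentions that strains closer than 0.035% divergence are collapsed into clonal clusters before the network is drawn. The Python sketch below illustrates one way such a collapse could work. The function name, the input format, and the single-linkage rule (any pair below the threshold joins two groups) are assumptions for illustration, not details taken from the paper, and the Infomap clustering step itself is not shown.

```python
def collapse_clonal_clusters(strains, divergence, threshold=0.00035):
    """Group strains whose pairwise divergence is below the threshold.

    `divergence` maps frozenset({a, b}) to a fractional divergence;
    0.00035 corresponds to the 0.035% cutoff quoted in the caption.
    Grouping is single linkage via union-find (an assumption).
    """
    parent = {s: s for s in strains}

    def find(s):
        while parent[s] != s:
            parent[s] = parent[parent[s]]  # path halving
            s = parent[s]
        return s

    for pair, d in divergence.items():
        if d < threshold:
            a, b = tuple(pair)
            parent[find(a)] = find(b)

    clusters = {}
    for s in strains:
        clusters.setdefault(find(s), []).append(s)
    return sorted(sorted(members) for members in clusters.values())


# Toy usage with hypothetical strain names and divergences.
toy_divergence = {
    frozenset({"A", "B"}): 0.0001,  # below 0.035%: same clonal cluster
    frozenset({"A", "C"}): 0.01,
    frozenset({"B", "C"}): 0.01,
}
clonal = collapse_clonal_clusters(["A", "B", "C"], toy_divergence)
```

Each resulting group would then become a single node, with node size reflecting the number of collapsed strains.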

Figure 4. Fine Scale Population Structure Is Evident in R. gnavus Populations

(A) PopCOGenT identified three separate clusters based on gene-flow, identified as populations I, II, and III. Black edges represent gene flow within populations, gray edges represent gene flow between populations, edge thickness corresponds to the amount of inferred gene flow between genomes, and node size represents the size of clonal clusters, i.e., strains collapsed into a single group if they were too closely related to have transfer accurately evaluated (<0.035% divergence) (STAR Methods, “Network construction and clustering”). Isolation source is indicated by node color and population assignment is indicated by node outline color.

(B) The bar graph shows the total alignment length supporting the monophyly of each population. The alignment length supporting the monophyly of all populations simultaneously is shown as a gray bar.

Figure 5. Loci that Have Recently Undergone Population-Specific Selective Sweeps Have Low within-Population Diversity, Are Distributed throughout the Genome, and Are Restricted to Specific Proteins

(A–C) Each point in the top panels represents the within-population (A) and between-population (B) nucleotide diversity for each of the predicted sweep regions in populations I and II. Dashed lines show the genome-wide average within-population (A) and between-population (B) nucleotide diversity. We also measure the fixation index (Fst) for all sweep regions (C) and show the fixation index measured across the entire genome as a dashed line.

(D) The locations of all population I and II sweep regions are shown in the reference genome FJUS01. The breaks between contiguous sequences are indicated with horizontal red lines, and each contig is alternately colored black or gray to further highlight breaks in the assembly. The locations of sweep regions are shown on gene diagrams of those loci to highlight how sweeps are restricted to specific genes and domains in the genome. Sweeps from population I and II are shown in red and blue, respectively.
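For readers unfamiliar with the statistics in panels (A–C), the following Python sketch computes within-population and between-population nucleotide diversity as mean pairwise differences per site, and a fixation index of the form Fst = 1 - (mean within-population diversity) / (between-population diversity). This summary does not state which Fst estimator the paper uses, so that formula and all function names are assumptions. The sketch also assumes gap-free aligned sequences, at least two sequences per population, and nonzero between-population diversity.

```python
from itertools import combinations, product


def pairwise_diff(a, b):
    """Proportion of aligned sites that differ between two sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def diversity_within(pop):
    """Mean pairwise difference among sequences of one population."""
    pairs = list(combinations(pop, 2))
    return sum(pairwise_diff(a, b) for a, b in pairs) / len(pairs)


def diversity_between(pop1, pop2):
    """Mean pairwise difference between sequences of two populations."""
    pairs = list(product(pop1, pop2))
    return sum(pairwise_diff(a, b) for a, b in pairs) / len(pairs)


def fst(pop1, pop2):
    """Fst as 1 - within / between diversity (assumed estimator)."""
    within = (diversity_within(pop1) + diversity_within(pop2)) / 2
    return 1 - within / diversity_between(pop1, pop2)


# Toy usage with made-up 4-bp sequences for two populations.
pop_i = ["AAAA", "AAAT"]
pop_ii = ["TTTT", "TTTA"]
toy_fst = fst(pop_i, pop_ii)
```

A sweep region of the kind shown in the figure would be expected to combine low within-population diversity with a high Fst relative to the genome-wide values.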

Figure 6. Gene-Specific Sweeps within the R. gnavus Core Genome Differentiate Populations I and II

(A–C) Each plot shows the density of population-differentiating SNPs along 50-bp windows in putative sweep regions. Population-differentiating SNPs are defined as SNPs that have fixed in one population but are entirely absent from all other populations. The panels depict regions that swept only in population II (A), only in population I (B), and that had separate alleles sweep in each population (C). The sequence similarity between all strains in the sweep regions and the areas that surround them are illustrated using maximum likelihood phylogenetic trees and genes in each region are indicated above each panel. The scale bar indicates substitutions per site.
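The definition of a population-differentiating SNP in this caption translates directly into code. The Python sketch below is an illustration only: it assumes gap-free aligned sequences of equal length, uses hypothetical function names, and counts SNPs in consecutive non-overlapping 50-bp windows, whereas the paper may use a different windowing scheme.

```python
def differentiating_sites(target, others):
    """Sites fixed for one allele in `target` and absent from all `others`.

    `target` is a list of aligned sequences from one population;
    `others` is a list of such lists, one per other population.
    """
    sites = []
    for i in range(len(target[0])):
        alleles = {seq[i] for seq in target}
        if len(alleles) != 1:
            continue  # not fixed in the target population
        (allele,) = alleles
        if all(seq[i] != allele for pop in others for seq in pop):
            sites.append(i)
    return sites


def window_density(sites, length, window=50):
    """Count differentiating sites per non-overlapping window (assumption)."""
    counts = [0] * ((length + window - 1) // window)
    for i in sites:
        counts[i // window] += 1
    return counts


# Toy usage: only the last site is fixed in population I and absent elsewhere.
pop_i = ["ACGT", "ACGT"]
pop_ii = ["ACGA", "ACGA"]
toy_sites = differentiating_sites(pop_i, [pop_ii])
toy_density = window_density(toy_sites, 4, window=2)
```

Plotting such counts along a sweep region would give a density profile of the kind shown in each panel.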

Figure 7. Populations I and II Are Differentially Enriched in Healthy and Diseased Subjects

(A–D) Each plot shows the coverage of metagenomic reads at population-specific SNPs (A and B) and population-specific genes (C and D) in population I (A and C) and population II (B and D). Three different metagenomic sample types were tested: healthy subjects (20 subjects, 5.5 × 10^8 reads), subjects with ulcerative colitis (UC; 6 subjects, 4.8 × 10^8 reads), and subjects with Crohn’s disease (CD; n = 52 subjects, 1.2 × 10^9 reads). Error bars show 95% binomial confidence interval.
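The caption does not say how the 95% binomial confidence intervals were computed. As one common choice, the Python sketch below uses the Wilson score interval. Both that choice and the framing of the data (for example, reads matching a population-specific SNP counted as successes out of all reads considered) are assumptions for illustration, and the counts in the usage line are made up.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion.

    z=1.96 gives an approximately 95% interval. Which interval the
    paper used is not stated in this summary (assumption).
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(
        p * (1 - p) / trials + z * z / (4 * trials * trials)
    ) / denom
    return center - half, center + half


# Toy usage with made-up counts: 50 matching reads out of 100.
low, high = wilson_interval(50, 100)
```

With read counts on the order of those in the figure (hundreds of millions of reads per sample type), intervals of this kind become very narrow, so differences between sample types are driven mainly by the proportions themselves.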
