phyloseq | 用 R 分析微生物组数据及可视化（三）

Original 鲍志炜生信菜鸟团 2022-07-05

收录于合集 #宏基因组数据分析 24个

上期回顾

在之前的两篇中，介绍了phyloseq包的功能与数据导入以及phyloseq包中的各种绘图函数，在这一期中主要介绍如何使用phyloseq包进行排序分析（Ordination analysis）。

排序分析

排序分析（Ordination analysis）是探索复杂的系统发育测序数据的得力工具。简而言之，排序（ordination）的过程就是在一个可视化的低维空间重新排列这些样方，使得样方之间的距离最大程度地反映出平面散点图内样方之间的关系信息。phyloseq分别通过ordinate()和plot_ordination()函数，来进行排序分析及其可视化。虽然排序分析有很多方法和参数，但最基本的代码就是这样：

my.physeq <- import("Biom", BIOMfilename="myBiomFile.biom")
my.ord    <- ordinate(my.physeq)
plot_ordination(my.physeq, my.ord, color="myFavoriteVarible")

PCoA 主坐标分析

PCoA（principal co-ordinates analysis）主坐标分析法是一种与 PCA 类似的降维排序方法。PCoA与PCA的区别在于PCA是基于原始的物种组成矩阵所做的分析，仅仅比较的是物种丰度的不同，而 PCoA 首先根据不同的距离算法计算样品之间的距离，然后对距离矩阵进行处理，使图中点间的距离正好等于原来的差异数据，实现定性数据的定量转换。

下面以 Global patterns of 16S rRNA diversity at a depth of millions of sequences per sample 这篇文章的 Figure 5 为例，作者展示了 unweighted-UniFrac 距离矩阵 PCoA 的前三个轴。

下面将在完整的数据集上再现 unweighted-UniFrac 距离矩阵的计算。由于 OTU 数量众多，所以可能需要很长的时间，建议对大型数据集进行并行化运算，有关并行化的详细信息，请参阅UniFrac()的文档及示例。

data(GlobalPatterns)
# GPUF <- UniFrac(GlobalPatterns)
# Load the pre-computed distance matrix, GPUF
load(system.file("doc", "Unweighted_UniFrac.RData", package="phyloseq"))
# Calculate the PCoA on this distance matrix, GPUF
GloPa.pcoa = ordinate(GlobalPatterns, method="PCoA", distance=GPUF)

在查看结果之前，我们可以用碎石图来查看排名前几的轴的贡献率：

plot_scree(GloPa.pcoa, "Scree plot for Global Patterns, UniFrac/PCoA")

通过碎石图，可以看出前三个轴代表距离总变化的43％，第四轴代表另外9％，因此也可能需要进行探索。碎石图是任何排序方法的重要工具，因为轴的相对贡献率在不同的数据集之间可能变化很大。

下面开始重复 Figure 5，但我们将原来三维的图，转化为两张图：

(p12 <- plot_ordination(GlobalPatterns, GloPa.pcoa, "samples", color="SampleType") + 
  geom_point(size=5) + geom_path() + scale_colour_hue(guide = FALSE) )

(p13 <- plot_ordination(GlobalPatterns, GloPa.pcoa, "samples", axes=c(1, 3),
  color="SampleType") + geom_line() + geom_point(size=5) )

NMDS 非度量多维尺度分析

非度量多维尺度分析是一种将多维空间的研究对象（样本或变量）简化到低维空间进行定位、分析和归类，同时又保留对象间原始关系的数据分析方法。适用于无法获得研究对象间精确的相似性或相异性数据，仅能得到他们之间等级关系数据的情形。其基本特征是将对象间的相似性或相异性数据看成点间距离的单调函数，在保持原始数据次序关系的基础上，用新的相同次序的数据列替换原始数据进行度量型多维尺度分析。

其特点是根据样品中包含的物种信息，以点的形式反映在多维空间上，而对不同样品间的差异程度，则是通过点与点间的距离体现的，最终获得样品的空间定位点图。

# (Re)load UniFrac distance matrix and GlobalPatterns data
data(GlobalPatterns)
load(system.file("doc", "Unweighted_UniFrac.RData", package="phyloseq"))
# perform NMDS, set to 2 axes
GP.NMDS <- ordinate(GlobalPatterns, "NMDS", GPUF)

## Run 0 stress 0.1432774 
## Run 1 stress 0.1432625 
## ... New best solution
## ... Procrustes: rmse 0.003499558  max resid 0.01265589 
## Run 2 stress 0.2467057 
## Run 3 stress 0.2234687 
## Run 4 stress 0.1432774 
## ... Procrustes: rmse 0.003483686  max resid 0.01260292 
## Run 5 stress 0.1432774 
## ... Procrustes: rmse 0.003487121  max resid 0.01261713 
## Run 6 stress 0.1856306 
## Run 7 stress 0.1432625 
## ... New best solution
## ... Procrustes: rmse 1.405265e-05  max resid 3.151096e-05 
## ... Similar to previous best
## Run 8 stress 0.2574296 
## Run 9 stress 0.1809282 
## Run 10 stress 0.1809285 
## Run 11 stress 0.1432625 
## ... New best solution
## ... Procrustes: rmse 4.788746e-06  max resid 1.24343e-05 
## ... Similar to previous best
## Run 12 stress 0.1432774 
## ... Procrustes: rmse 0.003470474  max resid 0.01255052 
## Run 13 stress 0.1432625 
## ... Procrustes: rmse 1.152937e-05  max resid 3.31644e-05 
## ... Similar to previous best
## Run 14 stress 0.1432625 
## ... New best solution
## ... Procrustes: rmse 2.462092e-06  max resid 6.050872e-06 
## ... Similar to previous best
## Run 15 stress 0.1856315 
## Run 16 stress 0.1432625 
## ... Procrustes: rmse 9.826556e-06  max resid 3.260556e-05 
## ... Similar to previous best
## Run 17 stress 0.1809284 
## Run 18 stress 0.1856306 
## Run 19 stress 0.1432774 
## ... Procrustes: rmse 0.003490316  max resid 0.01261738 
## Run 20 stress 0.1432774 
## ... Procrustes: rmse 0.003471665  max resid 0.01255283 
## *** Solution reached

(p <- plot_ordination(GlobalPatterns, GP.NMDS, "samples", color="SampleType") +
  geom_line() + geom_point(size=5) )

使用非度量多维尺度分析对GlobalPatterns数据集的样本之间的 unweighted-UniFrac 距离矩阵进行排序。每个样本点根据环境类型着色，如果它们属于同一类型，则由一条线连接。与GlobalPatterns文章中的图5进行比较。该图很好地显示了来自不同栖息地的微生物群落之间的相对差异。但是，它未能表明群落之间的差异。对于提供解释样本（或样本组）之间差异的群落信息的排序方法，我们使用对应分析（CA）。

CA 对应分析

在这一节中，我们使用 Correspondence Analysis 的排序方法的各种功能继续探索GlobalPatterns数据集。

让我们首先进行对应分析并查看碎石图。碎石图表明GlobalPatterns数据集具有相当高的维度，前两个 CA 轴占总变异的不到17％，但随着轴数的增加，贡献度并没有急剧下降，每个轴仅比前一个下降一点点。

首先，为了节约运行时间，我们只取GlobalPatterns数据集的子集：

data(GlobalPatterns)
# Take a subset of the GP dataset, top 200 species
topsp <- names(sort(taxa_sums(GlobalPatterns), TRUE)[1:200])
GP    <- prune_taxa(topsp, GlobalPatterns)
# Subset further to top 5 phyla, among the top 200 OTUs.
top5ph <- sort(tapply(taxa_sums(GP), tax_table(GP)[, "Phylum"], sum), decreasing=TRUE)[1:5]
GP     <- subset_taxa(GP, Phylum %in% names(top5ph))
# Define a human-associated versus non-human categorical variable:
human <- get_variable(GP, "SampleType") %in% c("Feces", "Mock", "Skin", "Tongue")
# Add new human variable to sample data:
sample_data(GP)$human <- factor(human)

进行 CA 对应分析：

# Now perform a unconstrained correspondence analysis
gpca  <- ordinate(GP, "CCA")
# Scree plot
plot_scree(gpca, "Scree Plot for Global Patterns Correspondence Analysis")

(p12 <- plot_ordination(GP, gpca, "samples", color="SampleType") + 
  geom_line() + geom_point(size=5) )

(p34 <- plot_ordination(GP, gpca, "samples", axes=c(3, 4), color="SampleType") + 
  geom_line() + geom_point(size=5) )

第一张图说明粪便和模拟数据紧密地聚集在一起，远离 CA1 上的其他样本。在 CA2 上，皮肤和舌头样本也类似地分开。总而言之，前两个轴已经解释了人类相关“环境”与数据集中其他非人类环境的分离，以及舌头和皮肤样本与粪便的分离。

我们现在将进一步研究这种数据结构，使我们能够比较单个分类群在相同图形空间上的贡献度：

p1  <- plot_ordination(GP, gpca, "species", color="Phylum")
(p1 <- ggplot(p1$data, p1$mapping) + geom_point(size=5, alpha=0.5) + 
  facet_wrap(~Phylum) +  scale_colour_hue(guide = FALSE) )

我们还可以将物种点总结为 2D 密度图：

(p3 <- ggplot(p1$data, p1$mapping) + geom_density2d() +
  facet_wrap(~Phylum) +  scale_colour_hue(guide = FALSE) )

上面的图表展示了一些有用的模式和有趣的异常值，但如果我们想要具体了解这些门在多大程度上影响每个轴呢？下面的代码使用了箱线图来展示这个结果。

library("reshape2")
# Melt the species-data.frame, DF, to facet each CA axis separately
mdf <- melt(p1$data[, c("CA1", "CA2", "Phylum", "Family", "Genus")], 
            id=c("Phylum", "Family", "Genus") )
# Select some special outliers for labelling
LF <- subset(mdf, variable=="CA2" & value < -1.0)
# build plot: boxplot summaries of each CA-axis, with labels
p <- ggplot(mdf, aes(Phylum, value, color=Phylum)) + 
  geom_boxplot() + 
  facet_wrap(~variable, 2) + 
  scale_colour_hue(guide = FALSE) +
  theme_bw() + 
  theme( axis.text.x = element_text(angle = -90, vjust = 0.5) )
# Add the text label layer, and render ggplot graphic
(p <- p + geom_text(data=subset(LF, !is.na(Family)),
  mapping = aes(Phylum, value+0.1, color=Phylum, label=Family), 
  vjust=0,
  size=2))

通过这种方法，更容易看到相对于其他门的异常聚集的特定物种，例如上图中的拟杆菌（Bacteroidetes）属的 Prevotellaceae，在舌/皮肤样品中负 CA2 方向贡献最多。

另一种方法是直接展示物种丰度来说明菌群的差异变化，使用前面提及的plot_bar函数：

plot_bar(GP, x="human", fill="SampleType", facet_grid= ~ Phylum)

从上面这个图中也可以看出一些模式：

Cyanobacteria 和 Actinobacteria 在人类样品较少；
Firmicutes 在人类样本中较多；
Acidobacteria 和 Verrucomicrobia 在粪便样本中较多。

根据之前的研究，这些结果并不是非常令人惊讶，但希望它可以应用于你所拥有的其他数据集中。

DPCoA 双主坐标分析

下面的示例展示了使用了phyloseq中的plot_ordination()函数来进行双主坐标分析：

# Perform ordination
GP.dpcoa <- ordinate(GP, "DPCoA") 
# Generate default ordination bi-plot
pdpcoa <- 
  plot_ordination(
    physeq = GP, 
    ordination = GP.dpcoa, 
    type="biplot",
    color="SampleType", 
    shape="Phylum")
# Adjust the shape scale manually 
# to make taxa hollow and samples filled (advanced)
shape.fac <- pdpcoa$data$Phylum
man.shapes <- c(19, 21:25)
names(man.shapes) <- c("Samples", levels(shape.fac)[levels(shape.fac)!="Samples"])
p2dpcoa <- pdpcoa + scale_shape_manual(values=man.shapes)
p2dpcoa

# Show just Samples or just Taxa
plot_ordination(GP, GP.dpcoa, type="taxa", shape="Phylum")

plot_ordination(GP, GP.dpcoa, type="samples", color="SampleType")

# Split
plot_ordination(GP, GP.dpcoa, type="split",
                color="SampleType", shape="Phylum") +
  ggplot2::scale_colour_discrete()

聚类分析

层级聚类（例如，hclust）是可视化样本距离矩阵的另一种流行的方式。在下面的示例中，我们使用了 unweighted UniFrac 距离矩阵和 UPGMA 方法（hclust参数： method="average"）：

# (Re)load UniFrac distance matrix and GlobalPatterns data
data(GlobalPatterns)
load(system.file("doc", "Unweighted_UniFrac.RData", package="phyloseq"))
# Manually define color-shading vector based on sample type.
colorScale    <- rainbow(length(levels(get_variable(GlobalPatterns, "SampleType"))))
cols          <- colorScale[get_variable(GlobalPatterns, "SampleType")] 
GP.tip.labels <- as(get_variable(GlobalPatterns, "SampleType"), "character")
# This is the actual hierarchical clustering call, specifying average-link clustering
GP.hclust     <- hclust(GPUF, method="average")
plot(GP.hclust, col=cols)