单细胞入门之多样本整合
💡专注R语言在🩺生物医学中的使用
上一篇讲了单个样本的单细胞标准分析流程,但是通常我们做实验不会只测一个样本的,这时候就需要把多个样本整合起来。
下面是3种常见的整合方法,不过都是从Seurat
对象开始的,如果你是从GEO等数据库下载的数据,需要先自己创建Seurat
对象,然后数据质控,选择合适的细胞,再进行整合。
目录:
多个单细胞数据集的整合之CCA
整合之后的分析
多个单细胞数据集的整合之CCA+SCTransform
多个单细胞数据集的整合之harmony
参考资料
多个单细胞数据集的整合之CCA
library(Seurat)
## Attaching SeuratObject
## Attaching sp
library(SeuratData)
## ── Installed datasets ─────────────────────────── SeuratData v0.2.2 ──
## ✔ ifnb 3.1.0
## ───────────────────────────────── Key ────────────────────────────────
## ✔ Dataset loaded successfully
## ❯ Dataset built with a newer version of Seurat than installed
## ❓ Unknown version of Seurat installed
library(patchwork)
这里我们使用SeuratData
包自带的数据。这个数据有两个样本,一个是对照组,另一个是使用了干扰素刺激的。
# 安装数据,如果不成功可以下载后本地安装,报错信息里有下载地址
InstallData("ifnb")
使用示例数据集,如果是从表达矩阵开始,就是之前介绍过的,需要自己建立seurat
对象。
# 载入数据
ifnb <- LoadData("ifnb")
# 拆分数据,变为两个seurat对象
ifnb.list <- SplitObject(ifnb, split.by = "stim")
ifnb.list
## $CTRL
## An object of class Seurat
## 14053 features across 6548 samples within 1 assay
## Active assay: RNA (14053 features, 0 variable features)
##
## $STIM
## An object of class Seurat
## 14053 features across 7451 samples within 1 assay
## Active assay: RNA (14053 features, 0 variable features)
现在我们有了两个seurat
对象,需要进行合并。
下面使用seurat
的合并方法:CCA。大家可以去百度了解下技术原理,这里就不多说了。
# 首先对每个对象单独进行标准化及找出高变基因
ifnb.list <- lapply(X = ifnb.list, FUN = function(x) {
x <- NormalizeData(x)
x <- FindVariableFeatures(x, selection.method = "vst", nfeatures = 2000)
})
# 选择相同的高变基因
features <- SelectIntegrationFeatures(object.list = ifnb.list)
然后进行整合,这一步比较费时间~
# 找锚点
immune.anchors <- FindIntegrationAnchors(object.list = ifnb.list, anchor.features = features)
## Scaling features for provided objects
## Finding all pairwise anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
## Found 16393 anchors
## Filtering anchors
## Retained 6756 anchors
# 整合
immune.combined <- IntegrateData(anchorset = immune.anchors)
## Merging dataset 1 into 2
## Extracting anchors for merged samples
## Finding integration vectors
## Finding integration vector weights
## Integrating data
整合后的数据是这样的:
immune.combined
## An object of class Seurat
## 16053 features across 13999 samples within 2 assays
## Active assay: integrated (2000 features, 2000 variable features)
## 1 other assay present: RNA
整合之后的分析
就是之前介绍的标准分析流程。
# 选择整合好之后的数据进行下游分析,原始数据也在这个对象中,可以试试看两种数据的差别
DefaultAssay(immune.combined) <- "integrated"
#DefaultAssay(immune.combined) <- "RNA" # 原始数据
# 就是之前介绍的标准流程!中间很多可视化的过程省略了
immune.combined <- ScaleData(immune.combined, verbose = FALSE)
immune.combined <- RunPCA(immune.combined, npcs = 30, verbose = FALSE)
immune.combined <- RunUMAP(immune.combined, reduction = "pca", dims = 1:30)
## 17:08:36 UMAP embedding parameters a = 0.9922 b = 1.112
## 17:08:36 Read 13999 rows and found 30 numeric columns
## 17:08:36 Using Annoy for neighbor search, n_neighbors = 30
## 17:08:36 Building Annoy index with metric = cosine, n_trees = 50
## 0% 10 20 30 40 50 60 70 80 90 100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 17:08:37 Writing NN index file to temp file C:\Users\liyue\AppData\Local\Temp\RtmpKY1e9T\file22a41045c37
## 17:08:37 Searching Annoy index using 1 thread, search_k = 3000
## 17:08:39 Annoy recall = 100%
## 17:08:40 Commencing smooth kNN distance calibration using 1 thread
## 17:08:40 Initializing from normalized Laplacian + noise
## 17:08:41 Commencing optimization for 200 epochs, with 617860 positive edges
## 17:08:53 Optimization finished
immune.combined <- FindNeighbors(immune.combined, reduction = "pca", dims = 1:30)
## Computing nearest neighbor graph
## Computing SNN
immune.combined <- FindClusters(immune.combined, resolution = 0.5)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 13999
## Number of edges: 569052
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9057
## Number of communities: 15
## Elapsed time: 1 seconds
保存下结果,然后就可以进行各种后续的分析了,比如差异分析、细胞亚群注释、细胞通讯、拟时序分析、转录因子等等。
saveRDS(immune.combined, file = "../000files/immune_combined.rds")
看看有没有批次效应和降维结果:
# 降维结果
p1 <- DimPlot(immune.combined, reduction = "umap", group.by = "stim")
p2 <- DimPlot(immune.combined, reduction = "umap", label = TRUE, repel = TRUE)
p1 + p2
然后是Identify conserved cell type markers
“FindMarkers()寻找的是指定两组细胞之间差异表达的基因;而FindConservedMarkers()分析的是某一个ident群两组细胞之间都保守表达的基因。
# 使用原数据
DefaultAssay(immune.combined) <- "RNA"
nk.markers <- FindConservedMarkers(immune.combined,
ident.1 = 6,
grouping.var = "stim",
verbose = FALSE)
head(nk.markers)
## CTRL_p_val CTRL_avg_log2FC CTRL_pct.1 CTRL_pct.2 CTRL_p_val_adj
## GNLY 0 6.006173 0.944 0.045 0
## FGFBP2 0 3.243588 0.505 0.020 0
## CLIC3 0 3.461957 0.597 0.024 0
## PRF1 0 2.650548 0.422 0.017 0
## CTSW 0 2.987507 0.531 0.029 0
## KLRD1 0 2.777231 0.495 0.019 0
## STIM_p_val STIM_avg_log2FC STIM_pct.1 STIM_pct.2 STIM_p_val_adj
## GNLY 0.000000e+00 5.858634 0.954 0.059 0.000000e+00
## FGFBP2 3.408448e-165 2.191113 0.261 0.015 4.789892e-161
## CLIC3 0.000000e+00 3.536367 0.623 0.030 0.000000e+00
## PRF1 0.000000e+00 4.094579 0.862 0.057 0.000000e+00
## CTSW 0.000000e+00 3.128054 0.592 0.035 0.000000e+00
## KLRD1 0.000000e+00 2.863797 0.552 0.027 0.000000e+00
## max_pval minimump_p_val
## GNLY 0.000000e+00 0
## FGFBP2 3.408448e-165 0
## CLIC3 0.000000e+00 0
## PRF1 0.000000e+00 0
## CTSW 0.000000e+00 0
## KLRD1 0.000000e+00 0
画图展示:
FeaturePlot(immune.combined,
features = c("CD3D", "SELL", "CREM", "CD8A", "GNLY", "CD79A", "FCGR3A","CCL2", "PPBP"),
min.cutoff = "q9") # 分位数
重命名类群ID,重新画图:
# 现在还都是数字,没有注释亚群
table(Idents(immune.combined))
immune.combined <- RenameIdents(immune.combined,
`0` = "CD14 Mono",
`1` = "CD4 Naive T",
`2` = "CD4 Memory T",
`3` = "CD16 Mono",
`4` = "B",
`5` = "CD8 T",
`6` = "NK",
`7` = "T activated",
`8` = "DC",
`9` = "B Activated",
`10` = "Mk",
`11` = "pDC",
`12` = "Eryth",
`13` = "Mono/Mk Doublets",
`14` = "HSPC"
)
DimPlot(immune.combined, label = TRUE)
气泡图展示:
Idents(immune.combined) <- factor(Idents(immune.combined),
levels = c("HSPC", "Mono/Mk Doublets","pDC", "Eryth", "Mk", "DC", "CD14 Mono", "CD16 Mono", "B Activated", "B", "CD8 T", "NK", "T activated","CD4 Naive T", "CD4 Memory T")
)
# 选择要展示的基因
markers.to.plot <- c("CD3D", "CREM", "HSPH1", "SELL", "GIMAP5", "CACYBP", "GNLY", "NKG7", "CCL5","CD8A", "MS4A1", "CD79A", "MIR155HG", "NME1", "FCGR3A", "VMO1", "CCL2", "S100A9", "HLA-DQA1","GPR183", "PPBP", "GNG11", "HBA2", "HBB", "TSPAN13", "IL3RA", "IGJ", "PRSS57")
DotPlot(immune.combined, features = markers.to.plot, cols = c("blue", "red"), dot.scale = 8, split.by = "stim") +
RotatedAxis()
上面是找conserved gene,下面才是常见的找差异基因。
# 先展示下某些基因在两组中的表达量
library(ggplot2)
library(cowplot)
theme_set(theme_cowplot())
# 选择CD4 Naive T这个亚群继续探索
t.cells <- subset(immune.combined, idents = "CD4 Naive T")
Idents(t.cells) <- "stim" # 亚群名字改一下
# 表达矩阵预处理一下
avg.t.cells <- as.data.frame(log1p(AverageExpression(t.cells, verbose = FALSE)$RNA))
# 行名变成一列
avg.t.cells$gene <- rownames(avg.t.cells)
# 再选择CD14 Mono这个亚群
cd14.mono <- subset(immune.combined, idents = "CD14 Mono")
Idents(cd14.mono) <- "stim"
avg.cd14.mono <- as.data.frame(log1p(AverageExpression(cd14.mono, verbose = FALSE)$RNA))
avg.cd14.mono$gene <- rownames(avg.cd14.mono)
# 挑选要展示的基因,挑选的这几个都是对干扰素刺激有明显变化的基因(这需要平时自己积累)
genes.to.label = c("ISG15", "LY6E", "IFI6", "ISG20", "MX1", "IFIT2", "IFIT1", "CXCL10", "CCL8")
# 画图
p1 <- ggplot(avg.t.cells, aes(CTRL, STIM)) +
geom_point() +
ggtitle("CD4 Naive T Cells")
p1 <- LabelPoints(plot = p1,
points = genes.to.label,
repel = TRUE)
p2 <- ggplot(avg.cd14.mono, aes(CTRL, STIM)) +
geom_point() +
ggtitle("CD14 Monocytes")
p2 <- LabelPoints(plot = p2,
points = genes.to.label,
repel = TRUE)
p1 + p2
上面这幅图可以看出,我们挑选的这几个基因在stim组是高表达的,不管是CD4 Naive T亚群还是CD14 Mono都是如此。
假设现在我们已经确定在两组间的细胞类型都是一样的,现在我们想要找出在B细胞(B cell)中两组间(stim和ctrl)的差异基因。
# 首先增加一栏同时包含样本类型(stim和ctrl)和细胞类型的信息
immune.combined$celltype.stim <- paste(Idents(immune.combined),
immune.combined$stim,
sep = "_")
# 再增加一栏只包含细胞类型的信息
immune.combined$celltype <- Idents(immune.combined)
# 把idents换成celltype.stim这一栏信息
Idents(immune.combined) <- "celltype.stim"
# 这样就可以寻找任一细胞类型在两种组别(stim和ctrl)之间的差异基因了
b.interferon.response <- FindMarkers(immune.combined,
ident.1 = "B_STIM",
ident.2 = "B_CTRL",
verbose = FALSE)
head(b.interferon.response, n = 15)
## p_val avg_log2FC pct.1 pct.2 p_val_adj
## IFIT1 3.210958e-189 4.803316 1.000 0.093 4.512360e-185
## ISG15 3.942443e-178 5.231845 1.000 0.486 5.540315e-174
## IFIT3 4.460609e-177 3.903763 0.993 0.312 6.268494e-173
## ISG20 3.273682e-176 3.836903 1.000 0.446 4.600505e-172
## IFITM3 1.865615e-173 3.018934 1.000 0.647 2.621748e-169
## IFIT2 3.125506e-170 3.738296 0.973 0.167 4.392273e-166
## IFI6 7.326993e-170 2.992315 0.998 0.374 1.029662e-165
## LY6E 2.335359e-169 3.007485 1.000 0.417 3.281880e-165
## RSAD2 2.442390e-169 3.648253 0.955 0.081 3.432291e-165
## TNFSF10 6.793514e-168 3.187577 1.000 0.483 9.546925e-164
## MX1 6.173675e-166 3.232834 0.973 0.165 8.675865e-162
## APOBEC3A 2.987551e-164 3.483479 0.996 0.403 4.198406e-160
## CXCL10 4.695421e-160 4.444529 0.986 0.275 6.598475e-156
## OASL 1.975803e-158 3.065838 0.957 0.182 2.776596e-154
## IRF7 4.890417e-149 2.576510 0.980 0.463 6.872503e-145
这些基因和我们在上一步中挑选的基因是不是基本一样?
画图展示,可以看出CD3D和GNLY在两组间表达差异不大,而IF16差异很大。
FeaturePlot(immune.combined,
features = c("CD3D", "GNLY", "IFI6"), # CD3D和GNLY对干扰素没反应哦~
split.by = "stim",
max.cutoff = 3,
cols = c("grey", "red"))
小提琴图也可以展示类似的效果:
plots <- VlnPlot(immune.combined, features = c("LYZ", "ISG15", "CXCL10"), split.by = "stim", group.by = "celltype",
pt.size = 0, combine = FALSE)
## The default behaviour of split.by has changed.
## Separate violin plots are now plotted side-by-side.
## To restore the old behaviour of a single split violin,
## set split.plot = TRUE.
##
## This message will be shown once per session.
wrap_plots(plots = plots, ncol = 1)
多个单细胞数据集的整合之CCA+SCTransform
和上面的步骤基本一样,就是把NormalizeData/FindVariableFeatures/ScaleData
合成一步:SCTransform
。
rm(list = ls())
ifnb <- LoadData("ifnb")
ifnb.list <- SplitObject(ifnb, split.by = "stim")
# 这一步代替了NormalizeData/FindVariableFeatures/ScaleData,非常费时间
ifnb.list <- lapply(X = ifnb.list, FUN = SCTransform)
## Calculating cell attributes from input UMI matrix: log_umi
## Variance stabilizing transformation of count matrix of size 12747 by 6548
## Model formula is y ~ log_umi
## Get Negative Binomial regression parameters per gene
## Using 2000 genes, 5000 cells
##
|======================================================================| 100%
## Found 77 outliers - those will be ignored in fitting/regularization step
## Second step: Get residuals using fitted parameters for 12747 genes
##
|======================================================================| 100%
## Computing corrected count matrix for 12747 genes
##
|======================================================================| 100%
## Calculating gene attributes
## Wall clock passed: Time difference of 1.038463 mins
## Determine variable features
## Place corrected count matrix in counts slot
## Centering data matrix
## Set default assay to SCT
## Calculating cell attributes from input UMI matrix: log_umi
## Variance stabilizing transformation of count matrix of size 12658 by 7451
## Model formula is y ~ log_umi
## Get Negative Binomial regression parameters per gene
## Using 2000 genes, 5000 cells
##
|======================================================================| 100%
## Found 73 outliers - those will be ignored in fitting/regularization step
## Second step: Get residuals using fitted parameters for 12658 genes
##
|======================================================================| 100%
## Computing corrected count matrix for 12658 genes
##
|======================================================================| 100%
## Calculating gene attributes
## Wall clock passed: Time difference of 1.075562 mins
## Determine variable features
## Place corrected count matrix in counts slot
## Centering data matrix
## Set default assay to SCT
features <- SelectIntegrationFeatures(object.list = ifnb.list, nfeatures = 3000)
ifnb.list <- PrepSCTIntegration(object.list = ifnb.list, anchor.features = features)
然后也是找锚点,整合数据:
immune.anchors <- FindIntegrationAnchors(object.list = ifnb.list, normalization.method = "SCT",
anchor.features = features)
## Finding all pairwise anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
## Found 14831 anchors
## Filtering anchors
## Retained 11318 anchors
immune.combined.sct <- IntegrateData(anchorset = immune.anchors, normalization.method = "SCT")
## Merging dataset 1 into 2
## Extracting anchors for merged samples
## Finding integration vectors
## Finding integration vector weights
## Integrating data
再往下就是一模一样的步骤了:
# SCTransform包含scale,这里就不用再次scale了
immune.combined.sct <- RunPCA(immune.combined.sct, verbose = FALSE) |>
FindNeighbors(dims = 1:15) |>
FindClusters(resolution = 0.5) |>
RunUMAP(dims=1:15) |>
RunTSNE(dims=1:15)
## 17:13:19 UMAP embedding parameters a = 0.9922 b = 1.112
## 17:13:19 Read 13999 rows and found 30 numeric columns
## 17:13:19 Using Annoy for neighbor search, n_neighbors = 30
## 17:13:19 Building Annoy index with metric = cosine, n_trees = 50
## 0% 10 20 30 40 50 60 70 80 90 100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 17:13:21 Writing NN index file to temp file C:\Users\liyue\AppData\Local\Temp\RtmpKY1e9T\file22a46a83352
## 17:13:21 Searching Annoy index using 1 thread, search_k = 3000
## 17:13:23 Annoy recall = 100%
## 17:13:23 Commencing smooth kNN distance calibration using 1 thread
## 17:13:24 Initializing from normalized Laplacian + noise
## 17:13:24 Commencing optimization for 200 epochs, with 601594 positive edges
## 17:13:37 Optimization finished
画图展示效果:
p1 <- DimPlot(immune.combined.sct, reduction = "umap", group.by = "stim")
p2 <- DimPlot(immune.combined.sct, reduction = "umap", group.by = "seurat_annotations", label = TRUE,repel = TRUE)
p1 + p2
多个单细胞数据集的整合之harmony
rm(list = ls())
ifnb <- LoadData("ifnb")
ifnb.list <- SplitObject(ifnb, split.by = "stim")
先使用merge()
合并。
ifnb.merge <- merge(x=ifnb.list[[1]],
y=ifnb.list[-1],
add.cell.ids= names(ifnb.list),
project="ifnb.harmoy")
head(colnames(ifnb.merge))
## [1] "CTRL_AAACATACATTTCC.1" "CTRL_AAACATACCAGAAA.1" "CTRL_AAACATACCTCGCT.1"
## [4] "CTRL_AAACATACCTGGTA.1" "CTRL_AAACATACGATGAA.1" "CTRL_AAACATACGGCATT.1"
接下来也是NormalizeData/FindVariableFeatures/ScaleData,这几步用SCTransform
也是可以的:
# 这一步用SCT代替也可以
ifnb.merge <- ifnb.merge |>
Seurat::NormalizeData() |>
FindVariableFeatures(selection.method = "vst", nfeatures = 2000) |>
ScaleData() |>
RunPCA()
## Centering and scaling data matrix
然后再进行harmony整合样本:
library(harmony)
## Loading required package: Rcpp
ifnb.merge <- RunHarmony(ifnb.merge,
group.by.vars = "stim")
## Harmony 1/10
## Harmony 2/10
## Harmony 3/10
## Harmony 4/10
## Harmony 5/10
## Harmony 6/10
## Harmony 7/10
## Harmony 8/10
## Harmony converged after 8 iterations
names(ifnb.merge@reductions)
## [1] "pca" "harmony"
后面又是标准步骤,聚类,分群。
ifnb.merge <- RunUMAP(ifnb.merge, dims = 1:15, reduction = "harmony") |>
FindNeighbors(reduction = "harmony", dims = 1:15) |>
FindClusters(resolution = 0.5) |>
RunTSNE(dims=1:15,reduction = "harmony")
## 17:14:09 UMAP embedding parameters a = 0.9922 b = 1.112
## 17:14:09 Read 13999 rows and found 15 numeric columns
## 17:14:09 Using Annoy for neighbor search, n_neighbors = 30
## 17:14:09 Building Annoy index with metric = cosine, n_trees = 50
## 0% 10 20 30 40 50 60 70 80 90 100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 17:14:10 Writing NN index file to temp file C:\Users\liyue\AppData\Local\Temp\RtmpKY1e9T\file22a4682a6add
## 17:14:10 Searching Annoy index using 1 thread, search_k = 3000
## 17:14:12 Annoy recall = 100%
## 17:14:13 Commencing smooth kNN distance calibration using 1 thread
## 17:14:13 Initializing from normalized Laplacian + noise
## 17:14:14 Commencing optimization for 200 epochs, with 597860 positive edges
## 17:14:26 Optimization finished
画图展示效果:
p1 <- DimPlot(ifnb.harmony, reduction = "tsne", group.by = "stim")
p2 <- DimPlot(ifnb.harmony, reduction = "tsne", group.by = "seurat_annotations", label = TRUE,repel = TRUE)
p1 + p2
参考资料
Seurat官网 生信技能树、单细胞天地 菲沙基因单细胞培训课程
获取更多信息,欢迎加入🐧QQ交流群:613637742
“医学和生信笔记,专注R语言在临床医学中的使用、R语言数据分析和可视化。主要分享R语言做医学统计学、meta分析、网络药理学、临床预测模型、机器学习、生物信息学等。
往期回顾
ggplot2可视化拷贝数变异CNV的GISTIC score
R语言和医学统计学系列:样本量计算
长数据变为宽数据的7种情况!
宽数据变为长数据的5种情况!
使用scales自定义图形标签和刻度