Seurat 官网单细胞教程一 (基础教程)
见不得人间疾苦
1引言
Seurat 包已经更新到 4.0.6 版本了,可以说是非常快速,当然其功能也是越来越强大了。对于单细胞刚入门的孩子们来说,也许不知从何开始, 学习官网推出的教程还是非常有必要的! 后面会陆续分享 Seurat 官网的一系列教程。
这篇是单个数据集的常规基础分析,内容包括一下几个方面:
1.Setup the Seurat Object 构建seurat对象
2.Standard pre-processing workflow 常规处理
3.Normalizing the data 归一化
4.Identification of highly variable features (feature selection) 鉴定高变基因
5.Scaling the data 标准化
6.Perform linear dimensional reduction 线性降维
7.Determine the ‘dimensionality’ of the dataset 确定数据维度
8.Cluster the cells 细胞聚类
9.Run non-linear dimensional reduction (UMAP/tSNE) 非线性降维
10.Finding differentially expressed features (cluster biomarkers) 寻找亚群 marker 基因
11.Assigning cell type identity to clusters 确定细胞类型
下面就是这些每个过程的具体演示代码。
参考文档地址:
https://satijalab.org/seurat/articles/pbmc3k_tutorial.html
2构建 seurat 对象
10X Genomics 测序平台上游 cellranger
软件会产生三个文件,包括基因,细胞和表达矩阵,我们需要使用 Read10X() 函数进行读取, 测试数据来自外周血单核细胞的单细胞测序,包含 2700 个细胞,数据下载链接如下:
https://cf.10xgenomics.com/samples/cell/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz
library(dplyr)
library(Seurat)
library(patchwork)
# Load the PBMC dataset
# 读取数据
pbmc.data <- Read10X(data.dir = "../data/pbmc3k/filtered_gene_bc_matrices/hg19/")
# Initialize the Seurat object with the raw (non-normalized data).
# 创建对象
pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k",
min.cells = 3, min.features = 200)
pbmc
## An object of class Seurat
## 13714 features across 2700 samples within 1 assay
## Active assay: RNA (13714 features, 0 variable features)
min.cells 过滤少于在 3 个细胞中表达的基因, min.features 过滤表达小于 200 个基因的细胞。
查看矩阵:
# Lets examine a few genes in the first thirty cells
pbmc.data[c("CD3D", "TCL1A", "MS4A1"), 1:30]
## 3 x 30 sparse Matrix of class "dgCMatrix"
##
## CD3D 4 . 10 . . 1 2 3 1 . . 2 7 1 . . 1 3 . 2 3 . . . . . 3 4 1 5
## TCL1A . . . . . . . . 1 . . . . . . . . . . . . 1 . . . . . . . .
## MS4A1 . 6 . . . . . . 1 1 1 . . . . . . . . . 36 1 2 . . 2 . . . .
矩阵列为细胞
,行为基因名
。
3数据预处理
这步基本是过滤一些低质量的细胞(线粒体污染或者死细胞)
或者基因(双细胞,多细胞或者空液滴)
:
# The [[ operator can add columns to object metadata. This is a great place to stash QC stats
# 计算每个细胞的线粒体含量
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
# 小鼠
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^mt-")
绘图展示:
# Visualize QC metrics as a violin plot
VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)
上面分别展示了每个细胞的 基因数量, counts 数量 及 线粒体含量。
两个指标之间的散点图:
# FeatureScatter is typically used to visualize feature-feature relationships, but can be used
# for anything calculated by the object, i.e. columns in object metadata, PC scores etc.
plot1 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt")
plot2 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
plot1 + plot2
过滤:
我们过滤基因表达大于 200 小于 2500,且线粒体含量低于 0.05 的细胞:
pbmc <- subset(pbmc,
subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
4归一化
因为每个细胞 测序深度不一样, 测序总 reads 数量不一样,所以需要进行归一化来去除测序深带来的差异,使样本间具有可比性。
默认使用 LogNormalize
方法进行归一化,使用 pbmc[["RNA"]]@data 可以获取归一化的数据:
pbmc <- NormalizeData(pbmc,
normalization.method = "LogNormalize",
scale.factor = 10000)
# OR default args
pbmc <- NormalizeData(pbmc)
5鉴定高变基因(特征选择)
此步是为了寻找在不同细胞间变化较大的基因(高表达或低表达)
:
pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
# Identify the 10 most highly variable genes
# 获取高变基因
top10 <- head(VariableFeatures(pbmc), 10)
# plot variable features with and without labels
# 可视化
plot1 <- VariableFeaturePlot(pbmc)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)
plot1 + plot2
6标准化
为了使基因在不同细胞间更好的比较和方便下游可视化,需要对归一化的数据进行标准化(Z-score)
处理,使其均值为 0,方差为 1
,使用 pbmc[["RNA"]]@scale.data 可获取标准化的数据:
# 对所有基因进行标准化
all.genes <- rownames(pbmc)
pbmc <- ScaleData(pbmc, features = all.genes)
# 对高变基因进行标准化,影响热图可视化,不影响PCA降维
pbmc <- ScaleData(pbmc)
去除其它影响因素:
vars.to.regress 参数可以去除指定因素的影响,比如线粒体含量
,核糖体基因含量
,细胞周期
等等:
pbmc <- ScaleData(pbmc, vars.to.regress = "percent.mt")
7线性降维
对高变基因进行线性降维,降低数据分析量:
pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))
# Examine and visualize PCA results a few different ways
# 查看结果
print(pbmc[["pca"]], dims = 1:5, nfeatures = 5)
## PC_ 1
## Positive: CST3, TYROBP, LST1, AIF1, FTL
## Negative: MALAT1, LTB, IL32, IL7R, CD2
## PC_ 2
## Positive: CD79A, MS4A1, TCL1A, HLA-DQA1, HLA-DQB1
## Negative: NKG7, PRF1, CST7, GZMB, GZMA
## PC_ 3
## Positive: HLA-DQA1, CD79A, CD79B, HLA-DQB1, HLA-DPB1
## Negative: PPBP, PF4, SDPR, SPARC, GNG11
## PC_ 4
## Positive: HLA-DQA1, CD79B, CD79A, MS4A1, HLA-DQB1
## Negative: VIM, IL7R, S100A6, IL32, S100A8
## PC_ 5
## Positive: GZMB, NKG7, S100A8, FGFBP2, GNLY
## Negative: LTB, IL7R, CKB, VIM, MS4A7
可视化降维结果:
VizDimLoadings(pbmc, dims = 1:2, reduction = "pca")
DimPlot(pbmc, reduction = "pca")
DimHeatmap(pbmc, dims = 1:15, cells = 500, balanced = TRUE)
8确定数据主成分维度
选取合适的维度进行后续分析, JackStrawPlot 函数可以找出显著性的主成分(虚线以上)
:
# NOTE: This process can take a long time for big datasets, comment out for expediency. More
# approximate techniques such as those implemented in ElbowPlot() can be used to reduce
# computation time
pbmc <- JackStraw(pbmc, num.replicate = 100)
pbmc <- ScoreJackStraw(pbmc, dims = 1:20)
JackStrawPlot(pbmc, dims = 1:15)
或者使用拐点图寻找平滑线前面的部分:
ElbowPlot(pbmc)
9细胞聚类
Seurat v3 使用了基于图形聚类的方法,基于SNN
和KNN
的算法进行聚类:
pbmc <- FindNeighbors(pbmc, dims = 1:10)
pbmc <- FindClusters(pbmc, resolution = 0.5)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 2638
## Number of edges: 95927
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8728
## Number of communities: 9
## Elapsed time: 0 seconds
查看聚类结果:
# Look at cluster IDs of the first 5 cells
head(Idents(pbmc), 5)
## AAACATACAACCAC-1 AAACATTGAGCTAC-1 AAACATTGATCAGC-1 AAACCGTGCTTCCG-1
## 2 3 2 1
## AAACCGTGTATGCG-1
## 6
## Levels: 0 1 2 3 4 5 6 7 8
10UMAP/tSNE 非线性降维
seurat 还提供了 UMAP/tSNE
的非线性降维的方法来可视化数据集:
# If you haven't installed UMAP, you can do so via reticulate::py_install(packages =
# 'umap-learn')
pbmc <- RunUMAP(pbmc, dims = 1:10)
# note that you can set `label = TRUE` or use the LabelClusters function to help label
# individual clusters
DimPlot(pbmc, reduction = "umap")
# SAVE
saveRDS(pbmc, file = "../output/pbmc_tutorial.rds")
11寻找 Marker 基因
寻找亚群 2 细胞的特异性表达基因:
# find all markers of cluster 2
cluster2.markers <- FindMarkers(pbmc, ident.1 = 2, min.pct = 0.25)
head(cluster2.markers, n = 5)
## p_val avg_log2FC pct.1 pct.2 p_val_adj
## IL32 2.892340e-90 1.2013522 0.947 0.465 3.966555e-86
## LTB 1.060121e-86 1.2695776 0.981 0.643 1.453850e-82
## CD3D 8.794641e-71 0.9389621 0.922 0.432 1.206097e-66
## IL7R 3.516098e-68 1.1873213 0.750 0.326 4.821977e-64
## LDHB 1.642480e-67 0.8969774 0.954 0.614 2.252497e-63
一个亚群和多个亚群差异分析:
# find all markers distinguishing cluster 5 from clusters 0 and 3
cluster5.markers <- FindMarkers(pbmc, ident.1 = 5, ident.2 = c(0, 3), min.pct = 0.25)
head(cluster5.markers, n = 5)
## p_val avg_log2FC pct.1 pct.2 p_val_adj
## FCGR3A 8.246578e-205 4.261495 0.975 0.040 1.130936e-200
## IFITM3 1.677613e-195 3.879339 0.975 0.049 2.300678e-191
## CFD 2.401156e-193 3.405492 0.938 0.038 3.292945e-189
## CD68 2.900384e-191 3.020484 0.926 0.035 3.977587e-187
## RP11-290F20.3 2.513244e-186 2.720057 0.840 0.017 3.446663e-182
寻找每个亚群的差异基因,保留上调的:
# find markers for every cluster compared to all remaining cells, report only the positive
# ones
pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE,
min.pct = 0.25,
logfc.threshold = 0.25)
# 查看每个亚群的前两个
pbmc.markers %>%
group_by(cluster) %>%
slice_max(n = 2, order_by = avg_log2FC)
## # A tibble: 18 × 7
## # Groups: cluster [9]
## p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <chr>
## 1 9.57e- 88 1.36 0.447 0.108 1.31e- 83 0 CCR7
## 2 3.75e-112 1.09 0.912 0.592 5.14e-108 0 LDHB
## 3 0 5.57 0.996 0.215 0 1 S100A9
## 4 0 5.48 0.975 0.121 0 1 S100A8
## 5 1.06e- 86 1.27 0.981 0.643 1.45e- 82 2 LTB
## 6 2.97e- 58 1.23 0.42 0.111 4.07e- 54 2 AQP3
## 7 0 4.31 0.936 0.041 0 3 CD79A
## 8 9.48e-271 3.59 0.622 0.022 1.30e-266 3 TCL1A
## 9 5.61e-202 3.10 0.983 0.234 7.70e-198 4 CCL5
## 10 7.25e-165 3.00 0.577 0.055 9.95e-161 4 GZMK
## 11 3.51e-184 3.31 0.975 0.134 4.82e-180 5 FCGR3A
## 12 2.03e-125 3.09 1 0.315 2.78e-121 5 LST1
## 13 3.13e-191 5.32 0.961 0.131 4.30e-187 6 GNLY
## 14 7.95e-269 4.83 0.961 0.068 1.09e-264 6 GZMB
## 15 1.48e-220 3.87 0.812 0.011 2.03e-216 7 FCER1A
## 16 1.67e- 21 2.87 1 0.513 2.28e- 17 7 HLA-DPB1
## 17 1.92e-102 8.59 1 0.024 2.63e- 98 8 PPBP
## 18 9.25e-186 7.29 1 0.011 1.27e-181 8 PF4
指定特定的差异算法:
cluster0.markers <- FindMarkers(pbmc, ident.1 = 0,
logfc.threshold = 0.25,
test.use = "roc",
only.pos = TRUE)
Marker 基因可视化:
VlnPlot(pbmc, features = c("MS4A1", "CD79A"))
使用 raw counts 数据:
# you can plot raw counts as well
VlnPlot(pbmc, features = c("NKG7", "PF4"),
slot = "counts", log = TRUE)
基因在聚类图上展现:
FeaturePlot(pbmc, features = c("MS4A1", "GNLY",
"CD3E", "CD14",
"FCER1A", "FCGR3A",
"LYZ", "PPBP",
"CD8A"))
marker 基因热图:
pbmc.markers %>%
group_by(cluster) %>%
top_n(n = 10, wt = avg_log2FC) -> top10
DoHeatmap(pbmc, features = top10$gene) + NoLegend()
12细胞重命名
当我们根据每个亚群的 marker 基因表达,结合数据库注释信息及自己的生物学经验确定好亚群的细胞类群后,可以进行重命名:
new.cluster.ids <- c("Naive CD4 T", "CD14+ Mono", "Memory CD4 T", "B", "CD8 T", "FCGR3A+ Mono",
"NK", "DC", "Platelet")
names(new.cluster.ids) <- levels(pbmc)
pbmc <- RenameIdents(pbmc, new.cluster.ids)
# plot
DimPlot(pbmc, reduction = "umap", label = TRUE, pt.size = 0.5) +
NoLegend()
最后保存数据:
saveRDS(pbmc, file = "../output/pbmc3k_final.rds")
欢迎加入生信交流群。加我微信我也拉你进 微信群聊 老俊俊生信交流群
(微信交流群满200人后需收取20元入群费用)。
群二维码:
老俊俊微信:
知识星球:
所以今天你学习了吗?
今天的分享就到这里了,敬请期待下一篇!
最后欢迎大家分享转发,您的点赞是对我的鼓励和肯定!
如果觉得对您帮助很大,赏杯快乐水喝喝吧!
往期回顾
◀Molecular Cell 文章 ribosome pausing 结果复现 (终)
◀Molecular Cell 文章 ribosome pausing 结果复现 (四)
◀Molecular Cell 文章 ribosome pausing 结果复现 (三)
◀..