单细胞入门之细胞类型鉴定
💡专注R语言在🩺生物医学中的使用
今天主要学习细胞类型鉴定的各种常见方法。
不同类型的细胞、不同阶段的细胞等,表达的基因是有特异性的,可以根据某些特定表达的基因来推测细胞类型。
单细胞很多后续的分析都是基于感兴趣的细胞亚群进行的,所以细胞亚群注释就显得尤为重要!
Marker基因鉴定细胞类型
主要是通过一些数据库和自己阅读文献积累的不同细胞类型的marker进行注释。记住常见的marker,看一眼就能分辨出常见的细胞类型!
一些常见的marker基因数据库:
CellMarker: http://xteam.xbio.top/CellMarker/ PanglaoDB: https://panglaodb.se/index.html HCL: https://db.cngb.org/HCL/# MCL: http://bis.zju.edu.cn/MCA/index.html Single Cell Expression Atlas: https://www.ebi.ac.uk/gxa/sc/home HumanCellAtlas: https://data.humancellatlas.org/ CancerSEA: http://biocc.hrbmu.edu.cn/CancerSEA/
除此之外,还可以通过读文献自己总结不同细胞类型的marker基因。
可以通过ReanmeIdents()
非常方便的把你注释好的细胞类型写入到seurat
对象中,这样细胞亚群的名字就不再是0,1,2,3...这些数字了!
下面是一个简单的小例子。
library(Seurat)
## Attaching SeuratObject
## Attaching sp
library(SingleR)
## Loading required package: SummarizedExperiment
## Loading required package: MatrixGenerics
## Loading required package: matrixStats
##
## Attaching package: 'MatrixGenerics'
## The following objects are masked from 'package:matrixStats':
##
## colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
## colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
## colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
## colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
## colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
## colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
## colWeightedMeans, colWeightedMedians, colWeightedSds,
## colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
## rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
## rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
## rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
## rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
## rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
## rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
## rowWeightedSds, rowWeightedVars
## Loading required package: GenomicRanges
## Loading required package: stats4
## Loading required package: BiocGenerics
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, basename, cbind, colnames,
## dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
## grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
## order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
## rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
## union, unique, unsplit, which.max, which.min
## Loading required package: S4Vectors
##
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:base':
##
## expand.grid, I, unname
## Loading required package: IRanges
##
## Attaching package: 'IRanges'
## The following object is masked from 'package:sp':
##
## %over%
## The following object is masked from 'package:grDevices':
##
## windows
## Loading required package: GenomeInfoDb
## Loading required package: Biobase
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
##
## Attaching package: 'Biobase'
## The following object is masked from 'package:MatrixGenerics':
##
## rowMedians
## The following objects are masked from 'package:matrixStats':
##
## anyMissing, rowMedians
##
## Attaching package: 'SummarizedExperiment'
## The following object is masked from 'package:SeuratObject':
##
## Assays
## The following object is masked from 'package:Seurat':
##
## Assays
rm(list = ls())
immune.combined <- readRDS(file = "../000files/immune_combined.rds")
DefaultAssay(immune.combined) <- "RNA"
immune.combined # 假设这里面类群名字现在是0,1,2,3...这些数字
## An object of class Seurat
## 16053 features across 13999 samples within 2 assays
## Active assay: RNA (14053 features, 0 variable features)
## 1 other assay present: integrated
## 2 dimensional reductions calculated: pca, umap
看看现在细胞亚群的名字,都是数字:
table(immune.combined$seurat_clusters)
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
## 4321 2499 1819 1152 1072 770 629 619 489 220 175 108 56 46 24
我们手动改一下细胞亚群的名字:
immune.combined <- RenameIdents(immune.combined,
'0'='type1',
'1'='type2',
'2'='type3',
'3'='type4',
'4'='type5',
'5'='type6',
'6'='type7',
'7'='type8',
'8'='type9',
'9'='type10',
'10'='type11',
'11'='type12',
'12'='type13',
'13'='type14',
'14'='type15'
)
immune.combined$seurat_clusters <- Idents(immune.combined)
把上面这个改好的seurat
对象保存下就好了~但是上面这种命名是没有意义的,实际上一定是根据某些基因的特异性表达来确定到底是哪种细胞类型的。
DimPlot(immune.combined,reduction = "umap",label = T)
是不是很简单?手动注释可操作性更强,如果你记得常见的细胞类型的Marker,手动注释可能是更好的选择~
使用R包进行细胞类型注释
比较常用的是SingleR
。
安装:
# 又是一个巨大的R包!
install.packages("BiocManager")
BiocManager::install("SingleR", version = "devel")
install.packages("devtools")
library(devtools)
install_github("LTLA/SingleR")
下载SingleR
数据集,很大可能你会下载失败,这时候你可以去单细胞天地搜索singleR数据库文件,文中有介绍如何搞定这7个数据集。
# 需要安装这个包
# BiocManager::install("celldex"),数据集目前在这个包里面
library(celldex)
# 下载之后保存,下次直接读取使用。
ref <- celldex::HumanPrimaryCellAtlasData() # 使用HumanPrimaryCellAtlasData()函数加载参考数据集
saveRDS(ref,file="./HumanPrimaryCellAtlas.rdata")
ref <- celldex::BlueprintEncodeData()
saveRDS(ref,file="./BlueprintEncode.rdata") #
ref <- celldex::DatabaseImmuneCellExpressionData()
saveRDS(ref,file="../000files/DatabaseImmuneCellExpression.rdata")
ref <- celldex::MonacoImmuneData()
saveRDS(ref,file="./MonacoImmune.rdata")
ref <- celldex::NovershternHematopoieticData()
saveRDS(ref,file="./NovershternHematopoietic.rdata")
ref <- celldex::MouseRNAseqData() # (鼠)
saveRDS(ref,file="./MouseRNAseq.rdata")
ref <- celldex::ImmGenData() # (鼠)
saveRDS(ref,file="./ImmGen.rdata")
下面是实操部分,就用之前得到的immune_combined.rds
。
library(Seurat)
library(SingleR)
rm(list = ls())
immune.combined <- readRDS(file = "../000files/immune_combined.rds")
DefaultAssay(immune.combined) <- "RNA"
immune.combined
## An object of class Seurat
## 16053 features across 13999 samples within 2 assays
## Active assay: RNA (14053 features, 0 variable features)
## 1 other assay present: integrated
## 2 dimensional reductions calculated: pca, umap
你看这是没有注释之前的图:
DimPlot(immune.combined, reduction = "umap", label = TRUE, repel = TRUE)
下面我们用SingleR
包注释细胞类型。
immune_singler <- GetAssayData(immune.combined,slot = "data") # 提取表达矩阵
clusters <- immune.combined$seurat_clusters # 获取现在的细胞类群名字
table(clusters)
## clusters
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
## 4321 2499 1819 1152 1072 770 629 619 489 220 175 108 56 46 24
加载保存好的ref_Human_all.RData
。
load(file = "../000files/SingleR_ref/ref_Human_all.RData")
用SingleR
包注释:
pred.immune <- SingleR(test = immune_singler, #你的表达矩阵
ref = ref_Human_all, # 你的注释文件
labels = ref_Human_all$label.fine,
#因为样本主要为免疫细胞(而不是全部细胞),因此设置为label.fine
method = "cluster",
clusters = clusters)
table(pred.immune$labels)
##
## B_cell:immature
## 1
## B_cell:Naive
## 1
## CMP
## 1
## Macrophage:monocyte-derived:M-CSF/IFNg
## 1
## Macrophage:monocyte-derived:M-CSF/IFNg/Pam3Cys
## 2
## Monocyte:anti-FcgRIIB
## 2
## NK_cell:IL2
## 1
## T_cell:CD4+
## 1
## T_cell:CD4+_central_memory
## 2
## T_cell:CD4+_Naive
## 1
## T_cell:CD8+
## 1
## T_cell:CD8+_naive
## 1
可以看到有些是没有注释出来的~
查看注释结果:
plotScoreHeatmap(pred.immune, clusters = pred.immune$labels)
注释好之后替换原有细胞类群名称
new.clusterID <- pred.immune$labels
names(new.clusterID) <- levels(immune.combined)
immune.combined.new <- RenameIdents(immune.combined,new.clusterID)
再画图就是新的名字了。
DimPlot(immune.combined.new, reduction = "umap", label = TRUE, repel = TRUE)
细胞类型鉴定还有非常多其他方法,大家选择合适的即可~
参考资料
生信技能树、单细胞天地 菲沙基因单细胞培训课程
获取更多信息,欢迎加入🐧QQ交流群:613637742
“医学和生信笔记,专注R语言在临床医学中的使用、R语言数据分析和可视化。主要分享R语言做医学统计学、meta分析、网络药理学、临床预测模型、机器学习、生物信息学等。
往期回顾
dplyr中的行操作
dplyr中的across操作
dplyr强大的分组汇总
使用dplyr进行数据分析:入门篇
超详细教程:修改ggplot2图例