Celltypist: a single-cell annotation tool that outperforms SingleR

赵小明777 生信菜鸟团 2022-08-17

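I first came across Celltypist through a 生信技能树 post, "使用 CellTypist 进行免疫细胞类型分类" (Classifying immune cell types with CellTypist). Last year it was still a preprint; this year it was published in Science as "Cross-tissue immune cell analysis reveals tissue-specific features in humans", DOI: 10.1126/science.abl5197.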

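Since last year I have tried many annotation methods (see "辅助单细胞注释的N种方法", N ways to assist single-cell annotation), and overall this tool's annotations are fairly accurate. Of course, even the best annotation tool still needs to be cross-validated against known cell markers.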

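Following the Science publication, Celltypist recently received a major update that includes refreshed reference datasets (models). Below I focus on setting up the Celltypist environment and on a worked example of using it from R.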

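Python users can refer to the Celltypist website and to 刘老师's post "CellTypist | Science 免疫细胞自动注释工具" (CellTypist | Science: an automated immune cell annotation tool).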

I. `Celltypist` environment setup

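The Celltypist GitHub repository is at https://github.com/Teichlab/celltypist
The authors have also built a web tool (https://www.celltypist.org/), and the website additionally lists markers for each cell type. The original post showed the usage tutorial for the online version here as an image, which is not preserved.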

Setting up the local version is simple; a single install command is enough:

#conda create -n celltypist python=3.8
#conda activate celltypist

## Two ways to install: pip or conda (use one of them)
#pip install celltypist
mamba install -c bioconda -c conda-forge celltypist

II. Downloading the `Celltypist` reference datasets, and what they contain

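Older versions of Celltypist shipped only 7 reference datasets (models); the latest version has 13. As with SingleR, a reference dataset generally only needs to be downloaded once and can then be reused.

We can download them from R: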

library(reticulate)
## Import the Python modules
scanpy = import("scanpy")
celltypist = import("celltypist")
pandas = import("pandas")
numpy = import("numpy")

#### Download the reference datasets
celltypist$models$download_models(force_update = TRUE)

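If the download is slow, you can fetch the models manually from https://www.celltypist.org/models. I have also uploaded all 13 reference datasets to a network drive; reply "单细胞" to the public account to get the link.

Which reference dataset should you choose?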

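Cells_Fetal_Lung.pkl, 146 cell types from human embryonic and fetal lungs;
Cells_Intestinal_Tract.pkl, 134 intestinal cell types from fetal, pediatric and adult human gut;
Cells_Lung_Airway.pkl, 78 cell populations from scRNA-seq of five locations of the human lungs and airways;
COVID19_Immune_Landscape.pkl, 64 immune subtypes from the lungs and blood of COVID-19 patients and healthy controls;
Developing_Mouse_Brain.pkl, 174 cell types from the embryonic mouse brain, from gastrulation to birth;
Healthy_COVID19_PBMC.pkl, 51 peripheral blood mononuclear cell types from healthy and COVID-19 individuals;
Human_Lung_Atlas.pkl, the Human Lung Cell Atlas (HLCA), integrating 46 datasets of the human respiratory system, 58 cell types in total;
Immune_All_AddPIP.pkl, immune cell types from more than 20 human tissues (Immune_All_Low + Immune_All_PIP);
Immune_All_High.pkl, 32 immune cell populations from 20 tissues across 19 studies;
Immune_All_Low.pkl, 90 immune cell subpopulations from 20 tissues across 19 studies;
Immune_All_PIP.pkl, 41 immune cell types from 16 adult tissues;
Nuclei_Lung_Airway.pkl, 78 subpopulations from five locations of the human lungs and airways;
Pan_Fetal_Human.pkl, 138 stromal and immune subpopulations from the human fetus.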

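For example, to annotate immune cell types, the website recommends starting with the "Immune_All_Low/High" models, because they contain immune cell types collected from different tissues. "Low" denotes low-hierarchy (high-resolution) cell types and subtypes, while "High" denotes high-hierarchy (low-resolution) cell types. Users can also try the extended immune reference model ("Immune_All_AddPIP").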

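III. Hands-on `Celltypist` annotation from R

Because Celltypist depends on a Python environment, here we use the reticulate package to bridge to Python from within R; see "通过R里面的reticulate包桥接使用Windows的conda" (using Windows conda from R through the reticulate package).

First load the R packages: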

library(Seurat)
library(SeuratData)
library(ggplot2)
library(patchwork)
library(dplyr)
library(stringr)
library(readr)
library(cowplot)
library(reticulate)

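Step1. Load the reference datasets downloaded above

Depending on your needs and your understanding of each reference dataset, you can load only the ones that fit. Here we load all of them at once as a test: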

# First confirm where the .pkl files are stored; the default location is
celltypist$models$celltypist_path

# Load all reference datasets, naming each list element after its file (without ".pkl")
model_type = list.files("~/.celltypist/data/models/")
names(model_type) = str_split(string = model_type, pattern = "\\.", simplify = TRUE)[,1]

model_list = lapply(model_type, function(x){
  celltypist$models$Model$load(model = x)})

head(model_list)


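Step2. Load the example data

Here I use a test dataset that has already been through the standard workflow and integration analysis; see "单细胞多样本整合之Harmony，LIGER和LISI" (multi-sample single-cell integration with Harmony, LIGER and LISI).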

ifnb.data = read_rds("./ifnb.test.data.rds")
ifnb.data
# An object of class Seurat 
# 14053 features across 13999 samples within 1 assay 
# Active assay: RNA (14053 features, 2000 variable features)
# 4 dimensional reductions calculated: pca, harmony, umap, tsne

p0 = DimPlot(ifnb.data, reduction = "umap",label = T,label.box = T,group.by = "seurat_annotations")+ NoLegend();p0

Convert the Seurat object into the object Celltypist requires:

#### 2. Seurat to Celltypist
# Seurat stores counts as genes x cells; AnnData expects cells x genes, hence t()
counts = ifnb.data[['RNA']]@counts
adata = scanpy$AnnData(X = numpy$array(t(as.matrix(counts))),
                       obs = pandas$DataFrame(ifnb.data@meta.data),
                       var = pandas$DataFrame(data.frame(gene = rownames(counts),
                                                         row.names = rownames(counts)))
)

# Normalize each cell to 10,000 total counts, then log1p-transform
scanpy$pp$normalize_total(adata, target_sum=1e4)
scanpy$pp$log1p(adata)
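
To make the two preprocessing calls above concrete, here is a minimal pure-Python sketch of what normalizing to 10,000 counts followed by log1p does to a single cell. The function name `normalize_log1p` and the toy counts are my own illustrative assumptions, not part of scanpy or Celltypist; scanpy applies the equivalent operation to the whole matrix.

```python
import math

def normalize_log1p(counts, target_sum=1e4):
    """Scale one cell's raw counts so they sum to `target_sum`, then apply log(1 + x)."""
    total = sum(counts)
    if total == 0:
        # An empty cell stays all-zero instead of dividing by zero.
        return [0.0 for _ in counts]
    scaled = [c / total * target_sum for c in counts]
    return [math.log1p(v) for v in scaled]

# Hypothetical raw counts for one cell across four genes.
toy_cell = [1, 3, 6, 0]
normalized = normalize_log1p(toy_cell)
print(normalized)
```

Undoing the log with `math.expm1` and summing the values gives back roughly 10,000, which is a quick sanity check that a matrix is in the expected format.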

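Step3. Cell subpopulation prediction and visualization

Based on the descriptions of the reference datasets above, I start by predicting with the Immune_All_High and Immune_All_Low models: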

### 1. Immune_All_High
predictions = celltypist$annotate(adata, model = model_list[["Immune_All_High"]], majority_voting = TRUE)
## Add the labels to the Seurat object
ifnb.data = AddMetaData(ifnb.data, predictions$predicted_labels$majority_voting, col.name = "Immune_All_High")

### 2. Immune_All_Low
predictions = celltypist$annotate(adata, model = model_list[["Immune_All_Low"]], majority_voting = TRUE)
## Add the labels to the Seurat object
ifnb.data = AddMetaData(ifnb.data, predictions$predicted_labels$majority_voting, col.name = "Immune_All_Low")

### 3. Visualization
p1 = DimPlot(ifnb.data, group.by = "Immune_All_High", reduction = "umap", label = TRUE)
p2 = DimPlot(ifnb.data, group.by = "Immune_All_Low", reduction = "umap", label = TRUE)
p0 + p1 + p2

Interpreting the results:

head(predictions$predicted_labels)

                    predicted_labels over_clustering          majority_voting
AAACATACATTTCC.1         Macrophages               9              Macrophages
AAACATACCAGAAA.1      CD16- NK cells              93              Macrophages
AAACATACCTCGCT.1           Monocytes              96              Macrophages
AAACATACCTGGTA.1          T(agonist)              54                      pDC
AAACATACGATGAA.1  Regulatory T cells              59 Tcm/Naive helper T cells
AAACATACGGCATT.1 Classical monocytes              71              Macrophages

predicted_labels contains the main results: the predicted label of each cell, its over-clustering assignment, and (if majority voting is enabled) the predicted label after the majority voting approach.
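
The table above shows cells whose own predicted label differs from their majority_voting label. A minimal sketch of the idea follows, assuming a simplified scheme in which every cell takes the most frequent predicted label within its over-clustering group. The function `majority_vote` and the toy inputs are hypothetical; the `min_prop` threshold and the "Heterogeneous" fallback reflect my understanding of Celltypist's behaviour and should be checked against its documentation.

```python
from collections import Counter

def majority_vote(predicted_labels, over_clusters, min_prop=0.0):
    """Relabel each cell with the dominant predicted label of its subcluster."""
    # Collect the per-cell labels that fall into each subcluster.
    labels_by_cluster = {}
    for label, cluster in zip(predicted_labels, over_clusters):
        labels_by_cluster.setdefault(cluster, []).append(label)

    dominant = {}
    for cluster, labels in labels_by_cluster.items():
        top_label, top_count = Counter(labels).most_common(1)[0]
        # A subcluster without a sufficiently dominant label is flagged, not named.
        if top_count / len(labels) >= min_prop:
            dominant[cluster] = top_label
        else:
            dominant[cluster] = "Heterogeneous"
    return [dominant[c] for c in over_clusters]

# Toy example: five cells in two subclusters.
toy_labels = ["Macrophages", "Monocytes", "Macrophages", "pDC", "pDC"]
toy_clusters = [9, 9, 9, 54, 54]
voted = majority_vote(toy_labels, toy_clusters)
print(voted)
```

In this toy input the lone "Monocytes" cell in subcluster 9 is relabelled "Macrophages", which is the same kind of change seen between the predicted_labels and majority_voting columns.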

predictions$decision_matrix[1:5,1:5]

                    B cells CD16+ NK cells CD16- NK cells     CD8a/a CD8a/b(entry)
AAACATACATTTCC.1  -7.478285      -6.309951      -9.644588  -6.063277     -9.197812
AAACATACCAGAAA.1 -10.907189      -7.187142      -2.544067  -7.708966     -8.788600
AAACATACCTCGCT.1  -9.888353      -9.490017      -8.319425  -7.478427    -10.045900
AAACATACCTGGTA.1  -6.922568     -10.081439     -11.210727 -10.871817    -14.282324
AAACATACGATGAA.1  -5.262391     -10.407718      -3.569081  -9.539726     -9.002374

decision_matrix contains the matrix of decision scores for each cell across cell types, which is used to determine the final predicted cell type of each cell.
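
As a rough illustration of how a decision score row maps to a label, the sketch below picks the cell type with the highest score for one cell. It assumes the prediction simply takes the best-scoring type, and it uses only the five columns printed above, whereas the real matrix has one column per cell type in the model. The helper `top_label` is hypothetical.

```python
cell_types = ["B cells", "CD16+ NK cells", "CD16- NK cells", "CD8a/a", "CD8a/b(entry)"]
# Decision scores for cell AAACATACCAGAAA.1, copied from the printed matrix excerpt.
scores = [-10.907189, -7.187142, -2.544067, -7.708966, -8.788600]

def top_label(scores, cell_types):
    """Return the cell type whose decision score is the largest."""
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return cell_types[best_index]

best = top_label(scores, cell_types)
print(best)  # CD16- NK cells
```

This agrees with the predicted_labels entry for the same cell, which is "CD16- NK cells" before majority voting.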

predictions$probability_matrix[1:5,1:5]

                      B cells CD16+ NK cells CD16- NK cells       CD8a/a CD8a/b(entry)
AAACATACATTTCC.1 5.649064e-04   1.814823e-03   6.477096e-05 2.321363e-03  1.012505e-04
AAACATACCAGAAA.1 1.832568e-05   7.556756e-04   7.282605e-02 4.485838e-04  1.524380e-04
AAACATACCTCGCT.1 5.075989e-05   7.559709e-05   2.436766e-04 5.648265e-04  4.336131e-05
AAACATACCTGGTA.1 9.843261e-04   4.184739e-05   1.352811e-05 1.898548e-05  6.269967e-07
AAACATACGATGAA.1 5.156176e-03   3.019759e-05   2.740930e-02 7.193141e-05  1.231020e-04

probability_matrix contains the matrix of probabilities that each cell belongs to a given cell type (transformed from the decision matrix by the sigmoid function).
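
The relationship between the two matrices can be checked by hand. The sketch below applies the sigmoid function to the first three decision scores of cell AAACATACATTTCC.1 from the printed output; the helper `sigmoid` is just the textbook formula, not a Celltypist function.

```python
import math

def sigmoid(score):
    """Map a decision score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))

# Decision scores for B cells, CD16+ NK cells and CD16- NK cells (first printed row).
decision_scores = [-7.478285, -6.309951, -9.644588]
probabilities = [sigmoid(s) for s in decision_scores]
print(probabilities)
```

The results agree with the corresponding entries of the printed probability_matrix (about 5.65e-04, 1.81e-03 and 6.48e-05). Because each score is transformed on its own, the probabilities in a row are not forced to sum to 1.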

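Step4. Write a function to run the models in batch

Here I wrote a function that runs the prediction for a given model and produces the plot in one step: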

celltypist_vis = function(adata,
                          seurat_Data,
                          model = "Immune_All_Low.pkl",
                          title = "Immune_All_Low"){
  ### 4. Run the prediction (model can be a model name or a loaded Model object)
  predictions = celltypist$annotate(adata, model = model, majority_voting = TRUE)
  # predictions$predicted_labels %>% head()

  #### 5. Add the predicted labels to the Seurat object
  seurat_Data = AddMetaData(seurat_Data, predictions$predicted_labels)

  # Plot the UMAP colored by the majority-voting labels
  seurat_Data = SetIdent(seurat_Data, value = "majority_voting")
  p.umap = DimPlot(seurat_Data, reduction = "umap", label = TRUE, pt.size = 0.5, label.box = TRUE, repel = TRUE) +
    NoLegend() + ggtitle(title)
  return(p.umap)
}

### Batch prediction: one UMAP per reference model
plot.list = list()
for (i in seq_along(model_list)) {
  plot.list[[i]] = celltypist_vis(adata = adata,
                                  seurat_Data = ifnb.data,
                                  model = model_list[[i]],
                                  title = names(model_list)[i])
}

p.ct = wrap_plots(plot.list, ncol = 4) + p0
# Make sure the output directory exists before saving
dir.create("Outplot", showWarnings = FALSE)
ggsave(p.ct, filename = "Outplot/annotation_celltypist.jpeg",
       width = 20, height = 20)

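Because all reference datasets were used here, there are quite a lot of results. Overall, Celltypist's predictions are decent, and in my experience at least considerably more accurate than SingleR's. To improve the accuracy of the annotation, I recommend two things:

First, choose a reference dataset that suits your own data;
Second, cross-validate the predictions with markers.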
