《拟时序分析》5.monocle3的降维、分群、聚类
上次教程已经带领大家先简单的认识一下monocle3的特点与功能(单细胞测序数据进阶分析—《拟时序分析》4.初识monocle3)。虽然是录了五十分钟,但是鉴于monocle3的坑还是很多,所以每一部分的功能还是需要带大家展开了解一下。这次内容主要带领大家学习monocle3的降维、分群和聚类功能,不知道大家是更喜欢用Seurat还是这次的monocle3呢?
视频中提到的Rmakdown还没有制作完,大家不要着急,我们先把这节课的测试文件与代码放在这个链接中,monocle3系列课程制作完毕会再进行更新,请大家持续关注:
代码及测试文件链接:https://pan.baidu.com/s/16lPDInnenycR9sTcABolSg?pwd=tdp7你也可以直接前往monocle3官网教程进行学习:
https://cole-trapnell-lab.github.io/monocle3/docs/differential/
视频教程总是上传失败,大家直接去B站看吧:
https://www.bilibili.com/video/BV1br4y1x7Hf?p=9
图文教程:
2.1.monocle内置
2.1.1.降维
rm(list = ls());gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 9994083 533.8 27726538 1480.8 27726538 1480.8
## Vcells 18353825 140.1 158881073 1212.2 198600815 1515.3
library(monocle3)library(dplyr)cds <- readRDS('test.data/simple.rds')#怎么构建cds就不重复了,直接读一下上一部分的cds <- preprocess_cds(cds, num_dim = 100)#主要是完成normalize,上一个视频刚讲过plot_pc_variance_explained(cds)
cds <- reduce_dimension(cds)#降维,相当于Seurat::RunUMAP()
## No preprocess_method specified, using preprocess_method = 'PCA'
#这里我们没有指定reduction_method,但是默认是"UMAP",有"UMAP", "tSNE", "PCA", "LSI", "Aligned"可供选择 #一般大于10000个细胞认为是大数据集,这时可以用 umap.fast_sgd=TRUE
umap.fast_sgd
参数或者设置cores
可以加快运算速度,但是可以能会让降维图最终显得有些不同,所以要慎重选择
red.1 <- plot_cells(cds, color_cells_by="cao_cell_type", cell_size = 2, graph_label_size = 5, group_label_size = 3)#类似于Seurat::DimPlot()## No trajectory to plot. Has learn_graph() been called yet?cds <- reduce_dimension(cds,umap.fast_sgd=T)## No preprocess_method specified, using preprocess_method = 'PCA'## Note: reduce_dimension will produce slightly different output each time you run it unless you set 'umap.fast_sgd = FALSE' and 'cores = 1'red.2 <- plot_cells(cds, color_cells_by="cao_cell_type", cell_size = 2, graph_label_size = 5, group_label_size = 3)## No trajectory to plot. Has learn_graph() been called yet?#这些参数的默认值都太小了,不利于我们观察,每次要敲很多参数,这就是我吐槽monocle3代码不友好的地方cds <- reduce_dimension(cds,cores =5)## No preprocess_method specified, using preprocess_method = 'PCA'## Note: reduce_dimension will produce slightly different output each time you run it unless you set 'umap.fast_sgd = FALSE' and 'cores = 1'red.3 <- plot_cells(cds, color_cells_by="cao_cell_type", cell_size = 2, graph_label_size = 5, group_label_size = 3)## No trajectory to plot. Has learn_graph() been called yet?suppressMessages(library(patchwork))red.1|red.2|red.3#线程对降维的影响似乎要小一些## Warning: Removed 1 rows containing missing values (geom_text_repel).## Removed 1 rows containing missing values (geom_text_repel).## Removed 1 rows containing missing values (geom_text_repel).## Warning: ggrepel: 8 unlabeled data points (too many overlaps). Consider## increasing max.overlaps## Warning: ggrepel: 9 unlabeled data points (too many overlaps). Consider## increasing max.overlaps## Warning: ggrepel: 8 unlabeled data points (too many overlaps). Consider## increasing max.overlaps
试试tSNE
cds <- reduce_dimension(cds, reduction_method="tSNE")
## No preprocess_method specified, using preprocess_method = 'PCA'
plot_cells(cds, color_cells_by="cao_cell_type", reduction_method="tSNE")#给大家看看默认效果多让人抓狂
## No trajectory to plot. Has learn_graph() been called yet?
## Warning: Removed 1 rows containing missing values (geom_text_repel).
#看看UMAP和tSNE的差别吧red.1|plot_cells(cds, reduction_method="tSNE", color_cells_by="cao_cell_type", cell_size = 2, group_label_size = 3)
## No trajectory to plot. Has learn_graph() been called yet?
## Warning: Removed 1 rows containing missing values (geom_text_repel).
## Removed 1 rows containing missing values (geom_text_repel).
## Warning: ggrepel: 8 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
2.1.2.分群
cds <- cluster_cells(cds)#分群#这里计算方式用的是community detection,类似于Seurat::FindClusters(),但原理有所不同cds <- cluster_cells(cds, resolution=0.1)#貌似现在只能选择分辨率,不能选择分群数量了,我觉得还挺好的一个功能没了unique(partitions(cds))#查看分群ID
## [1] 1 3 2
## Levels: 1 2 3
plot_cells(cds)
## No trajectory to plot. Has learn_graph() been called yet?
plot_cells(cds, genes=c("cpna-2", "egl-21", "ram-2", "inos-1"))#看看基因在降维图中的表达量吧
## No trajectory to plot. Has learn_graph() been called yet?
#相当于Seurat::FeaturePlot()
2.1.3.去批次
names(cds@colData)#用于去批次的变量从这里挑
## [1] "plate" "cao_cluster" "cao_cell_type" "cao_tissue"
## [5] "Size_Factor"
bfalig <- plot_cells(cds, color_cells_by="plate", label_cell_groups=FALSE)
## No trajectory to plot. Has learn_graph() been called yet?
cds <- align_cds(cds, alignment_group = "plate")#去除批次效应
## Aligning cells from different batches using Batchelor.
## Please remember to cite:
## Haghverdi L, Lun ATL, Morgan MD, Marioni JC (2018). 'Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors.' Nat. Biotechnol., 36(5), 421-427. doi: 10.1038/nbt.4091
aftalig <- plot_cells(cds, color_cells_by="plate", label_cell_groups=FALSE)
## No trajectory to plot. Has learn_graph() been called yet?
bfalig|aftalig#看看效果
2.1.4.计算并展示marker基因
#计算marker基因,类似于Seurat::FindAllMarkers()marker_test_res <- top_markers(cds, group_cells_by="cao_cell_type", reference_cells=1000, cores=2)
marker_test_res[1:4,1:4]
## gene_id gene_short_name cell_group marker_score
## 1 WBGene00000004 aat-3 GABAergic neurons 0.4279421
## 2 WBGene00000016 abf-5 Pharyngeal muscle 0.3333333
## 3 WBGene00000031 abu-8 Pharyngeal epithelia 0.3269503
## 4 WBGene00000037 ace-3 Canal associated neurons 0.4750789
top_specific_markers <- marker_test_res %>% filter(fraction_expressing >= 0.10) %>% group_by(cell_group) %>% top_n(1, pseudo_R2)top_specific_markers[1:4,1:4]
## # A tibble: 4 × 4
## # Groups: cell_group [4]
## gene_id gene_short_name cell_group marker_score
## <chr> <fct> <chr> <dbl>
## 1 WBGene00000485 che-3 Dopaminergic neurons 0.591
## 2 WBGene00000681 col-107 Seam cells 0.326
## 3 WBGene00000753 col-180 Non-seam hypodermis 0.355
## 4 WBGene00001172 egl-3 Pharyngeal neurons 0.297
top_specific_marker_ids <- unique(top_specific_markers %>% pull(gene_id))plot_genes_by_group(cds, top_specific_marker_ids[1:10], group_cells_by="cao_cell_type", max.size=3)#相当于Seurat::DotPlot
## Warning in if (ordering_type == "cluster_row_col") {: the condition has length >
## 1 and only the first element will be used
## Warning in stats::cor(t(res)): the standard deviation is zero
## Warning in if (axis_order == "marker_group") {: the condition has length > 1 and
## only the first element will be used
plot_genes_by_group(cds, top_specific_marker_ids[1:10], max.size=3, group_cells_by="partition")#根据cluster分组绘制
## Warning in if (ordering_type == "cluster_row_col") {: the condition has length >
## 1 and only the first element will be used
## Warning in if (ordering_type == "cluster_row_col") {: the condition has length >
## 1 and only the first element will be used
2.1.5.注释细胞
colData(cds)$assigned_cell_type <- as.character(partitions(cds))colData(cds)$assigned_cell_type <- dplyr::recode(colData(cds)$assigned_cell_type, "1"="Bio_1", "2"="Bio_2", "3"="Bio_3", '4'='Bio_4')#值得表扬的是没有像Seurat一样从0开始编号plot_cells(cds, group_cells_by="partition", color_cells_by="assigned_cell_type", cell_size = 2, graph_label_size = 5, group_label_size = 3)
## No trajectory to plot. Has learn_graph() been called yet?
2.1.6.取子集以及亚群分析
try(cds_subset <- choose_cells(cds))#交互式的圈出一些细胞用于下游分析
## Error : choose_cells only works in interactive mode.
cds_subset <- cds #这里我本来细胞数就比较少就不取了class(cds_subset)
## [1] "cell_data_set"
## attr(,"package")
## [1] "monocle3"
pr_graph_test_res <- graph_test(cds_subset, neighbor_graph="knn", cores=2)
通过marker基因集可视化帮助判断
#专门用来分析亚群的marker genepr_deg_ids <- row.names(subset(pr_graph_test_res, morans_I > 0.01 & q_value < 0.05))gene_module_df <- find_gene_modules(cds_subset[pr_deg_ids,], resolution=1e-3)#Seurat::AddModuleScore()gene_module_df[1:5,1:5]
## # A tibble: 5 × 5
## id module supermodule dim_1 dim_2
## <chr> <fct> <fct> <dbl> <dbl>
## 1 WBGene00000002 11 5 1.19 2.30
## 2 WBGene00000003 2 2 9.66 -2.55
## 3 WBGene00000004 5 3 4.72 0.633
## 4 WBGene00000006 2 2 10.7 -2.83
## 5 WBGene00000010 3 1 -5.50 -2.16
plot_cells(cds_subset, genes=gene_module_df, show_trajectory_graph=FALSE, label_cell_groups=FALSE, cell_size = 0.5, graph_label_size = 5, group_label_size = 3)#相当于基因集打分
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
2.2.Garnett
这个功能旨在帮助大家自动注释细胞
2.2.1.利用Marker制作Garnett文件
assigned_type_marker_test_res <- top_markers(cds, group_cells_by="assigned_cell_type", reference_cells=1000, cores=2)
garnett_markers <- assigned_type_marker_test_res %>% filter(marker_test_q_value < 0.01 & specificity >= 0.5) %>% group_by(cell_group) %>% top_n(5, marker_score)garnett_markers <- garnett_markers %>% group_by(gene_short_name) %>% filter(n() == 1)garnett_markers[1:5,1:5]
## # A tibble: 5 × 5
## # Groups: gene_short_name [5]
## gene_id gene_short_name cell_group marker_score mean_expression
## <chr> <fct> <chr> <dbl> <dbl>
## 1 WBGene00000063 act-1 Bio_1 0.750 3.39
## 2 WBGene00001172 egl-3 Bio_2 0.229 1.88
## 3 WBGene00001189 egl-21 Bio_2 0.212 1.39
## 4 WBGene00002048 ida-1 Bio_2 0.291 2.14
## 5 WBGene00003369 mlc-1 Bio_1 0.738 1.41
generate_garnett_marker_file(garnett_markers, file="./marker_file.txt")
## Garnett marker file written to ./marker_file.txt
2.2.2.利用先验Marker制作Garnett文件
#如果你有一些先验的marker,也可以自行修改Garnett文件#目前Garnett可以支持monocle2和monocle3if(!require(org.Mm.eg.db))BiocManager::install( c("org.Mm.eg.db", "org.Hs.eg.db","org.Ce.eg.db"))
## Warning: package 'AnnotationDbi' was built under R version 4.1.1
if(!require(garnett))devtools::install_github("cole-trapnell-lab/garnett", ref="monocle3")unique(clusters(cds))
## [1] 6 1 4 3 5 7 2 8 9
## Levels: 1 2 3 4 5 6 7 8 9
colData(cds)$garnett_cluster <- clusters(cds)#根据marker基因注释细胞worm_classifier <- train_cell_classifier(cds = cds, marker_file = "./marker_file.txt", db=org.Ce.eg.db::org.Ce.eg.db, cds_gene_id_type = "ENSEMBL", num_unknown = 50, #防止被错误的分配,可以输入unknown的数量 marker_file_gene_id_type = "SYMBOL", cores=4)#读取并构建garnett注释
## Warning in make_name_map(parse_list,
## as.character(row.names(rowData(norm_cds))), : 1 genes could not be converted
## from SYMBOL to ENSEMBL These genes are listed below:
## Warning in make_name_map(parse_list, as.character(row.names(rowData(norm_cds))), : The following genes from the cell type definition file are not presentin the cell dataset. Please check these genes for errors. Cell typedetermination will continue, ignoring these genes.
## tag-80
## training_sample
## Cell type Bio_1 Cell type Bio_2 Cell type Bio_3 Unknown
## 118 106 64 50
rownames(cds)[1:5]
## [1] "WBGene00000001" "WBGene00000002" "WBGene00000003" "WBGene00000004"
## [5] "WBGene00000005"
class(worm_classifier)#如果你对一个组织器官比较了解,可以给出一些先验的marker
## [1] "garnett_classifier"
## attr(,"package")
## [1] "garnett"
cds <- classify_cells(cds, worm_classifier, db = org.Ce.eg.db::org.Ce.eg.db, cluster_extend = TRUE, cds_gene_id_type = "ENSEMBL")#把garnett的注释加进来unique(colData(cds)$garnett_cluster)
## [1] 6 1 4 3 5 7 2 8 9
## Levels: 1 2 3 4 5 6 7 8 9
unique(colData(cds)$cluster_ext_type)
## [1] "Cell type Bio_1" "Cell type Bio_3" "Unknown" "Cell type Bio_2"
plot_cells(cds,show_trajectory_graph = F, group_cells_by="partition", color_cells_by="cluster_ext_type", cell_size = 2, graph_label_size = 5, group_label_size = 3)|plot_cells(cds,show_trajectory_graph = F, group_cells_by="partition", color_cells_by="assigned_cell_type", cell_size = 2, graph_label_size = 5, group_label_size = 3)
拟时序系列教程
B站连续播放起来比较方便:
https://www.bilibili.com/video/BV1br4y1x7Hf?p=1往期推送单细胞测序数据进阶分析—《拟时序分析》1.概论
单细胞测序数据进阶分析—《拟时序分析》2.monocle概论单细胞测序数据进阶分析—《拟时序分析》3.monocle2实操:完整版单细胞测序数据进阶分析—《拟时序分析》4.初识monocle3
往期回顾
单细胞数据分析系列教程:
B站视频,先看一遍视频再去看推送操作,建议至少看三遍:https://www.bilibili.com/video/BV1S44y1b76Z/单细胞测序基础数据分析保姆级教程,代码部分整理在往期推送之中:手把手教你做单细胞测序数据分析(一)——绪论
手把手教你做单细胞测序数据分析(二)——各类输入文件读取
手把手教你做单细胞测序数据分析(三)——单样本分析
手把手教你做单细胞测序数据分析(四)——多样本整合手把手教你做单细胞测序数据分析(五)——细胞类型注释
手把手教你做单细胞测序数据分析(六)——组间差异分析及可视化手把手教你做单细胞测序数据分析(七)——基因集富集分析
上游fastq文件处理:
单细胞分析的最上游——处理Fastq文件:cellranger
单细胞分析的最上游——处理Fastq文件:dropseqRunner有奖提问,这个smart-Seq2数据实战的比对率为何如此低?
细胞通讯B站连续播放起来比较方便:
https://www.bilibili.com/video/BV1Ab4y1W7qx?p=1往期推送单细胞测序数据进阶分析—《细胞通讯》1.概论
单细胞测序数据进阶分析—《细胞通讯》2.1CellChat基础分析教程
单细胞测序数据进阶分析—《细胞通讯》2.2CellChat多组别分析
给你们申请到了免费的128GB云服务器
拟时序分析B站连续播放起来比较方便:
https://www.bilibili.com/video/BV1br4y1x7Hf?p=1往期推送单细胞测序数据进阶分析—《拟时序分析》1.概论
单细胞测序数据进阶分析—《拟时序分析》2.monocle概论单细胞测序数据进阶分析—《拟时序分析》3.monocle2实操:完整版
单细胞测序数据进阶分析—《拟时序分析》4.初识monocle3
其他单细胞相关技术贴也在这里:细胞的数量由誰决定?单细胞中应该如何做GSVA?答读者问(三):单细胞测序前景答读者问(四):如何分析细胞亚群答读者问(六):Seurat中如何让细胞听你指挥答读者问(八):为什么Read10X也会报错?
答读者问(十)整合后的表达矩阵,如何拆分出分组信息?
答读者问(十一)如何一次性读取一个目录下的cellranger输出文件?
给你安排一个懂生信的工具人(十):不学编程 零代码完成单细胞测序数据分析:Loupe Browser
什么?不做单细胞也能分析细胞类群和免疫浸润?
答读者问 (十三)查看Seurat对象时的ERROR:type='text'
各类单细胞对象(数据格式)转换大全(一)批量整理好GEO中下载的单细胞数据
答读者问 (十四)Seurat中分类变量处理技巧
答读者问 (十五)稀疏矩阵转matrix, as.matrix函数是下下策
答读者问 (十六)做单细胞测序到底需要多少内存答读者问 (十七)调用的线程越多就算的越快嘛?
单细胞文献阅读:文献阅读(三)、单细胞测序解析糖尿病肾病中肾小球的动态变化
文献阅读(四)、单细胞测序技术解析健康人与T2D患者的胰岛差异文献阅读(六)、小鼠全肾单细胞测序开篇之作文献阅读(七)、一篇不花钱就能白嫖的文章文献阅读(八)、不会吧不会吧,Nature都能白嫖?文献阅读(十一)、高氧下小鼠肺发育损伤的ScRNA图谱文献阅读(十二)、IgAN & STRT-Seq
文献阅读(十三)、老树开新花——EGFR、肿瘤、免疫+scRNA-Seq
文献阅读(十五)、癌前基质细胞驱动BRCA1肿瘤发生文献阅读(十八)、紧跟生信"钱"沿,胰腺癌&免疫多模态图谱
非技术帖:关于单细胞的事 谈谈后面的计划趁机预告一波有关提高"Biomamba 生信基地"运行效率事宜
独享服务器2.0即将上线,128GB免费试用生信分析为什么要使用服务器?
如何联系我们
大家可以阅读完这几篇之后添加笑一笑也就算了如何搜索公众号过往发布内容