解决SCENIC流程的多线程问题
去年我们在《生信技能树》公众号带领大家一起学习过:SCENIC转录因子分析结果的解读 ,而且前些天我给了一个 单细胞转录因子分析之SCENIC流程:,但是单细胞各种交流群很多朋友反映自己follow我们的教程,出现了如下的报错:
Error in .AUCell_calcAUC(geneSets = geneSets, rankings = rankings, nCores = nCores, :
Valid 'mctype': 'snow' or 'doMC'
一般来说,遇到了这样的报错说明其实并没有follow我的教程,而是follow的官方教程。因为我自己就是跟着官方文档跑的时候报错了,才进行了一些小的修改,主要是多线程问题,让我一一道来。
因为初始化的时候设置多线程
### Initialize settings
library(SCENIC)
db='cisTarget_databases/'
list.files(db)
# 保证cisTarget_databases 文件夹下面有下载好2个1G的文件
scenicOptions <- initializeScenic(org="hgnc",
dbDir=db , nCores=4)
saveRDS(scenicOptions, file="int/scenicOptions.Rds")
saveRDS(cellInfo, file="int/cellInfo.Rds")
可以看到前面的 nCores=4 参数,告诉我们的电脑,可以开启4个线程。
runGenie3可以正常多线程
开启多线程,速度超级快!
> runGenie3(exprMat_filtered_log, scenicOptions)
Using 1419 TFs as potential regulators...
Running GENIE3 part 1
Running GENIE3 part 10
Running GENIE3 part 2
Running GENIE3 part 3
Running GENIE3 part 4
Running GENIE3 part 5
Running GENIE3 part 6
Running GENIE3 part 7
Running GENIE3 part 8
Running GENIE3 part 9
Finished running GENIE3.
本来呢,2000多个细胞走这个步骤,需要24小时左右,但是加上了4个线程,一天一夜的时间消耗就变成了一个晚上。
但是runSCENIC_3_scoreCells多线程失败
虽然跑前面的runGenie3可以正常多线程,节省了大量的时间,但是后面runSCENIC_3_scoreCells多线程失败。
报错如下:
> scenicOptions <- runSCENIC_3_scoreCells(scenicOptions, exprMat_log )
20:54 Step 3. Analyzing the network activity in each individual cell
Number of regulons to evaluate on cells: 81
Biggest (non-extended) regulons:
SOX4 (252g)
STAT1 (141g)
JUN (99g)
TAF7 (99g)
SPI1 (90g)
RFX3 (78g)
IRF9 (65g)
FOS (54g)
RFX2 (54g)
YY1 (54g)
Quantiles for the number of genes detected by cell:
(Non-detected genes are shuffled at the end of the ranking. Keep it in mind when choosing the threshold for calculating the AUC).
min 1% 5% 10% 50% 100%
607.00 700.72 1116.20 1721.60 8015.00 12772.00
Using 4 cores.
Error in .AUCell_calcAUC(geneSets = geneSets, rankings = rankings, nCores = nCores, :
Valid 'mctype': 'snow' or 'doMC'
应该是 snow 或者 doMC 这两个包的一些函数更新了,但是SCENIC流程的R代码里面并没有及时更新。(原则上我们应该是去看源代码,解决它,然后去SCENIC流程的R代码的GitHub上面提交一个issue,不过因为时间关系,我们就到此为止啦。)
而且实际上这个runSCENIC_3_scoreCells步骤并不是计算量很大,无需设置多线程,所以可以重新修改为 nCores=1,代码如下:
scenicOptions <- initializeScenic(org="hgnc",
dbDir=db , nCores=1)
然后就顺利运行了:
> scenicOptions <- runSCENIC_3_scoreCells(scenicOptions, exprMat_log )
21:01 Step 3. Analyzing the network activity in each individual cell
Number of regulons to evaluate on cells: 81
Biggest (non-extended) regulons:
SOX4 (252g)
STAT1 (141g)
JUN (99g)
TAF7 (99g)
SPI1 (90g)
RFX3 (78g)
IRF9 (65g)
FOS (54g)
RFX2 (54g)
YY1 (54g)
Quantiles for the number of genes detected by cell:
(Non-detected genes are shuffled at the end of the ranking. Keep it in mind when choosing the threshold for calculating the AUC).
min 1% 5% 10% 50% 100%
607.00 700.72 1116.20 1721.60 8015.00 12772.00
21:02 Finished running AUCell.
21:02 Plotting heatmap...
21:02 Plotting t-SNEs...
是不是超级简单啊!
整理好的全部代码是:
我把前面的代码重新整理了一下,如果你有构建好的单细胞seurat对象,就可以直接走下面的代码:
rm(list = ls())
library(Seurat)
load(file = 'Epithelial_recluster_sce.Rdata')
sce
phe=sce@meta.data
mat=sce@assays$RNA@counts
mat[1:4,1:4]
exprMat =as.matrix(mat)
dim(exprMat)
exprMat[1:4,1:4]
head(phe)
cellInfo <- phe[,c('seurat_clusters','nCount_RNA' ,'nFeature_RNA' )]
colnames(cellInfo)=c('CellType', 'nGene' ,'nUMI')
head(cellInfo)
table(cellInfo$CellType)
### Initialize settings
library(SCENIC)
db='cisTarget_databases/'
list.files(db)
# 保证cisTarget_databases 文件夹下面有下载好2个1G的文件
scenicOptions <- initializeScenic(org="hgnc",
dbDir=db , nCores=4)
saveRDS(scenicOptions, file="int/scenicOptions.Rds")
saveRDS(cellInfo, file="int/cellInfo.Rds")
### Co-expression network
genesKept <- geneFiltering(exprMat, scenicOptions)
exprMat_filtered <- exprMat[genesKept, ]
exprMat_filtered[1:4,1:4]
dim(exprMat_filtered)
runCorrelation(exprMat_filtered, scenicOptions)
exprMat_filtered_log <- log2(exprMat_filtered+1)
runGenie3(exprMat_filtered_log, scenicOptions)
### Build and score the GRN
exprMat_log <- log2(exprMat+1)
scenicOptions@settings$dbs <- scenicOptions@settings$dbs["10kb"] # Toy run settings
scenicOptions <- runSCENIC_1_coexNetwork2modules(scenicOptions)
scenicOptions <- runSCENIC_2_createRegulons(scenicOptions,
coexMethod=c("top5perTarget")) # Toy run settings
library(doParallel)
scenicOptions <- initializeScenic(org="hgnc",
dbDir=db , nCores=1)
scenicOptions <- runSCENIC_3_scoreCells(scenicOptions, exprMat_log )
scenicOptions <- runSCENIC_4_aucell_binarize(scenicOptions)
tsneAUC(scenicOptions, aucType="AUC") # choose settings
export2loom(scenicOptions, exprMat)
saveRDS(scenicOptions, file="int/scenicOptions.Rds")
完美而且快速的走完了这个SCENIC流程。
我前面在教程 单细胞转录因子分析之SCENIC流程提到的两个解决方案,第一个是对细胞亚群里面的单细胞进行抽样,第二个是 Python (pySCENIC). 教程,开启多线程!其实在R里面也算是解决了一半,我目前还没有去测试,在R里面跟python里面,到底是速度有啥差异,如果都开启了多线程的话。