其他
一文搞定GSVA及下游分析的代码实现
Biomamba
2022/1/8
之前我们已经详细解释过GSVA的原理:文献阅读系列(九)、一文了解GSVA原理
也给大家提供了GSVA的最快速实现方式:三行搞定GSVA分析
那么本文主要来教大家如何真正的实现GSVA以及GSVA得到的结果又如何进行下游分析
先进行一个最简单的gsva分析吧,有助于大家理解这个过程
suppressMessages(if (!require(BiocManager))install.packages("BiocManager"))
## Warning: package 'BiocManager' was built under R version 4.0.5
suppressMessages(if (!require(GSVA)) install.packages("GSVA"))
#模拟一个行名为基因名,列名为样本名的符合 random Gaussian 分布的表达矩阵
p <- 10000 ## number of genes
n <- 30 ## number of samples
## simulate expression values from a standard Gaussian distribution
X <- matrix(rnorm(p*n), nrow=p,
dimnames=list(paste0("g", 1:p), paste0("s", 1:n)))
X[1:5, 1:5]
## s1 s2 s3 s4 s5
## g1 0.1129431 0.3632631 -0.8604533 -0.84809655 -0.2887233
## g2 -0.3304489 -0.9449261 -0.7237087 -0.39023436 1.2794646
## g3 -0.8851599 0.8987476 -0.2370303 -0.38613025 0.4622897
## g4 0.1110591 0.2732632 0.9372827 0.04282367 0.4374966
## g5 -0.1576834 0.2374519 0.7612836 1.59471996 -0.8784173
##### 随机生成基因集 ######
gs <- as.list(sample(10:100, size=100, replace=TRUE))
## sample gene sets
gs <- lapply(gs, function(n, p)
paste0("g", sample(1:p, size=n, replace=FALSE)), p)
names(gs) <- paste0("gs", 1:length(gs))
class(gs)#也可以是一个列名是基因集名称,填充值为基因名的数据框
## [1] "list"
##### 完成一个最简单的gsva分析 #####
suppressMessages(gsva.es <- gsva(X, gs, verbose=FALSE))
dim(gsva.es)
## [1] 100 30
gsva.es[1:5, 1:5]
## s1 s2 s3 s4 s5
## gs1 0.049271334 -0.25022473 -0.21006688 0.0753243233 -0.05397171
## gs2 0.007766851 0.08620726 -0.12061422 -0.0676672331 -0.13035776
## gs3 0.032419529 -0.07774921 0.06682178 -0.0007734805 -0.04927616
## gs4 -0.119463196 0.03506603 0.05580050 0.1091778239 -0.08210213
## gs5 -0.125469410 0.06517393 -0.16393730 -0.2221700663 -0.07559757
也就是说,你准备好了表达矩阵和基因集之后,只要简单的一行命令就可以完成GSVA分析
GSVA用于大量的数据集计算也比较节省计算时间,得到的GSVA值也可以用于下游的分析如:
1、差异分析 differential expression
2、分类分析classification(机器学习)
3、生存分析 survival analysis
4、聚类或相关性分析 clustering/correlation analysis
5、拷贝数变异分析 copy-number variation (CNV) data
6、单核苷酸多态性 single nucleotide polymorphisms (SNPs)
如果你还不懂GSVA的原理,可以看这里:
https://mp.weixin.qq.com/s?__biz=MzAwMzIzOTk5OQ==&mid=2247484632&idx=1&sn=a9926fd8d42c902b6c787738c40de9cb&chksm=9b3f7d88ac48f49ee34b23e69ad0911fcb03932187e46cc8218e95bbd7bd38f4525862750ec6&token=1001067584&lang=zh_CN#rd
当然,生物软件开发者都是包容的,除GSVA算法本身外,GSVA包里还收纳了我们在原理中提到过的plage 、zscore、ssgsea这些方法,可以用 metohd= 来选择,默认方法当然是gsva
那我们再具体的看一下gsva函数的数据输入要求与参数选择问题
你的表达数据可以是一个简单的matrix也可以是ExpressionSet或SummarizedExperiment的对象(大家可以自行去bioconductor搜寻一下这两个包及其对象的特征)
基因集可以是一个list也可以是一个GeneSetCollection对象(结构与特征请参考:
https://bioconductor.org/packages/3.14/bioc/vignettes/GSEABase/inst/doc/GSEABase.pdf
还有一个需要注意的点是你的表达数据与基因集中的基因名称需要一一对应,如果你不知道怎么转换基因的symbol和ID:
https://mp.weixin.qq.com/s?__biz=MzAwMzIzOTk5OQ==&mid=2247484353&idx=2&sn=8fccb88c871a42c803e313be157ab293&chksm=9b3f7a91ac48f387df8b462f7318596fc5c645d6f05eefac2c598ff6875c0a2eeb902aaf5a15&token=1001067584&lang=zh_CN#rd
另外,出现以下几种情况时部分数据会被丢弃:
1、表达数据中包含常量表达式的(也就是说建议把你的矩阵都转换为数值型变量)
2、如果一个基因集中的基因没有在表达矩阵中出现,那么这个矩阵会被删除
3、在上面两条过滤执行完毕后,会丢弃不足一个基因的基因集。如果说你的所有基因与基因集都不满足要求,显然你的程序会报错
你也应该掌握以下参数的设置方法:
与表达矩阵相关的参数:
1、kcdf:我们首先需要对输入的表达数据进行kernel estimation,这是对表达数据预处理的一种方法,针对不同类别的输入数据,你需要更改它的默认方式:
(1) kcdf的默认值是“Gaussian”,适合用于连续表达数据例如,微阵列芯片的对数标准化结果、RNA-seq的log-CPMs,log-RPKMs or log-TPMs都属于高斯分布;
(2) “Poisson”,适合整数的count结果,也就是说RNA-Seq的reads数比对结果,这类数据属于泊松分布,单细胞中如何应用我们将会在下一篇推送中详解
(3) “none”, 不进行kernel estimation直接计算CDF (cumulative distribution functions)
2、mx.diff:用两种方法计算two Kolmogorov-Smirnov random walk statistics,默认值为TRUE,计算最大和“最负”的random walk deviationsm, FALSE计算的是他们到0的距离。如果不理解的话,就按默认值来
3、abs.ranking:在mx.diff=TRUE的时候生效默认为FALSE,用于计算富集分数,取最大的正和负随机行走偏差之间的幅度差。当abs.ranking=TRUE时,使用原始的Kuiper统计量,将最大的正的和最大的负的随机行走偏差相加。在这种情况下,基因富集在两个极端(高或低)的基因集将被认为是高度激活的。
4、tau:随机游动中尾部权重的指数,gsva中默认为1,ssgesa中默认为0.25
不过我相信大部分同学和我一样,对于这些参数是不太理解的,不过不用担心,默认参数就能满足大部分的分析需求,唯一需要注意的是你需要根据你的输入数据调整kcdf参数
基因集方面的话,结构比较简单,如果你想找一些数据库中的基因集,不妨试试
GO(http://geneontology.org/)
GSEA的MSigDB(https://www.gsea-msigdb.org/gsea/msigdb).当然,很多时候这里没有你所需要的基因集,那么你也可以通过查阅文件或者一些其他的特定数据库去总结、归纳,DIY一些你的专属基因集
我们也可以从org.Hs.eg.db系列包中去找一些基因集
suppressMessages(if(!require(org.Hs.eg.db))install.packages('org.Hs.eg.db'))#小鼠的就换成 org.Mm.eg.db
## Warning: package 'BiocGenerics' was built under R version 4.0.5
goannot <- select(org.Hs.eg.db, keys=keys(org.Hs.eg.db), columns="GO")
## 'select()' returned 1:many mapping between keys and columns
head(goannot)
## ENTREZID GO EVIDENCE ONTOLOGY
## 1 1 GO:0002576 TAS BP
## 2 1 GO:0003674 ND MF
## 3 1 GO:0005576 HDA CC
## 4 1 GO:0005576 IDA CC
## 5 1 GO:0005576 TAS CC
## 6 1 GO:0005615 HDA CC
genesbygo <- split(goannot$ENTREZID, goannot$GO)
length(genesbygo)
## [1] 18348
head(genesbygo)
## $`GO:0000002`
## [1] "291" "1890" "4205" "4358" "4976" "9361" "10000" "55154" "55186"
## [10] "80119" "84275" "92667"
##
## $`GO:0000003`
## [1] "2796" "2797" "8510" "286826"
##
## $`GO:0000009`
## [1] "55650" "79087"
##
## $`GO:0000010`
## [1] "23590" "57107"
##
## $`GO:0000012`
## [1] "1161" "2074" "3981" "7141" "7515" "23411"
## [7] "54840" "54840" "55775" "55775" "55775" "200558"
## [13] "100133315"
##
## $`GO:0000014`
## [1] "2021" "2067" "2072" "4361" "4361" "5932" "5932" "6419" "9941"
## [10] "10111" "10111" "64421"
suppressMessages(if(!require(GSEABase))BiocManager::install('GSEABase'))
## Warning: package 'XML' was built under R version 4.0.5
suppressMessages(if(!require(GSVAdata))BiocManager::install('GSVAdata'))
data(c2BroadSets)
class(c2BroadSets)
## [1] "GeneSetCollection"
## attr(,"package")
## [1] "GSEABase"
c2BroadSets
## GeneSetCollection
## names: NAKAMURA_CANCER_MICROENVIRONMENT_UP, NAKAMURA_CANCER_MICROENVIRONMENT_DN, ..., ST_PHOSPHOINOSITIDE_3_KINASE_PATHWAY (3272 total)
## unique identifiers: 5167, 100288400, ..., 57191 (29340 total)
## types in collection:
## geneIdType: EntrezIdentifier (1 total)
## collectionType: BroadCollection (1 total)
GeneSetCollection
## standardGeneric for "GeneSetCollection" defined from package "GSEABase"
##
## function (object, ..., idType, setType)
## standardGeneric("GeneSetCollection")
## <bytecode: 0x000000003b8978a8>
## <environment: 0x000000003b7d7050>
## Methods may be defined for arguments: object, idType, setType
## Use showMethods("GeneSetCollection") for currently available ones.
我们拿上篇推送中作者提到的微阵列及RNA-Seq数据测试一下
##### 给大家展示如何从这些包中取出基因集数据 #####
suppressMessages(library(Biobase))
data(commonPickrellHuang)
#### 看看两个基因集间的行名列名是否都一致
identical(featureNames(huangArrayRMAnoBatchCommon_eset),
featureNames(pickrellCountsArgonneCQNcommon_eset))
## [1] TRUE
identical(sampleNames(huangArrayRMAnoBatchCommon_eset),
sampleNames(pickrellCountsArgonneCQNcommon_eset))
## [1] TRUE
canonicalC2BroadSets <- c2BroadSets[c(grep("^KEGG", names(c2BroadSets)),
grep("^REACTOME", names(c2BroadSets)),
grep("^BIOCARTA", names(c2BroadSets)))]#grep就不用多说了吧,提一些通路出来
canonicalC2BroadSets
## GeneSetCollection
## names: KEGG_GLYCOLYSIS_GLUCONEOGENESIS, KEGG_CITRATE_CYCLE_TCA_CYCLE, ..., BIOCARTA_ACTINY_PATHWAY (833 total)
## unique identifiers: 55902, 2645, ..., 8544 (6744 total)
## types in collection:
## geneIdType: EntrezIdentifier (1 total)
## collectionType: BroadCollection (1 total)
GeneSetCollection
## standardGeneric for "GeneSetCollection" defined from package "GSEABase"
##
## function (object, ..., idType, setType)
## standardGeneric("GeneSetCollection")
## <bytecode: 0x000000003b8978a8>
## <environment: 0x000000003b7d7050>
## Methods may be defined for arguments: object, idType, setType
## Use showMethods("GeneSetCollection") for currently available ones.
data(genderGenesEntrez)
MSY <- GeneSet(msYgenesEntrez, geneIdType=EntrezIdentifier(),
collectionType=BroadCollection(category="c2"), setName="MSY")
MSY
## setName: MSY
## geneIds: 266, 84663, ..., 353513 (total: 34)
## geneIdType: EntrezId
## collectionType: Broad
## bcCategory: c2 (Curated)
## bcSubCategory: NA
## details: use 'details(object)'
XiE <- GeneSet(XiEgenesEntrez, geneIdType=EntrezIdentifier(),
collectionType=BroadCollection(category="c2"), setName="XiE")
XiE
## setName: XiE
## geneIds: 293, 8623, ..., 1121 (total: 66)
## geneIdType: EntrezId
## collectionType: Broad
## bcCategory: c2 (Curated)
## bcSubCategory: NA
## details: use 'details(object)'
canonicalC2BroadSets <- GeneSetCollection(c(canonicalC2BroadSets, MSY, XiE))
canonicalC2BroadSets
## GeneSetCollection
## names: KEGG_GLYCOLYSIS_GLUCONEOGENESIS, KEGG_CITRATE_CYCLE_TCA_CYCLE, ..., XiE (835 total)
## unique identifiers: 55902, 2645, ..., 1121 (6810 total)
## types in collection:
## geneIdType: EntrezIdentifier (1 total)
## collectionType: BroadCollection (1 total)
GeneSetCollection
## standardGeneric for "GeneSetCollection" defined from package "GSEABase"
##
## function (object, ..., idType, setType)
## standardGeneric("GeneSetCollection")
## <bytecode: 0x000000003b8978a8>
## <environment: 0x000000003b7d7050>
## Methods may be defined for arguments: object, idType, setType
## Use showMethods("GeneSetCollection") for currently available ones.
接下来就是gsva分析啦,其实一行也就搞定了
suppressMessages(
esmicro <- gsva(huangArrayRMAnoBatchCommon_eset, canonicalC2BroadSets, min.sz=5, max.sz=500,
kcdf='Gaussian')#微阵列的数据符合Gaussian分布
)
## Mapping identifiers between gene sets and feature names
## Estimating GSVA scores for 806 gene sets.
## Estimating ECDFs with Gaussian kernels
##
suppressMessages(
esrnaseq <- gsva(pickrellCountsArgonneCQNcommon_eset, canonicalC2BroadSets, min.sz=5, max.sz=500,
kcdf="Poisson")#RNA-Seq的数据符合Poisson分布
)
## Mapping identifiers between gene sets and feature names
## Estimating GSVA scores for 806 gene sets.
## Estimating ECDFs with Poisson kernels
探究一下上面在微阵列和RNA-Seq之间GSVA分析结果的相关性
相关性不理解的话可以看这里:
https://mp.weixin.qq.com/s?__biz=MzAwMzIzOTk5OQ==&mid=2247484429&idx=1&sn=09f1b31c30a105add532fa17e8715e5a&chksm=9b3f7d5dac48f44bf739ec49a7827923eb0e87a8b26510627cf89c121e63c06561901720ab6c&token=1001067584&lang=zh_CN#rd
suppressMessages(if(!require(edgeR))BiocManager::install('edgeR'))
lcpms <- cpm(exprs(pickrellCountsArgonneCQNcommon_eset), log=TRUE)#把不连续的count转换为连续的cpm值(表达量的一种)
genecorrs <- sapply(1:nrow(lcpms),
function(i, expmicro, exprnaseq) cor(expmicro[i, ], exprnaseq[i, ],
method="spearman"),
exprs(huangArrayRMAnoBatchCommon_eset), lcpms)#计算表达量之间的相关性
names(genecorrs) <- rownames(lcpms)
genecorrs[1:5]
## 8567 23139 7580 55619 3008
## 0.3248391 0.2808237 0.5119691 0.3696268 0.1348777
pwycorrs <- sapply(1:nrow(esmicro),
function(i, esmicro, esrnaseq) cor(esmicro[i, ], esrnaseq[i, ],
method="spearman"),
exprs(esmicro), exprs(esrnaseq))#计算gsva值之间的相关性
names(pwycorrs) <- rownames(esmicro)
pwycorrs[1:5]
## KEGG_GLYCOLYSIS_GLUCONEOGENESIS
## 0.16808237
## KEGG_CITRATE_CYCLE_TCA_CYCLE
## 0.01415701
## KEGG_PENTOSE_PHOSPHATE_PATHWAY
## 0.09806950
## KEGG_PENTOSE_AND_GLUCURONATE_INTERCONVERSIONS
## 0.56447876
## KEGG_FRUCTOSE_AND_MANNOSE_METABOLISM
## -0.09111969
##### 简单看一下两种相关性的分布
par(mfrow=c(1, 2), mar=c(4, 5, 3, 2))
hist(genecorrs, xlab="Spearman correlation", main="Gene level\n(RNA-seq log-CPMs vs microarray RMA)",
xlim=c(-1, 1), col="grey", las=1)
hist(pwycorrs, xlab="Spearman correlation", main="Pathway level\n(GSVA enrichment scores)",
xlim=c(-1, 1), col="grey", las=1)
看似分布情况是差不多的,当然你如果想知道这两组数据具体属于什么分布,你可以通过这里实现:
可以看出两个数据集之间的R值都在0.8附近,相关性还是比较可观的:
par(mfrow=c(1, 2))
rmsy <- cor(exprs(esrnaseq)["MSY", ], exprs(esmicro)["MSY", ])
plot(exprs(esrnaseq)["MSY", ], exprs(esmicro)["MSY", ], xlab="GSVA scores RNA-seq",
ylab="GSVA scores microarray", main=sprintf("MSY R=%.2f", rmsy), las=1, type="n")
abline(lm(exprs(esmicro)["MSY", ] ~ exprs(esrnaseq)["MSY", ]), lwd=2, lty=2, col="grey")
points(exprs(esrnaseq["MSY", pickrellCountsArgonneCQNcommon_eset$Gender == "Female"]),
exprs(esmicro)["MSY", huangArrayRMAnoBatchCommon_eset$Gender == "Female"],
col="red", pch=21, bg="red", cex=1)
points(exprs(esrnaseq)["MSY", pickrellCountsArgonneCQNcommon_eset$Gender == "Male"],
exprs(esmicro)["MSY", huangArrayRMAnoBatchCommon_eset$Gender == "Male"],
col="blue", pch=21, bg="blue", cex=1)
legend("topleft", c("female", "male"), pch=21, col=c("red", "blue"), pt.bg=c("red", "blue"), inset=0.01)
rxie <- cor(exprs(esrnaseq)["XiE", ], exprs(esmicro)["XiE", ])
plot(exprs(esrnaseq)["XiE", ], exprs(esmicro)["XiE", ], xlab="GSVA scores RNA-seq",
ylab="GSVA scores microarray", main=sprintf("XiE R=%.2f", rxie), las=1, type="n")
abline(lm(exprs(esmicro)["XiE", ] ~ exprs(esrnaseq)["XiE", ]), lwd=2, lty=2, col="grey")
points(exprs(esrnaseq["XiE", pickrellCountsArgonneCQNcommon_eset$Gender == "Female"]),
exprs(esmicro)["XiE", huangArrayRMAnoBatchCommon_eset$Gender == "Female"],
col="red", pch=21, bg="red", cex=1)
points(exprs(esrnaseq)["XiE", pickrellCountsArgonneCQNcommon_eset$Gender == "Male"],
exprs(esmicro)["XiE", huangArrayRMAnoBatchCommon_eset$Gender == "Male"],
col="blue", pch=21, bg="blue", cex=1)
legend("topleft", c("female", "male"), pch=21, col=c("red", "blue"), pt.bg=c("red", "blue"), inset=0.01)
当然,gsva的结果可以用来绘制热图,热图的绘制请看这里:
GSVA的结果也可以像RNA-Seq的表达量一样通过limma来完成差异分析
gsva_result <- esrnaseq #就拿上面的gsva结果来做吧
if(!require(limma))BiocManager::install('limma')
mygroup <- c(rep('GroupA',18),rep('GroupB',18))
mod <- model.matrix(~ factor(mygroup))
colnames(mod) <- c("GroupA", "GroupA_vs_GroupB")
fit <- lmFit(gsva_result, mod)
fit <- eBayes(fit)
res <- decideTests(fit, p.value=0.01)
summary(res)
## GroupA GroupA_vs_GroupB
## Down 0 0
## NotSig 806 806
## Up 0 0
#emm,我随机取样的数据,所以大都没有显著性
tt <- topTable(fit, coef=2, n=Inf)
head(tt)
## logFC AveExpr t
## BIOCARTA_CYTOKINE_PATHWAY 0.3760662 0.024081363 3.311998
## KEGG_HYPERTROPHIC_CARDIOMYOPATHY_HCM 0.2195076 0.001096574 2.962617
## BIOCARTA_ERYTH_PATHWAY 0.3667554 0.014098001 2.943412
## BIOCARTA_INFLAM_PATHWAY 0.2902441 0.002935639 2.874580
## KEGG_NON_HOMOLOGOUS_END_JOINING -0.2955033 -0.008181424 -2.780495
## REACTOME_EARLY_PHASE_OF_HIV_LIFE_CYCLE -0.2722189 -0.023766463 -2.779801
## P.Value adj.P.Val B
## BIOCARTA_CYTOKINE_PATHWAY 0.001894812 0.4229838 -2.306902
## KEGG_HYPERTROPHIC_CARDIOMYOPATHY_HCM 0.004974079 0.4229838 -2.780193
## BIOCARTA_ERYTH_PATHWAY 0.005236917 0.4229838 -2.805492
## BIOCARTA_INFLAM_PATHWAY 0.006289433 0.4229838 -2.895478
## KEGG_NON_HOMOLOGOUS_END_JOINING 0.008048862 0.4229838 -3.016661
## REACTOME_EARLY_PHASE_OF_HIV_LIFE_CYCLE 0.008063382 0.4229838 -3.017546
DEpwys <- rownames(tt)[tt$adj.P.Val <= 0.01]
plot(tt$logFC, -log10(tt$P.Value), pch=".", cex=4, col=grey(0.75),
main="", xlab="GSVA enrichment score difference", ylab=expression(-log[10]~~Raw~P-value))
#### 随便看一下差异分析的结果吧
#本来准备写个gsva的shiny在线版本,结果发现是我自作多情了,人作者早就给你们写好啦
suppressMessages(if(!require(shinythemes))install.packages('shinythemes'))
## Warning: package 'shinythemes' was built under R version 4.0.5
suppressMessages(if(!require(shiny))install.packages('shiny'))
res <- igsva()
效果如下,如果这个在线的工具需要出教程的话,右下角点个赞我看看有多少,如果不到两百个我应该就不用花这个时间了吧