use clusterProfiler as an universal enrichment analysis tool
总有人觉得只是做GO/KEGG的,其实那只不过是内置「快捷方式」而已,clusterProfiler是个通用工具,允许使用用户自己数据进化富集分析,那么就是你想做COG,domain的富集分析都不在话下。你要做KEGG,不一定要用
enrichKEGG
和gseKEGG
,你要做GO,也不一定要通过enrichGO
和gseGO
,所以非模式、没有OrgDb,KEGG数据库没收录,这些都不是问题,你要想的是怎么搞个注释出来,各种数据库、(帮你做注释的)软件,搞到注释之后,今天要介绍的,就是你想要的,抛开那些条条框框。
clusterProfiler supports enrichment analysis of both hypergeometric test and gene set enrichment analysis. It internally supports Gene Ontology analysis of about 20 species, Kyoto Encyclopedia of Genes and Genomes (KEGG) with all species that have annotation available in KEGG database, DAVID annotation (only hypergeometric test supported), Disease Ontology and Network of Cancer Genes (via DOSE for human) and Reactome Pathway (via ReactomePA for several species). This is still not enough for users may want to analyze their data with unsupported organisms, slim version of
GO, novel functional annotation (eg GO via blastgo and KEGG via KAAS), unsupported ontology/pathway or customized annotation.
clusterProfiler provides enricher function for hypergeometric test and GSEA function for gene set enrichment analysis that are designed to accept user defined annotation. They accept two additional parameters TERM2GENE and
TERM2NAME. As indicated in the parameter names, TERM2GENE is a data.frame with first column of term ID and second column of corresponding mapped gene and TERM2NAME is a data.frame with first column of term ID and second column of corresponding term name. TERM2NAME is optional.
Here I used DisGeNET which contains 381056 associations, between 16666 genes and 13172 diseases, as an example to demonstrate the usage of enricher and GSEA functions.
> require(DOSE)
> require(clusterProfiler)
> data(geneList)
> deg <- names(geneList)[abs(geneList)>2]
## downloaded from http://www.disgenet.org/ds/DisGeNET/results/all_gene_disease_associations.tar.gz
> gda <- read.delim("all_gene_disease_associations.txt")
> dim(gda)
[1] 415681 9
> head(gda)
geneId geneSymbol geneName diseaseId
1 540 ATP7B ATPase, Cu++ transporting, beta polypeptide umls:C0019202
2 540 ATP7B ATPase, Cu++ transporting, beta polypeptide umls:C0019202
3 540 ATP7B ATPase, Cu++ transporting, beta polypeptide umls:C0019202
4 4160 MC4R melanocortin 4 receptor umls:C0028754
5 4160 MC4R melanocortin 4 receptor umls:C0028754
6 3667 IRS1 insulin receptor substrate 1 umls:C0011860
diseaseName score NumberOfPubmeds associationType
1 Hepatolenticular Degeneration 0.9898764 229 GeneticVariation
2 Hepatolenticular Degeneration 0.9898764 229 Biomarker
3 Hepatolenticular Degeneration 0.9898764 229 AlteredExpression
4 Obesity 0.9400000 234 GeneticVariation
5 Obesity 0.9400000 234 Biomarker
6 Diabetes Mellitus, Type 2 0.9077659 106 GeneticVariation
source
1 UNIPROT, CTD_human, MGD, RGD, GAD, LHGDN, BeFree
2 UNIPROT, CTD_human, MGD, RGD, GAD, LHGDN, BeFree
3 UNIPROT, CTD_human, MGD, RGD, GAD, LHGDN, BeFree
4 UNIPROT, CTD_human, MGD, RGD, GAD, BeFree
5 UNIPROT, CTD_human, MGD, RGD, GAD, BeFree
6 UNIPROT, CTD_human, MGD, RGD, GAD, BeFree
>
> disease2gene=gda[, c("diseaseId", "geneId")]
> disease2name=gda[, c("diseaseId", "diseaseName")]
> x = enricher(deg, TERM2GENE=disease2gene, TERM2NAME=disease2name)
> head(summary(x))
ID Description GeneRatio BgRatio
umls:C0006142 umls:C0006142 Malignant neoplasm breast 89/196 3324/16666
umls:C0006826 umls:C0006826 NEOPLASM MALIGNANT 113/196 5102/16666
umls:C0678222 umls:C0678222 Breast Carcinoma 83/196 3072/16666
umls:C1458156 umls:C1458156 Recurrent Malignant Neoplasm 47/196 1158/16666
umls:C1458155 umls:C1458155 Breast Neoplasms 60/196 1981/16666
umls:C0600139 umls:C0600139 prostate carcinoma 56/196 2033/16666
pvalue p.adjust qvalue
umls:C0006142 4.934860e-16 9.845045e-13 8.249008e-13
umls:C0006826 3.071791e-15 3.064112e-12 2.567371e-12
umls:C0678222 5.894907e-15 3.920113e-12 3.284601e-12
umls:C1458156 3.230248e-14 1.611086e-11 1.349904e-11
umls:C1458155 1.724317e-12 6.880024e-10 5.764664e-10
umls:C0600139 5.025005e-10 1.670814e-07 1.399949e-07
geneID
umls:C0006142 4312/10874/991/6280/2305/9493/1062/4605/9833/9133/6279/10403/8685/597/7153/6278/259266/3627/27074/6241/7368/11065/55355/9582/55872/51203/3669/22974/10460/10563/4751/820/27338/890/983/4085/6362/9837/5918/332/3832/6286/5163/2146/3002/50852/7272/2568/9212/51659/1111/9055/4321/3620/6790/891/4174/9232/9370/1602/4629/771/3117/80129/23158/125/1311/5174/4250/2167/652/4036/4137/8839/2066/4693/3158/3169/1408/9547/2922/10647/5304/8614/2625/7802/9/5241/10551
umls:C0006826 4312/10874/55388/991/6280/2305/1062/3868/4605/9833/6279/10403/597/7153/6278/79733/3627/27074/6241/55165/9787/7368/11065/9582/55872/51203/3669/83461/22974/10460/10563/4751/6373/8140/1844/4283/27299/27338/890/9415/983/10232/4085/6362/5080/5918/81620/332/3832/6286/5163/2146/3002/50852/7272/2568/2842/9212/8715/1111/9055/3833/4321/11182/10112/3902/3620/3887/51514/6790/4521/891/8544/24137/10578/9232/9370/1602/3708/9122/10699/4629/771/3117/23158/125/4958/1311/2018/1308/4250/652/80736/4036/8839/2066/4693/3169/1408/9547/2922/1524/1580/10647/5304/8614/2625/7802/11122/9/5241/10551/4969
umls:C0678222 4312/10874/6280/2305/4605/9833/9133/6279/10403/8685/597/7153/6278/259266/3627/27074/6241/55165/11065/55355/9582/55872/51203/3669/10563/4751/820/27338/890/983/4085/6362/9837/5918/332/6286/5163/2146/3002/50852/7272/2568/9212/1111/9055/4321/3620/6790/891/9232/9370/1602/3708/4629/771/3117/23158/125/1311/5174/2532/4250/2167/652/4036/4137/8839/2066/4693/3158/3169/9547/2922/1524/10647/5304/8614/2625/7021/7802/9/5241/10551
umls:C1458156 4312/991/6280/6279/8685/7153/6278/259266/3627/27074/6241/55165/9787/3669/22974/983/10232/4085/5080/332/2146/3002/50852/2568/9212/4321/3620/3887/6790/891/4174/9232/3708/4629/771/3117/23158/730/2018/4036/2066/9547/2625/9/5241/10551/57758
umls:C1458155 4312/10874/2305/4605/9833/9133/10403/597/7153/6278/259266/3627/11065/9582/3669/10563/8140/820/1844/27338/890/9415/983/81620/332/2146/3002/2568/9212/51659/1111/9055/11182/3620/51514/6790/4521/891/8544/9232/9370/1602/771/23158/125/4250/2167/652/8839/2066/3169/10647/5304/8614/2625/7021/9/5241/10551/57758
umls:C0600139 4312/6280/2305/4605/9833/6279/7153/3627/6241/55165/11065/22974/10563/8140/1844/890/983/5080/5918/6286/5163/2146/9212/1111/4321/3620/6790/891/8544/9232/9370/8857/1602/3708/23090/4629/3117/23158/2532/2167/652/80736/4036/3169/9547/2922/11283/1524/5304/8614/2625/79689/11122/9/5241/10551
Count
umls:C0006142 89
umls:C0006826 113
umls:C0678222 83
umls:C1458156 47
umls:C1458155 60
umls:C0600139 56
>
> barplot(x)
> y = GSEA(geneList, TERM2GENE=disease2gene, TERM2NAME=disease2name)
> head(summary(y))
ID Description setSize enrichmentScore
umls:C0001418 umls:C0001418 Adenocarcinoma 1224 0.2431172
umls:C0001973 umls:C0001973 Alcoholism 548 -0.3213479
umls:C0002171 umls:C0002171 Alopecia Areata 74 0.5122605
umls:C0003864 umls:C0003864 Arthritis 447 0.3140411
umls:C0003872 umls:C0003872 Arthritis, Psoriatic 122 0.4311280
umls:C0003873 umls:C0003873 Arthritis, Rheumatoid 1186 0.2855594
pvalue p.adjust qvalues
umls:C0001418 0 0 0
umls:C0001973 0 0 0
umls:C0002171 0 0 0
umls:C0003864 0 0 0
umls:C0003872 0 0 0
umls:C0003873 0 0 0
> gseaplot(y, "umls:C0003872")
Citation
Yu G, Wang L, Han Y and He Q*. clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS: A Journal of Integrative Biology. 2012, 16(5):284-287.