clusterProfiler can convert biological IDs using an OrgDb object via the bitr function. I have now implemented another function, bitr_kegg, for converting IDs through the KEGG API.
library(clusterProfiler)
data(gcSample)
hg <- gcSample[[1]]
head(hg)
# [1] "4597" "7111" "5266" "2175" "755" "23046"
eg2np <- bitr_kegg(hg, fromType='kegg', toType='ncbi-proteinid', organism='hsa')
# Warning in bitr_kegg(hg, fromType = "kegg", toType = "ncbi-proteinid",
# organism = "hsa"): 3.7% of input gene IDs are fail to map...
head(eg2np)
# kegg ncbi-proteinid
# 1 8326 NP_003499
# 2 58487 NP_001034707
# 3 139081 NP_619647
# 4 59272 NP_068576
# 5 993 NP_001780
# 6 2676 NP_001487
np2up <- bitr_kegg(eg2np[,2], fromType='ncbi-proteinid', toType='uniprot', organism='hsa')
head(np2up)
# ncbi-proteinid uniprot
# 1 NP_005457 O75586
# 2 NP_005792 P41567
# 3 NP_005792 Q6IAV3
# 4 NP_037536 Q13421
# 5 NP_006054 O60662
# 6 NP_001092002 O95398
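The two conversions above can be chained into a single kegg-to-uniprot table by joining on the shared ncbi-proteinid column. The following is a minimal sketch in base R, not a clusterProfiler feature: the data frames are made-up stand-ins for eg2np and np2up so that it runs without network access, and every ID in them is a placeholder.

```r
# Made-up stand-ins for the two bitr_kegg() results. The column names mirror
# the real output, but every ID below is a placeholder, not a real identifier.
eg2np <- data.frame(
  kegg = c("g1", "g2", "g3"),
  `ncbi-proteinid` = c("NP_A", "NP_B", "NP_C"),
  check.names = FALSE, stringsAsFactors = FALSE
)
np2up <- data.frame(
  # One protein ID may map to several UniProt entries, as NP_005792 does above.
  `ncbi-proteinid` = c("NP_A", "NP_B", "NP_B"),
  uniprot = c("U1", "U2", "U3"),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Inner join on the shared column; NP_C has no UniProt match, so g3 is dropped.
eg2up <- merge(eg2np, np2up, by = "ncbi-proteinid")

# Keep only the end points of the chain and remove duplicated pairs.
eg2up <- unique(eg2up[, c("kegg", "uniprot")])
print(eg2up)
```

With real data, the same merge call should work on the eg2np and np2up objects created above. Note that IDs which fail to map at either step silently disappear from the joined table.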
The ID type (both fromType and toType) should be one of 'kegg', 'ncbi-geneid', 'ncbi-proteinid' or 'uniprot'. 'kegg' is the primary ID used in the KEGG database, whose gene data come from NCBI. As a rule of thumb, the 'kegg' ID is the entrezgene ID for eukaryotic species and the Locus ID for prokaryotes.
Many prokaryotic species don't have entrezgene IDs available. For example, we can check the gene information of ece:Z5100 at http://www.genome.jp/dbget-bin/www_bget?ece:Z5100, which has NCBI-ProteinID and UniProt links in the Other DBs entry, but not NCBI-GeneID.
If we try to convert Z5100 to ncbi-geneid, bitr_kegg will throw an error saying that ncbi-geneid is not supported.
bitr_kegg("Z5100", fromType="kegg", toType='ncbi-geneid', organism='ece')
## Error in KEGG_convert(fromType, toType, organism) :
## ncbi-geneid is not supported for ece ...
We can of course convert it to ncbi-proteinid and uniprot:
bitr_kegg("Z5100", fromType="kegg", toType='ncbi-proteinid', organism='ece')
## kegg ncbi-proteinid
## 1 Z5100 AAG58814
bitr_kegg("Z5100", fromType="kegg", toType='uniprot', organism='ece')
## kegg uniprot
## 1 Z5100 Q7DB85
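Since the supported ID types differ between organisms, a script that handles several species may want to fall back to another target type when one is rejected. Here is a rough sketch of that idea; convert_first_supported is a hypothetical helper and not part of clusterProfiler, and it assumes bitr_kegg signals an unsupported type with a regular R error, as in the output above.

```r
# Hypothetical helper (not part of clusterProfiler): try each target ID type
# in turn and return the first conversion that succeeds.
convert_first_supported <- function(ids, organism, to_types,
                                    convert = clusterProfiler::bitr_kegg) {
  for (to in to_types) {
    res <- tryCatch(
      convert(ids, fromType = "kegg", toType = to, organism = organism),
      # Any error, e.g. "ncbi-geneid is not supported for ece", means we
      # move on to the next candidate type.
      error = function(e) NULL
    )
    if (!is.null(res)) return(res)
  }
  stop("none of the requested ID types is supported for ", organism)
}

# Usage (needs clusterProfiler and network access to the KEGG API):
# convert_first_supported("Z5100", "ece", c("ncbi-geneid", "ncbi-proteinid"))
```

The convert argument exists only so that the helper can be tried with a stand-in function. Be aware that this sketch treats every error as "not supported", including network failures, so a stricter version would inspect the error message before falling back.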
search_kegg_organism
clusterProfiler supports more than 4k species listed in http://www.genome.jp/kegg/catalog/org_list.html for the hypergeometric test (enrichKEGG and enrichMKEGG) and GSEA (gseKEGG and gseMKEGG). We can use bitr_kegg to convert IDs for all of these species. To make it easier to look up the scientific name abbreviation used in the organism parameter of these functions, I implemented the search_kegg_organism function. We can search by kegg_code, scientific_name or common_name (which is not available for prokaryotes).
search_kegg_organism('ece', by='kegg_code')
# kegg_code scientific_name common_name
# 334 ece Escherichia coli O157:H7 EDL933 (EHEC) <NA>
ecoli <- search_kegg_organism('Escherichia coli', by='scientific_name')
dim(ecoli)
# [1] 64 3
head(ecoli)
# kegg_code scientific_name common_name
# 329 eco Escherichia coli K-12 MG1655 <NA>
# 330 ecj Escherichia coli K-12 W3110 <NA>
# 331 ecd Escherichia coli K-12 DH10B <NA>
# 332 ebw Escherichia coli BW2952 <NA>
# 333 ecok Escherichia coli K-12 MDS42 <NA>
# 334 ece Escherichia coli O157:H7 EDL933 (EHEC) <NA>
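The result of search_kegg_organism is an ordinary data frame, so it can be narrowed down further with base R. A small sketch: the ecoli table below is rebuilt by hand from the six rows printed above so that the snippet runs on its own, and the search for "K-12" is just an illustrative choice.

```r
# Rows copied from the head(ecoli) output above, rebuilt by hand so the
# snippet runs without clusterProfiler or network access.
ecoli <- data.frame(
  kegg_code = c("eco", "ecj", "ecd", "ebw", "ecok", "ece"),
  scientific_name = c(
    "Escherichia coli K-12 MG1655",
    "Escherichia coli K-12 W3110",
    "Escherichia coli K-12 DH10B",
    "Escherichia coli BW2952",
    "Escherichia coli K-12 MDS42",
    "Escherichia coli O157:H7 EDL933 (EHEC)"
  ),
  common_name = NA_character_,
  stringsAsFactors = FALSE
)

# Keep only the K-12 strains; fixed = TRUE matches the literal text "K-12".
k12 <- ecoli[grepl("K-12", ecoli$scientific_name, fixed = TRUE), ]

# These codes are what the organism parameter expects.
print(k12$kegg_code)
# [1] "eco"  "ecj"  "ecd"  "ecok"
```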
keyType parameter
With the ID conversion utilities built into clusterProfiler, I added a keyType parameter to enrichKEGG, enrichMKEGG, gseKEGG and gseMKEGG. Now we can use an ID type that is not the primary ID in the KEGG database.
x <- enrichKEGG(np2up[,2], organism='hsa', keyType='uniprot')
head(summary(x))
# ID Description GeneRatio
# hsa04072 hsa04072 Phospholipase D signaling pathway 11/133
# hsa04060 hsa04060 Cytokine-cytokine receptor interaction 14/133
# hsa04390 hsa04390 Hippo signaling pathway 10/133
# hsa04975 hsa04975 Fat digestion and absorption 5/133
# hsa05221 hsa05221 Acute myeloid leukemia 6/133
# BgRatio pvalue p.adjust qvalue
# hsa04072 216/9275 0.0002654190 0.03901659 0.03240905
# hsa04060 354/9275 0.0005349245 0.03931695 0.03265855
# hsa04390 213/9275 0.0009536247 0.04199404 0.03488227
# hsa04975 58/9275 0.0014014886 0.04199404 0.03488227
# hsa05221 86/9275 0.0014283687 0.04199404 0.03488227
# geneID
# hsa04072 O95398/Q99777/P49619/Q6FGP0/Q8WVM9/O14807/P41594/A8K5P7/P10145/A0A024RDA5/P16234
# hsa04060 A0N0N3/O00574/P19876/P01589/P10145/A0A024RDA5/B4DGA4/Q99665/P16234/P78556/Q6I9S7/P42830/P27930/Q9UBN6
# hsa04390 Q8WW10/A8K141/Q9UI47/P35240/A0A024R1J8/Q659G9/Q9UJU2/P22003/M9VUD0/O00144
# hsa04975 Q9UNK4/A0A087WZT4/A0A0C4DFX6/Q9UHC9/P04054
# hsa05221 Q659G9/Q9UJU2/Q03181/A0A024RCW6/Q06455/B2R6I9
# Count
# hsa04072 11
# hsa04060 14
# hsa04390 10
# hsa04975 5
# hsa05221 6
setReadable
For GO analysis, we have a readable parameter that controls whether the IDs are translated to human-readable gene names. This parameter is not available for KEGG analysis. But we can still translate the input gene IDs to gene names using the setReadable function, if and only if the corresponding OrgDb object is available.
y <- setReadable(x, 'org.Hs.eg.db', keytype="UNIPROT")
head(summary(y))
# ID Description GeneRatio
# hsa04072 hsa04072 Phospholipase D signaling pathway 11/133
# hsa04060 hsa04060 Cytokine-cytokine receptor interaction 14/133
# hsa04390 hsa04390 Hippo signaling pathway 10/133
# hsa04975 hsa04975 Fat digestion and absorption 5/133
# hsa05221 hsa05221 Acute myeloid leukemia 6/133
# BgRatio pvalue p.adjust qvalue
# hsa04072 216/9275 0.0002654190 0.03901659 0.03240905
# hsa04060 354/9275 0.0005349245 0.03931695 0.03265855
# hsa04390 213/9275 0.0009536247 0.04199404 0.03488227
# hsa04975 58/9275 0.0014014886 0.04199404 0.03488227
# hsa05221 86/9275 0.0014283687 0.04199404 0.03488227
# geneID
# hsa04072 RAPGEF3/RAPGEF3/DGKG/MRAS/MRAS/MRAS/GRM5/GRM5/CXCL8/CXCL8/PDGFRA
# hsa04060 CXCR6/CXCR6/CXCL3/IL2RA/CXCL8/CXCL8/IL12RB2/IL12RB2/PDGFRA/CCL20/CXCL5/CXCL5/IL1R2/TNFRSF10D
# hsa04390 CTNNA3/CTNNA3/CTNNA3/NF2/NF2/LEF1/LEF1/BMP5/BMP5/FZD9
# hsa04975 PLA2G2D/PLA2G2D/NPC1L1/NPC1L1/PLA2G1B
# hsa05221 LEF1/LEF1/PPARD/PPARD/RUNX1T1/RUNX1T1
# Count
# hsa04072 11
# hsa04060 14
# hsa04390 10
# hsa04975 5
# hsa05221 6
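In the translated geneID column some symbols appear more than once (for example RAPGEF3/RAPGEF3), presumably because several of the input UniProt accessions belong to the same gene. If a list of distinct gene symbols per pathway is wanted, the string can be split and deduplicated in base R. This is a small sketch applied to the hsa04072 row copied from the output above, not something setReadable does for you.

```r
# geneID string of hsa04072, copied from the setReadable output above.
gene_id <- "RAPGEF3/RAPGEF3/DGKG/MRAS/MRAS/MRAS/GRM5/GRM5/CXCL8/CXCL8/PDGFRA"

# Split on the "/" separator and drop repeated symbols, keeping first-seen order.
symbols <- unique(strsplit(gene_id, "/", fixed = TRUE)[[1]])

print(symbols)
# [1] "RAPGEF3" "DGKG"    "MRAS"    "GRM5"    "CXCL8"   "PDGFRA"
```

Note that the Count column still reports 11 here, because it counts the input IDs rather than the distinct gene symbols.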
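People often ask me why there is no readable parameter when they run enricher or GSEA. Keep in mind that these two functions are general-purpose enrichment tools that make no assumptions about what you are doing (knowledge base, species or ID type). So how could I convert the IDs for you automatically? The answer is that I can't. But you yourself ought to know what you are working with, so I provide the setReadable function, which can solve part of the ID conversion problem, though certainly not all of it.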
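In addition, the post 《ko数据库ID转换》 (ID conversion with the KO database) also demonstrates ID conversion through KEGG. It extends this post: not only can gene IDs be converted to one another, but genes can also be mapped to pathways, or vice versa, and all of this is supported by clusterProfiler.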