Bioconductor Annotation Series: Inparanoid

王hh 生信菜鸟团 2022-06-07

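Introduction to Inparanoid

The Inparanoid packages collect groups of homologous genes obtained with the Inparanoid algorithm. The algorithm uses BLAST to compare the proteomes of a pair of species and derives orthology groups from the similarities. Within each orthology group, the seed orthologs serve as the reference for finding and adding other sequences that are closely similar to them. The homologous proteins that belong to an orthology group are called inparalogs, and each inparalog in a group carries a score for its similarity to the seed ortholog. The database currently covers homologous proteins among 35 species. On Bioconductor, five species each have a dataset of their homologs against all of these species: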

  • hom.Hs.inp.db for human mappings to the other 35 species

  • hom.Mm.inp.db for mouse mappings to the other 35 species

  • hom.Rn.inp.db for rat mappings to the other 35 species

  • hom.Dm.inp.db for fly mappings to the other 35 species

  • hom.Sc.inp.db for yeast mappings to the other 35 species
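To make the terminology above concrete, here is a minimal sketch in base R of what a single orthology group might look like as a table. The protein IDs and scores are invented for illustration, the column names mirror the ones the package tables expose (listed in the DBI section of this post), and treating a score of 1 as the mark of a seed ortholog is an assumption about how Inparanoid reports its clusters.

```r
# Toy orthology group: hypothetical IDs and scores, not real Inparanoid output.
group <- data.frame(
  inp_id   = c("HUMAN_P1", "HUMAN_P2", "MOUSE_P1", "MOUSE_P2"),
  clust_id = 1L,
  species  = c("HOMSA", "HOMSA", "MUSMU", "MUSMU"),
  score    = c(1.00, 0.42, 1.00, 0.77),
  stringsAsFactors = FALSE
)

# Assumption: the seed orthologs are the members whose score is 1.
seeds <- group[group$score == 1, ]

# The remaining members are the other inparalogs, scored against their seed.
others <- group[group$score < 1, ]

print(seeds$inp_id)
print(others[, c("inp_id", "species", "score")])
```

Under these assumptions, the two rows in `seeds` form the seed pair of the group, one protein per species.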

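This post covers how to use these annotation packages, for example how to obtain the homologs of a gene and the seed pairs between homologous genes.

Extracting information, using human (hom.Hs.inp.db) as the example

Each annotation package holds information for every species, organized as pairwise comparisons, for example human versus mouse. The mouse package's human comparison holds the same information, and the two can be matched up through their seed pairs. The data in hom.Hs.inp.db mainly include: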

  • hom.Hs.inp_dbconn:Collect information about the package annotation DB

  • hom.Hs.inpHOMSA: Map between IDs for genes in one organism to their predicted paralogs in another. For example, the hom.Hs.inpRATNO map would provide mappings between human and rat.

  • hom.Hs.inpMAPCOUNTS:number of mapped keys for the maps in package hom.Hs.inp.db

  • hom.Hs.inpORGANISM:The Organism for hom.Hs.inp

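The figure in the original post shows the package contents. The objects that map human to each other species are named hom.Hs.inpXXXXX, where the five characters XXXXX are the abbreviation of the corresponding species.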
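The map names seen in this post (hom.Hs.inpMUSMU, hom.Hs.inpRATNO, and the HOMSA species code) suggest that the abbreviation is the first three letters of the genus plus the first two letters of the species. The sketch below encodes that pattern. The rule is inferred from these examples rather than taken from the package documentation, and `species_code` and `map_name` are hypothetical helpers, not part of any package.

```r
# Build the five-character species code used in map names such as
# hom.Hs.inpMUSMU. The 3 + 2 letter rule is an inference from examples.
species_code <- function(binomial) {
  parts <- strsplit(binomial, " ", fixed = TRUE)[[1]]
  toupper(paste0(substr(parts[1], 1, 3), substr(parts[2], 1, 2)))
}

# Prepend the human package prefix to get the expected object name.
map_name <- function(binomial, prefix = "hom.Hs.inp") {
  paste0(prefix, species_code(binomial))
}

map_name("Mus musculus")       # "hom.Hs.inpMUSMU"
map_name("Rattus norvegicus")  # "hom.Hs.inpRATNO"
```

The authoritative list of map names is whatever `ls("package:hom.Hs.inp.db")` prints once the package is loaded, so treat a generated name only as a guess to check against that list.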

1. Direct lookup by gene

Suppose we want to see the homologous genes between human and mouse:

    > library(hom.Hs.inp.db)
    > as.list(hom.Hs.inpMUSMU[1:4])
    $ENSP00000364178
    [1] "ENSMUSP00000097561"

    $ENSP00000356224
    [1] "ENSMUSP00000051825"

    $ENSP00000386259
    [1] "ENSMUSP00000074773"

    $ENSP00000271588
    [1] "ENSMUSP00000074340"
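The list printed above maps human protein IDs to mouse protein IDs. As a minimal sketch, the same four pairs can be rebuilt as a plain named list and inverted to look up the human ID from a mouse ID. This uses only base R so that it runs without the annotation package. On the real map object, AnnotationDbi's `revmap()` is intended for the same purpose.

```r
# Human -> mouse pairs copied from the output above.
hs2mm <- list(
  ENSP00000364178 = "ENSMUSP00000097561",
  ENSP00000356224 = "ENSMUSP00000051825",
  ENSP00000386259 = "ENSMUSP00000074773",
  ENSP00000271588 = "ENSMUSP00000074340"
)

# Look up a single human protein.
hs2mm[["ENSP00000356224"]]

# Invert the list. rep() repeats each human ID once per mouse partner,
# so entries with several mouse IDs would also be handled.
mouse <- unlist(hs2mm, use.names = FALSE)
human <- rep(names(hs2mm), lengths(hs2mm))
mm2hs <- split(human, mouse)

mm2hs[["ENSMUSP00000074773"]]
```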

Because these are all annotation db packages, the IDs extracted from a hom db package can also be looked up in an org.db package:

    > # pick an arbitrary human gene
    > # load the organism annotation data for human
    > library(org.Hs.eg.db)
    > # get the Entrez gene ID and Ensembl protein ID for gene symbol "MSX2"
    > select(org.Hs.eg.db,
    +        keys="MSX2",
    +        columns=c("ENTREZID","ENSEMBLPROT"),
    +        keytype="SYMBOL")
      SYMBOL ENTREZID     ENSEMBLPROT
    1   MSX2     4488 ENSP00000239243
    2   MSX2     4488 ENSP00000427425
    > # find the homolog
    > # use the inparanoid package to get the mouse gene that is considered
    > # equivalent to Ensembl protein ID "ENSP00000239243"
    > select(hom.Hs.inp.db,
    +        keys="ENSP00000239243",
    +        columns="MUS_MUSCULUS",
    +        keytype="HOMO_SAPIENS")
         HOMO_SAPIENS       MUS_MUSCULUS
    1 ENSP00000239243 ENSMUSP00000021922
    > # load the organism annotation data for mouse
    > library(org.Mm.eg.db)
    > # get the Entrez gene ID and gene symbol for "ENSMUSP00000021922"
    > select(org.Mm.eg.db,
    +        keys="ENSMUSP00000021922",
    +        columns=c("ENTREZID","SYMBOL"),
    +        keytype="ENSEMBLPROT")
             ENSEMBLPROT ENTREZID SYMBOL
    1 ENSMUSP00000021922    17702   Msx2
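The three lookups above form a chain: human symbol to human protein, human protein to mouse protein, mouse protein to mouse symbol. The sketch below shows one way to express that chain as two joins. It runs on toy data frames that hold only the rows printed above, so it works without the annotation packages. With the packages loaded, the three `select()` results could presumably be joined the same way once the ID columns are given matching names. The column names `MOUSE_ENTREZID` and `MOUSE_SYMBOL` are made up here to avoid clashes.

```r
# Toy lookup tables holding only the rows shown in the outputs above.
hs_genes <- data.frame(SYMBOL = "MSX2", ENTREZID = "4488",
                       HOMO_SAPIENS = c("ENSP00000239243", "ENSP00000427425"),
                       stringsAsFactors = FALSE)
orthologs <- data.frame(HOMO_SAPIENS = "ENSP00000239243",
                        MUS_MUSCULUS = "ENSMUSP00000021922",
                        stringsAsFactors = FALSE)
mm_genes <- data.frame(MUS_MUSCULUS = "ENSMUSP00000021922",
                       MOUSE_ENTREZID = "17702", MOUSE_SYMBOL = "Msx2",
                       stringsAsFactors = FALSE)

# Chain the steps with inner joins; human proteins that have no
# partner in the toy ortholog table drop out of the result.
step1  <- merge(hs_genes, orthologs, by = "HOMO_SAPIENS")
result <- merge(step1, mm_genes, by = "MUS_MUSCULUS")

result[, c("SYMBOL", "HOMO_SAPIENS", "MUS_MUSCULUS", "MOUSE_SYMBOL")]
```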

2. Querying seed pairs through the DBI interface

Connect to the database:

    > library(DBI)
    > # make a connection to the human database
    > mycon <- hom.Hs.inp_dbconn()
    > # make a list of all the tables that are available in the DB
    > head(dbListTables(mycon))
    [1] "Acyrthosiphon_pisum"   "Aedes_aegypti"
    [3] "Anopheles_gambiae"     "Apis_mellifera"
    [5] "Arabidopsis_thaliana"  "Aspergillus_fumigatus"
    > # make a list of the columns in the table of interest
    > dbListFields(mycon, "mus_musculus")  # "mus_musculus" is the table name
    [1] "inp_id"      "clust_id"    "species"     "score"
    [5] "seed_status"

Use an SQL statement to look up the seed cluster ID for ENSP00000301011 in the mus_musculus table:

    > # make a query that will let us see which clust_id we need
    > sql <- "SELECT * FROM mus_musculus WHERE inp_id = 'ENSP00000301011';"
    > # retrieve the data
    > dataOut <- dbGetQuery(mycon, sql)
    > dataOut
               inp_id clust_id species score seed_status
    1 ENSP00000301011     2084   HOMSA     1        100%

Alternatively, look up the Inparanoid IDs that belong to a given cluster ID:

    > # make a query that will let us see all the data that is affiliated with a clust id
    > sql <- "SELECT * FROM mus_musculus WHERE clust_id = '1731';"
    > # retrieve the data
    > dataOut <- dbGetQuery(mycon, sql)
    > dataOut
                  inp_id clust_id species score seed_status
    1    ENSP00000273857     1731   HOMSA     1        100%
    2 ENSMUSP00000005352     1731   MUSMU     1        100%

  8. 2 ENSMUSP00000005352 1731 MUSMU 1 100%

The homologous proteins in cluster 1731 are ENSP00000273857 (human) and ENSMUSP00000005352 (mouse).
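The two queries above can be thought of as one operation: find the cluster a protein belongs to, then list every member of that cluster. Here is a minimal sketch of that logic on a plain data frame, holding only the three rows printed above, so it runs without a database connection. `cluster_members` is a hypothetical helper, and the same idea could be written against the real table as a single SQL query with a subquery on `clust_id`.

```r
# Rows copied from the query outputs above; the real table is far larger.
mus_musculus <- data.frame(
  inp_id      = c("ENSP00000301011", "ENSP00000273857", "ENSMUSP00000005352"),
  clust_id    = c(2084L, 1731L, 1731L),
  species     = c("HOMSA", "HOMSA", "MUSMU"),
  score       = c(1, 1, 1),
  seed_status = "100%",
  stringsAsFactors = FALSE
)

# Hypothetical helper: find the cluster of one protein, then return
# every row that shares that cluster ID.
cluster_members <- function(tbl, id) {
  cid <- tbl$clust_id[tbl$inp_id == id]
  tbl[tbl$clust_id %in% cid, ]
}

partners <- cluster_members(mus_musculus, "ENSP00000273857")

# Keep only the mouse side of the seed pair.
partners$inp_id[partners$species == "MUSMU"]
```

An ID that is absent from the table yields an empty data frame rather than an error, which is worth checking for before using the result.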

Other

AnnotationDbi also provides a function, inpIDMapper, that quickly extracts homologous proteins between species, which is convenient.

REF

  • http://bioconductor.org/packages/release/data/annotation/manuals/hom.Hs.inp.db/man/hom.Hs.inp.db.pdf

  • http://www.imsbio.co.jp/RGM/Rrdfile?f=hom.Hs.inp.db/man/hom.Hs.inpORGVSORGMAP.Rd&d=RBC

  • https://www.rdocumentation.org/packages/AnnotationDbi/versions/1.34.4/topics/inpIDMapper



