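Bioconductor Annotation Series: Inparanoid
Introduction to Inparanoid
The Inparanoid packages collect groups of homologous genes obtained with the InParanoid algorithm. The algorithm uses BLAST to compare the proteomes of a pair of species and builds orthology groups from the result. Within each orthology group, the seed orthologs serve as the reference around which other closely similar sequences are searched for and added. These additional homologous proteins in a group are called inparalogs, and each inparalog carries a score for its similarity to the seed ortholog. At the time of writing, the database covered homologous proteins among 35 species, and Bioconductor provided packages for 5 of them, each mapping one species against the 35 species: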
hom.Hs.inp.db for human mappings to the other 35 species
hom.Mm.inp.db for mouse mappings to the other 35 species
hom.Rn.inp.db for rat mappings to the other 35 species
hom.Dm.inp.db for fly mappings to the other 35 species
hom.Sc.inp.db for yeast mappings to the other 35 species
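Today's post covers how to use these annotation packages, for example how to find the homologs of a given gene and the seed pairs between homologous genes.
Extracting information, using human (hom.Hs.inp.db) as the example
Each annotation package holds information for every species, organized as species pairs such as human versus mouse. The human-to-mouse data agree with the mouse-to-human data in the mouse package, and the overlap between the two can be recovered through the seed pairs. The hom.Hs.inp.db package mainly contains:
hom.Hs.inp_dbconn: collects information about the package annotation DB
hom.Hs.inpHOMSA: maps IDs for genes in one organism to their predicted homologs in another; for example, the hom.Hs.inpRATNO map provides mappings between human and rat
hom.Hs.inpMAPCOUNTS: number of mapped keys for the maps in package hom.Hs.inp.db
hom.Hs.inpORGANISM: the organism for hom.Hs.inp
As the figure of the package layout shows, the objects that hold the human homologs in another species are named hom.Hs.inpXXXXX, where the five characters XXXXX are the abbreviation of that species.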
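The five-character codes seen in this post (HOMSA, MUSMU, RATNO) look like the first three letters of the genus followed by the first two letters of the species name. The base R sketch below builds a map object name on that assumption. The helper names species_code and map_object_name are made up for illustration, and the rule is inferred from the examples rather than taken from the package documentation, so check the result against the objects the package actually exports.

```r
# Assumed rule (inferred from HOMSA, MUSMU, RATNO): 3 letters of genus + 2 of species.
species_code <- function(latin_name) {
  parts <- strsplit(latin_name, "[ _]+")   # accept "Mus musculus" or "Mus_musculus"
  vapply(parts, function(p) {
    toupper(paste0(substr(p[1], 1, 3), substr(p[2], 1, 2)))
  }, character(1))
}

# Hypothetical helper: prepend the package prefix to get the map object name.
map_object_name <- function(latin_name, prefix = "hom.Hs.inp") {
  paste0(prefix, species_code(latin_name))
}

print(map_object_name(c("Mus musculus", "Rattus norvegicus")))
```

If the name exists in the loaded package, get(map_object_name("Mus musculus")) would fetch the map object itself.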
1. Looking up genes directly
Suppose we want to see the homologs between human and mouse:
> library(hom.Hs.inp.db)
> as.list(hom.Hs.inpMUSMU[1:4])
$ENSP00000364178
[1] "ENSMUSP00000097561"
$ENSP00000356224
[1] "ENSMUSP00000051825"
$ENSP00000386259
[1] "ENSMUSP00000074773"
$ENSP00000271588
[1] "ENSMUSP00000074340"
Since these all belong to the annotation db family, IDs pulled from a hom db can also be looked up in the org.db packages:
> # pick an arbitrary human gene
> # load the organism annotation data for human
> library(org.Hs.eg.db)
> # get the Entrez gene ID and Ensembl protein ID for gene symbol "MSX2"
> select(org.Hs.eg.db,
+ keys="MSX2",
+ columns=c("ENTREZID","ENSEMBLPROT"),
+ keytype="SYMBOL")
SYMBOL ENTREZID ENSEMBLPROT
1 MSX2 4488 ENSP00000239243
2 MSX2 4488 ENSP00000427425
> # find the homolog
> # use the inparanoid package to get the mouse gene that is considered
> # equivalent to ensembl protein ID "ENSP00000239243"
> select(hom.Hs.inp.db,
+ keys="ENSP00000239243",
+ columns="MUS_MUSCULUS",
+ keytype="HOMO_SAPIENS")
HOMO_SAPIENS MUS_MUSCULUS
1 ENSP00000239243 ENSMUSP00000021922
> # load the organism annotation data for mouse
> library(org.Mm.eg.db)
> # get the entrez gene ID and gene Symbol for "ENSMUSP00000021922"
> select(org.Mm.eg.db,
+ keys="ENSMUSP00000021922",
+ columns=c("ENTREZID","SYMBOL"),
+ keytype="ENSEMBLPROT")
ENSEMBLPROT ENTREZID SYMBOL
1 ENSMUSP00000021922 17702 Msx2
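The three lookups above amount to a chain of joins: human symbol to human Ensembl protein ID, human protein to mouse protein, and mouse protein to mouse gene. As a minimal sketch of that idea, the block below joins three small data frames that contain only the rows printed above, so it runs in base R without any annotation package. The data frame names (hs, inp, mm) and the renamed columns are made up for illustration; with the real packages, each data frame would be the result of the corresponding select() call.

```r
# Rows copied from the select() outputs shown above (IDs kept as character).
hs <- data.frame(
  human_symbol = c("MSX2", "MSX2"),
  human_entrez = c("4488", "4488"),
  human_prot   = c("ENSP00000239243", "ENSP00000427425"),
  stringsAsFactors = FALSE
)
inp <- data.frame(
  human_prot = "ENSP00000239243",
  mouse_prot = "ENSMUSP00000021922",
  stringsAsFactors = FALSE
)
mm <- data.frame(
  mouse_prot   = "ENSMUSP00000021922",
  mouse_entrez = "17702",
  mouse_symbol = "Msx2",
  stringsAsFactors = FALSE
)

# Inner joins: human gene -> human protein -> mouse protein -> mouse gene.
# ENSP00000427425 was not queried above, so it has no row in `inp` and drops out.
chain <- merge(merge(hs, inp, by = "human_prot"), mm, by = "mouse_prot")
print(chain)
```

Using inner joins keeps only the proteins for which every step returned a hit; passing all = TRUE to merge() would keep the unmatched rows with NA instead.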
2. Querying by seed pair through the DBI interface
Connect to the database
> library(DBI)
> # make a connection to the human database
> mycon <- hom.Hs.inp_dbconn()
> # make a list of all the tables that are available in the DB
> head(dbListTables(mycon))
[1] "Acyrthosiphon_pisum" "Aedes_aegypti"
[3] "Anopheles_gambiae" "Apis_mellifera"
[5] "Arabidopsis_thaliana" "Aspergillus_fumigatus"
> # make a list of the columns in the table of interest
> dbListFields(mycon, "mus_musculus") ## columns of the mus_musculus table
[1] "inp_id" "clust_id" "species" "score"
[5] "seed_status"
Use an SQL statement to look up the cluster ID (clust_id) of ENSP00000301011 in the mus_musculus table
> # make a query that will let us see which clust_id we need
> sql <- "SELECT * FROM mus_musculus WHERE inp_id = 'ENSP00000301011';"
> #retrieve the data
> dataOut <- dbGetQuery(mycon, sql)
> dataOut
inp_id clust_id species score seed_status
1 ENSP00000301011 2084 HOMSA 1 100%
Or look up the Inparanoid IDs that belong to a given cluster ID
> #make a query that will let us see all the data that is affiliated with a clust id
> sql <- "SELECT * FROM mus_musculus WHERE clust_id = '1731';"
> #retrieve the data
> dataOut <- dbGetQuery(mycon, sql)
> dataOut
inp_id clust_id species score seed_status
1 ENSP00000273857 1731 HOMSA 1 100%
2 ENSMUSP00000005352 1731 MUSMU 1 100%
Cluster ID 1731 corresponds to the homologous protein pair ENSP00000273857 (human) and ENSMUSP00000005352 (mouse).
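The two queries above can be combined. Because every member of an orthology cluster shares one clust_id, a self-join on that column returns the mouse partner(s) of a human protein in a single statement. The sketch below runs the query on an in-memory SQLite table built from the two rows printed above, so it does not need hom.Hs.inp.db; it assumes the DBI and RSQLite packages are installed. The aliases h and m and the output column names are arbitrary choices.

```r
library(DBI)

# Toy copy of the mus_musculus table: only the two rows shown above.
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "mus_musculus", data.frame(
  inp_id      = c("ENSP00000273857", "ENSMUSP00000005352"),
  clust_id    = c(1731L, 1731L),
  species     = c("HOMSA", "MUSMU"),
  score       = c(1, 1),
  seed_status = c("100%", "100%"),
  stringsAsFactors = FALSE
))

# Self-join on clust_id: h is the human row, m is each mouse row in the same cluster.
sql <- "
  SELECT h.inp_id AS human_id, m.inp_id AS mouse_id, m.clust_id, m.score
  FROM mus_musculus AS h
  JOIN mus_musculus AS m ON m.clust_id = h.clust_id
  WHERE h.inp_id = ? AND m.species = 'MUSMU'"

# The protein ID is bound as a parameter instead of being pasted into the SQL.
partners <- dbGetQuery(con, sql, params = list("ENSP00000273857"))
print(partners)

dbDisconnect(con)
```

If the real table has the columns listed by dbListFields() above, the same SQL string could be passed to dbGetQuery(mycon, sql, params = ...) against the package database.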
Other
AnnotationDbi also provides a function, inpIDMapper, that extracts homologous proteins between species quickly and conveniently.
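A call might look like the sketch below. It follows the signature given on the AnnotationDbi manual page linked under REF (ids, srcSpecies, destSpecies, srcIDType, destIDType), with the Inparanoid five-letter codes as the species arguments. The wrapper name map_human_to_mouse is made up here. Whether inpIDMapper is available depends on the installed AnnotationDbi version, and it also needs the relevant org.*.eg.db and hom.*.inp.db packages, so the block only defines the wrapper and does not call it.

```r
# Hypothetical convenience wrapper around AnnotationDbi::inpIDMapper.
# Defaults mirror the documented ones: UniProt accessions in, Entrez Gene IDs out.
map_human_to_mouse <- function(ids, srcIDType = "UNIPROT", destIDType = "EG") {
  AnnotationDbi::inpIDMapper(
    ids,
    srcSpecies  = "HOMSA",   # Homo sapiens
    destSpecies = "MUSMU",   # Mus musculus
    srcIDType   = srcIDType,
    destIDType  = destIDType
  )
}
```

With the packages installed, map_human_to_mouse(x), where x is a character vector of human UniProt accessions, should return the matching mouse Entrez Gene IDs as a list named by the input IDs.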
REF
http://bioconductor.org/packages/release/data/annotation/manuals/hom.Hs.inp.db/man/hom.Hs.inp.db.pdf
http://www.imsbio.co.jp/RGM/Rrdfile?f=hom.Hs.inp.db/man/hom.Hs.inpORGVSORGMAP.Rd&d=RBC
https://www.rdocumentation.org/packages/AnnotationDbi/versions/1.34.4/topics/inpIDMapper