GSEApy 做富集分析及可视化

Original JunJunLab 老俊俊的生信笔记 2022-08-15

收录于合集 #python笔记 42个

点击上方关注老俊俊

1引言

一个大佬基于 python 版本写了这个工具,帮助 python 用户在 IDE 端方便富集分析及可视化:

这个工具有以下 6 个小功能:

包括了 GSEA 富集分析,单样本 GSEA 富集分析,绘制 GSEA 图,普通富集分析,ID 转换等等。

github 地址:

https://github.com/zqfang/GSEApy

2安装

# if you have conda
$ conda install -c conda-forge -c bioconda gseapy

# or use pip to install the latest release
$ pip install gseapy

# the development version from Github
$ pip install git+git://github.com/zqfang/gseapy.git#egg=gseapy

3使用示例

命令行版:

# An example to reproduce figures using replot module.
$ gseapy replot -i ./Gsea.reports -o test


# An example to run GSEA using gseapy gsea module
$ gseapy gsea -d exptable.txt -c test.cls -g gene_sets.gmt -o test

# An example to run Prerank using gseapy prerank module
$ gseapy prerank -r gsea_data.rnk -g gene_sets.gmt -o test

# An example to run ssGSEA using gseapy ssgsea module
$ gseapy ssgsea -d expression.txt -g gene_sets.gmt -o test

# An example to use enrichr api
# see details of -g below, -d  is optional
$ gseapy enrichr -i gene_list.txt -g KEGG_2016 -d pathway_enrichment -o test

python IDE 里使用:

import gseapy

# run GSEA.
gseapy.gsea(data='expression.txt', gene_sets='gene_sets.gmt', cls='test.cls', outdir='test')

# run prerank
gseapy.prerank(rnk='gsea_data.rnk', gene_sets='gene_sets.gmt', outdir='test')

# run ssGSEA
gseapy.ssgsea(data="expression.txt", gene_sets= "gene_sets.gmt", outdir='test')


# An example to reproduce figures using replot module.
gseapy.replot(indir='./Gsea.reports', outdir='test')

基于数据框,字典,列表等格式:

# assign dataframe, and use enrichr library data set 'KEGG_2016'
expression_dataframe = pd.DataFrame()

sample_name = ['A','A','A','B','B','B'] # always only two group,any names you like

# assign gene_sets parameter with enrichr library name or gmt file on your local computer.
gseapy.gsea(data=expression_dataframe, gene_sets='KEGG_2016', cls= sample_names, outdir='test')

# using prerank tool
gene_ranked_dataframe = pd.DataFrame()
gseapy.prerank(rnk=gene_ranked_dataframe, gene_sets='KEGG_2016', outdir='test')

# using ssGSEA
gseapy.ssgsea(data=ssGSEA_dataframe, gene_sets='KEGG_2016', outdir='test')

富集分析

# assign a list object to enrichr
gl = ['SCARA3', 'LOC100044683', 'CMBL', 'CLIC6', 'IL13RA1', 'TACSTD2', 'DKKL1', 'CSF1',
     'SYNPO2L', 'TINAGL1', 'PTX3', 'BGN', 'HERC1', 'EFNA1', 'CIB2', 'PMP22', 'TMEM173']

gseapy.enrichr(gene_list=gl, description='pathway', gene_sets='KEGG_2016', outdir='test')

# or a txt file path.
gseapy.enrichr(gene_list='gene_list.txt', description='pathway', gene_sets='KEGG_2016',
               outdir='test', cutoff=0.05, format='png' )

自带部分基因集

作者还提供了一些基因集:

 #see full list of latest enrichr library names, which will pass to -g parameter:
 names = gseapy.get_library_name()

 # show top 20 entries.
 print(names[:20])


['Genome_Browser_PWMs',
'TRANSFAC_and_JASPAR_PWMs',
'ChEA_2013',
'Drug_Perturbations_from_GEO_2014',
'ENCODE_TF_ChIP-seq_2014',
'BioCarta_2013',
'Reactome_2013',
'WikiPathways_2013',
'Disease_Signatures_from_GEO_up_2014',
'KEGG_2016',
'TF-LOF_Expression_from_GEO',
'TargetScan_microRNA',
'PPI_Hub_Proteins',
'GO_Molecular_Function_2015',
'GeneSigDB',
'Chromosome_Location',
'Human_Gene_Atlas',
'Mouse_Gene_Atlas',
'GO_Cellular_Component_2015',
'GO_Biological_Process_2015',
'Human_Phenotype_Ontology',]

4功能演示教程

ID 转换

# %matplotlib inline
# %config InlineBackend.figure_format='retina' # mac
# %load_ext autoreload
# %autoreload 2
import pandas as pd
import gseapy as gp
import matplotlib.pyplot as plt

转换 id:

>>> from gseapy.parser import Biomart
>>> bm = Biomart()
>>> ## view validated marts
>>> marts = bm.get_marts()
>>> ## view validated dataset
>>> datasets = bm.get_datasets(mart='ENSEMBL_MART_ENSEMBL')
>>> ## view validated attributes
>>> attrs = bm.get_attributes(dataset='hsapiens_gene_ensembl')
>>> ## view validated filters
>>> filters = bm.get_filters(dataset='hsapiens_gene_ensembl')
>>> ## query results
>>> queries = ['ENSG00000125285','ENSG00000182968'] # need to be a python list
>>> results = bm.query(dataset='hsapiens_gene_ensembl',
                       attributes=['ensembl_gene_id', 'external_gene_name', 'entrezgene_id', 'go_id'],
                       filters={'ensemble_gene_id': queries})

Enrichr 富集分析

# read in an example gene list
gene_list = pd.read_csv("./tests/data/gene_list.txt",header=None, sep="\t")
gene_list.head()

转为 list:

# convert dataframe or series to list
glist = gene_list.squeeze().str.strip().tolist()
print(glist[:10])

['IGKV4-1', 'CD55', 'IGKC', 'PPFIBP1', 'ABHD4', 'PCSK6', 'PGD', 'ARHGDIB', 'ITGB2', 'CARD6']

查看数据库:

names = gp.get_library_name() # default: Human
names[:10]

['ARCHS4_Cell-lines',
 'ARCHS4_IDG_Coexp',
 'ARCHS4_Kinases_Coexp',
 'ARCHS4_TFs_Coexp',
 'ARCHS4_Tissues',
 'Achilles_fitness_decrease',
 'Achilles_fitness_increase',
 'Aging_Perturbations_from_GEO_down',
 'Aging_Perturbations_from_GEO_up',
 'Allen_Brain_Atlas_10x_scRNA_2021']

选择物种 { ‘Human’, ‘Mouse’, ‘Yeast’, ‘Fly’, ‘Fish’, ‘Worm’ }:

yeast = gp.get_library_name(organism='Yeast')
yeast[:10]

['Cellular_Component_AutoRIF',
 'Cellular_Component_AutoRIF_Predicted_zscore',
 'GO_Biological_Process_2018',
 'GO_Biological_Process_AutoRIF',
 'GO_Biological_Process_AutoRIF_Predicted_zscore',
 'GO_Cellular_Component_2018',
 'GO_Cellular_Component_AutoRIF',
 'GO_Cellular_Component_AutoRIF_Predicted_zscore',
 'GO_Molecular_Function_2018',
 'GO_Molecular_Function_AutoRIF']

可以支持传入多个基因集合,用逗号或者列表储存,富集分析:

# run enrichr
# if you are only intrested in dataframe that enrichr returned, please set no_plot=True

# list, dataframe, series inputs are supported
enr = gp.enrichr(gene_list="./tests/data/gene_list.txt",
                 gene_sets=['KEGG_2016','KEGG_2013'],
                 organism='Human', # don't forget to set organism to the one you desired! e.g. Yeast
                 description='test_name',
                 outdir='test/enrichr_kegg',
                 # no_plot=True,
                 cutoff=0.5 # test dataset, use lower value from range(0,1)
                )

查看结果:

# obj.results stores all results
enr.results.head(5)

使用 gmt 文件富集分析:

enr2 = gp.enrichr(gene_list="./tests/data/gene_list.txt",
                 # or gene_list=glist
                 description='test_name',
                 gene_sets="./tests/data/genes.gmt",
                 background='hsapiens_gene_ensembl', # or the number of genes, e.g 20000
                 outdir='test/enrichr_kegg2',
                 cutoff=0.5, # only used for testing.
                 verbose=True)

2021-11-15 10:12:51,199 Connecting to Enrichr Server to get latest library names
2021-11-15 10:12:51,200 User Defined gene sets is given: ./tests/data/genes.gmt
2021-11-15 10:12:51,221 Using all annotated genes with GO_ID as background: hsapiens_gene_ensembl
2021-11-15 10:12:51,226 Background: found 19041 genes
2021-11-15 10:12:51,231 Save file of enrichment results: Job Id:140079037901232
2021-11-15 10:12:51,363 Done.

enr2.results.head(5)

绘图:

# simple plotting function
from gseapy.plot import barplot, dotplot

# to save your figure, make sure that ``ofname`` is not None
barplot(enr.res2d,title='KEGG_2013',)

点图:

# to save your figure, make sure that ``ofname`` is not None
dotplot(enr.res2d, title='KEGG_2013',cmap='viridis_r')

Prerank

读取排序好的数据,可以用差异结果的基因名和 log2FoldChange:

rnk = pd.read_csv("./tests/data/edb/gsea_data.gsea_data.rnk", header=None, sep="\t")
rnk.head()

富集:

# run prerank
# enrichr libraries are supported by prerank module. Just provide the name
# use 4 process to acceralate the permutation speed

# note: multiprocessing may not work on windows
pre_res = gp.prerank(rnk=rnk, gene_sets='KEGG_2016',
                     processes=4,
                     permutation_num=100, # reduce number to speed up testing
                     outdir='test/prerank_report_kegg', format='png', seed=6)

查看结果:

#access results through obj.res2d attribute or obj.results
pre_res.res2d.sort_index().head()

查看通路名称:

# extract geneset terms in res2d
terms = pre_res.res2d.index
terms

Index(['Pathways in cancer Homo sapiens hsa05200',
       'Cytokine-cytokine receptor interaction Homo sapiens hsa04060',
       'HTLV-I infection Homo sapiens hsa05166',
       'MAPK signaling pathway Homo sapiens hsa04010',
       'Rap1 signaling pathway Homo sapiens hsa04015',
       'PI3K-Akt signaling pathway Homo sapiens hsa04151',
       'Focal adhesion Homo sapiens hsa04510',
       'Ras signaling pathway Homo sapiens hsa04014',
       'Metabolic pathways Homo sapiens hsa01100'],
      dtype='object', name='Term')

绘制 gsea 图:

## easy way
from gseapy.plot import gseaplot

# to save your figure, make sure that ofname is not None
gseaplot(rank_metric=pre_res.ranking, term=terms[0], **pre_res.results[terms[0]])

# save figure
# gseaplot(rank_metric=pre_res.ranking, term=terms[0], ofname='your.plot.pdf', **pre_res.results[terms[0]])

读取gct 格式的数据,样本分组:

phenoA, phenoB, class_vector =  gp.parser.gsea_cls_parser("./tests/data/P53.cls")

#class_vector used to indicate group attributes for each sample
print(class_vector)

['MUT', 'MUT', 'MUT', 'MUT', 'MUT', 'MUT', 'MUT', 'MUT', 'MUT', 'MUT', 'MUT', 'MUT', 'MUT', 'MUT', 'MUT', 'MUT', 'MUT', 'MUT', 'MUT', 'MUT', 'MUT', 'MUT', 'MUT', 'MUT', 'MUT', 'MUT', 'MUT', 'MUT', 'MUT', 'MUT', 'MUT', 'MUT', 'MUT', 'WT', 'WT', 'WT', 'WT', 'WT', 'WT', 'WT', 'WT', 'WT', 'WT', 'WT', 'WT', 'WT', 'WT', 'WT', 'WT', 'WT']

表达量矩阵:

gene_exp = pd.read_csv("./tests/data/P53.txt", sep="\t")
gene_exp.head()

查看关联:

print("positively correlated: ", phenoA)
positively correlated:  MUT

print("negtively correlated: ", phenoB)
negtively correlated:  WT

富集分析:

# run gsea
# enrichr libraries are supported by gsea module. Just provide the name

gs_res = gp.gsea(data=gene_exp, # or data='./P53_resampling_data.txt'
                 gene_sets='KEGG_2016', # enrichr library names
                 cls= './tests/data/P53.cls', # cls=class_vector
                 # set permutation_type to phenotype if samples >=15
                 permutation_type='phenotype',
                 permutation_num=100, # reduce number to speed up test
                 outdir=None,  # do not write output to disk
                 no_plot=True, # Skip plotting
                 method='signal_to_noise',
                 processes=4, seed= 7,
                 format='png')

#access the dataframe results throught res2d attribute
gs_res.res2d.sort_index().head()

绘制热图:

# plotting heatmap
genes = gs_res.res2d.genes[0].split(";")
# Make sure that ``ofname`` is not None, if you want to save your figure to disk
heatmap(df = gs_res.heatmat.loc[genes], z_score=0, title=terms[0], figsize=(18,6))

5结尾

剩下的大家可以看看教程。

https://gseapy.readthedocs.io/en/master/gseapy_example.html

欢迎加入生信交流群。加我微信我也拉你进微信群聊 老俊俊生信交流群 哦,数据代码已上传至QQ群,欢迎加入下载。

群二维码：

老俊俊微信：

知识星球：

所以今天你学习了吗？

今天的分享就到这里了，敬请期待下一篇！

最后欢迎大家分享转发，您的点赞是对我的鼓励和肯定！

如果觉得对您帮助很大，赏杯快乐水喝喝吧！

往期回顾

◀pysam 读取 bam 文件准备 Ribo-seq 质控数据

◀sankeywheel 绘制唯美桑基图

◀ggplot 绘制小提琴图+箱线图+统计检验

◀Ribo-seq 数据上游分析

◀关于序列提取的问题思考

◀跟着 Cell Reports 学做图-CLIP-seq 数据可视化

◀m6A enrichment on peak center

◀m6A metagene distribution 纠正坐标

◀ggplot 绘制箱线图

◀python 查找字符串

◀...

李尚福、魏凤和双双被拿下，与美国一份报告是否有关？

有的人走了，却永远活着

圈内疯传某谣言

不要放过这些人渣

最大“独立日烟花秀”存恐怖威胁纽约市警全力部署

GSEApy 做富集分析及可视化

1引言

2安装

3使用示例

命令行版:

python IDE 里使用:

基于数据框,字典,列表等格式:

富集分析

自带部分基因集

4功能演示教程

ID 转换

Enrichr 富集分析

Prerank

5结尾

今天的分享就到这里了，敬请期待下一篇！

您可能也对以下帖子感兴趣

李尚福、魏凤和双双被拿下，与美国一份报告是否有关？

有的人走了，却永远活着

圈内疯传某谣言

不要放过这些人渣

最大“独立日烟花秀”存恐怖威胁 纽约市警全力部署

生成图片，分享到微信朋友圈

GSEApy 做富集分析及可视化

1引言

2安装

3使用示例

命令行版:

python IDE 里使用:

基于数据框,字典,列表等格式:

富集分析

自带部分基因集

4功能演示教程

ID 转换

Enrichr 富集分析

Prerank

5结尾

今天的分享就到这里了，敬请期待下一篇！

您可能也对以下帖子感兴趣

最大“独立日烟花秀”存恐怖威胁纽约市警全力部署