Reproducing the Full Data Analysis Workflow of a Proteomics Paper: Pick One of Two (Forming a Study Group)

生信技能树 (Bioinformatics Skill Tree) 2022-06-06

Remember the 生信技能树 "portal" series? Transcriptomics, methylation, ChIP-Seq, lncRNA, hands-on programming, and Hi-C, which is already being updated…

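All of them were very successful. They trained a great many outstanding community members and built the largest bioinformatics community in the Chinese-speaking world. I also recorded complete videos for each of these omics tutorials and published them for free on Bilibili, where some hundred thousand people can learn from them: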

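As the internet's top full-stack bioinformatics engineer, my skill set is fairly complete, but there are still areas I cannot cover. Proteomics and metabolomics, for example, are fields I have rarely touched. I have long been looking for people who share my philosophy and can carry the banner forward. After a six-month evaluation, I found two experienced engineers who are willing to share their knowledge in full and lead everyone through the material. To get things started, I am sharing two papers whose analysis workflows are candidates for reproduction. Pick one of the two. We will formally announce the study group next week!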

Comparing gene expression changes measured by transcriptomics and by proteomics during mouse placental development

Proteomics data analysis

The raw data are public:

Raw data files and MaxQuant search results have been deposited in the Mass Spectrometry Interactive Virtual Environment (MassIVE) repository: https://massive.ucsd.edu/ProteoSAFe/static/massive.jsp with data set identifier: MSV000082849.

First, obtain the expression matrix:
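As a rough illustration of this step, here is a minimal sketch of turning a MaxQuant `proteinGroups.txt` table into a log2 expression matrix. It assumes label-free quantification with `LFQ intensity <sample>` columns and the usual `Reverse`, `Potential contaminant`, and `Only identified by site` flag columns. The paper's actual quantification mode and column names may differ, and they also vary between MaxQuant versions. The protein IDs, sample names, and the `load_expression_matrix` helper are made up for the example.

```python
import csv
import io
import math

# Tiny stand-in for a MaxQuant proteinGroups.txt file (tab-separated).
# Protein IDs and sample names are made up for illustration.
DEMO = (
    "Protein IDs\tLFQ intensity E7.5_1\tLFQ intensity E9.5_1\t"
    "Reverse\tPotential contaminant\tOnly identified by site\n"
    "P1;P2\t1024\t4096\t\t\t\n"
    "P3\t2048\t0\t\t\t\n"
    "REV__P4\t512\t512\t+\t\t\n"
    "CON__P5\t256\t256\t\t+\t\n"
)

# Assumed column names; check them against your own MaxQuant output.
FLAG_COLUMNS = ("Reverse", "Potential contaminant", "Only identified by site")
PREFIX = "LFQ intensity "


def load_expression_matrix(handle):
    """Return (sample_names, {protein_id: [log2 intensity or None, ...]})."""
    reader = csv.DictReader(handle, delimiter="\t")
    sample_columns = [c for c in reader.fieldnames if c.startswith(PREFIX)]
    samples = [c[len(PREFIX):] for c in sample_columns]
    matrix = {}
    for row in reader:
        # Drop decoy hits, contaminants and site-only identifications.
        if any(row.get(flag, "") == "+" for flag in FLAG_COLUMNS):
            continue
        values = []
        for column in sample_columns:
            intensity = float(row[column])
            # A 0 intensity means "not quantified"; treat it as missing.
            values.append(math.log2(intensity) if intensity > 0 else None)
        matrix[row["Protein IDs"]] = values
    return samples, matrix


samples, matrix = load_expression_matrix(io.StringIO(DEMO))
print(samples)
print(matrix)
```

With a real file, you would pass `open("proteinGroups.txt", newline="")` instead of the in-memory demo string.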

Then run quality control on the expression matrix:
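Which QC checks the authors ran is not stated here, so the following is only a sketch of three common ones: the fraction of missing values per sample, median centering, and the pairwise Pearson correlation between samples. The toy matrix and all function names are assumptions made for illustration.

```python
import math
import statistics

# Toy log2 matrix: rows are proteins, columns are samples; None marks a missing value.
samples = ["A_1", "A_2", "B_1"]
matrix = {
    "prot1": [10.0, 11.0, 12.0],
    "prot2": [12.0, 13.0, None],
    "prot3": [14.0, 15.0, 16.0],
}


def column(matrix, j):
    """All values of sample j, in row order."""
    return [row[j] for row in matrix.values()]


def missing_fraction(values):
    """Share of proteins without a quantified value in one sample."""
    return sum(v is None for v in values) / len(values)


def median_center(matrix, n_samples):
    """Subtract each sample's median so all samples share a median of 0."""
    medians = [
        statistics.median(v for v in column(matrix, j) if v is not None)
        for j in range(n_samples)
    ]
    return {
        pid: [None if v is None else v - medians[j] for j, v in enumerate(row)]
        for pid, row in matrix.items()
    }


def pearson(x, y):
    """Pearson correlation over positions where both samples have a value."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    mean_x = sum(a for a, _ in pairs) / len(pairs)
    mean_y = sum(b for _, b in pairs) / len(pairs)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    var_x = sum((a - mean_x) ** 2 for a, _ in pairs)
    var_y = sum((b - mean_y) ** 2 for _, b in pairs)
    return cov / math.sqrt(var_x * var_y)


for j, name in enumerate(samples):
    print(name, "missing fraction:", missing_fraction(column(matrix, j)))
centered = median_center(matrix, len(samples))
print(centered)
print("r(A_1, A_2) =", pearson(column(matrix, 0), column(matrix, 1)))
```

Samples with an unusually high missing fraction, or with a low correlation to their replicates, are the ones worth inspecting before any differential analysis.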

The resulting IDs also need to be converted:

The Ensembl IDs provided by MaxQuant were converted to Mouse Genome Informatics (MGI) symbols
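The conversion itself is a table lookup. The sketch below assumes you already have an Ensembl-to-MGI mapping, for instance one exported from Ensembl BioMart or MGI. The identifiers in it are placeholders rather than real IDs, and the rule of keeping only protein groups that map to exactly one symbol is my own simplification, not necessarily what the authors did.

```python
# Hypothetical lookup table; in practice it would be exported from
# Ensembl BioMart or MGI. The identifiers below are placeholders, not real IDs.
ENSEMBL_TO_MGI = {
    "ENS_PLACEHOLDER_1": "GeneA",
    "ENS_PLACEHOLDER_2": "GeneB",
}


def to_symbols(protein_group, mapping):
    """Map a semicolon-separated protein group to unique symbols, in order."""
    symbols = []
    for ensembl_id in protein_group.split(";"):
        # Strip a version suffix such as ".1" before the lookup.
        symbol = mapping.get(ensembl_id.split(".")[0])
        if symbol is not None and symbol not in symbols:
            symbols.append(symbol)
    return symbols


def convert_matrix(matrix, mapping):
    """Re-key a matrix by symbol; keep unambiguous groups, report the rest."""
    converted, skipped = {}, []
    for protein_group, values in matrix.items():
        symbols = to_symbols(protein_group, mapping)
        if len(symbols) == 1 and symbols[0] not in converted:
            converted[symbols[0]] = values
        else:
            skipped.append(protein_group)
    return converted, skipped


matrix = {
    "ENS_PLACEHOLDER_1.1;ENS_PLACEHOLDER_1": [10.0],
    "ENS_PLACEHOLDER_2": [11.0],
    "ENS_UNKNOWN": [9.0],
    "ENS_PLACEHOLDER_1;ENS_PLACEHOLDER_2": [8.0],
}
converted, skipped = convert_matrix(matrix, ENSEMBL_TO_MGI)
print(converted)
print(skipped)
```

Reporting the skipped groups, rather than dropping them silently, makes it easy to see how many proteins are lost in the conversion.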

The workflow is as follows:

Figure 1. (a) Whole proteome summary of acquired spectra, peptides, and proteins.

Note the difference between the whole proteome and the phosphoproteome here:

Differential analysis

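This step does not differ substantially from transcriptome analysis, since both start from an expression matrix. The volcano plot is shown below: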
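To make "based on the expression matrix" concrete, here is a minimal sketch that computes a log2 fold change and a Welch t statistic per protein from log2 intensities. This is a generic two-group comparison, not necessarily the test the authors used. It ignores missing values and assumes each group has nonzero variance. The toy matrix, group layout, and function names are assumptions. Turning the t statistic into a p-value for the volcano plot's y-axis needs the t distribution (for example `scipy.stats.t`), which is left out to keep the example dependency-free.

```python
import math
import statistics


def welch_t(group_a, group_b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    var_a = statistics.variance(group_a) / len(group_a)
    var_b = statistics.variance(group_b) / len(group_b)
    t = (statistics.mean(group_b) - statistics.mean(group_a)) / math.sqrt(var_a + var_b)
    df = (var_a + var_b) ** 2 / (
        var_a ** 2 / (len(group_a) - 1) + var_b ** 2 / (len(group_b) - 1)
    )
    return t, df


def differential_table(matrix, idx_a, idx_b):
    """For each protein, compute log2 fold change (B minus A) and Welch t."""
    table = {}
    for protein, values in matrix.items():
        a = [values[i] for i in idx_a]
        b = [values[i] for i in idx_b]
        # On log2 data, the difference of means is the log2 fold change.
        log2_fc = statistics.mean(b) - statistics.mean(a)
        t, df = welch_t(a, b)
        table[protein] = {"log2FC": log2_fc, "t": t, "df": df}
    return table


# Toy log2 intensities: columns 0-2 are condition A, columns 3-5 are condition B.
matrix = {
    "up": [9.0, 10.0, 11.0, 12.0, 13.0, 14.0],
    "flat": [9.0, 10.0, 11.0, 11.0, 10.0, 9.0],
}
table = differential_table(matrix, idx_a=[0, 1, 2], idx_b=[3, 4, 5])
for protein, stats in table.items():
    print(protein, stats)
```

A volcano plot then puts `log2FC` on the x-axis and the negative log10 of the p-value on the y-axis.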

Functional database annotation

The authors show Gene Ontology Biological Process (GOBP) terms:
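Which enrichment tool the authors used is not stated here. A common way to score a GO term is over-representation analysis with a hypergeometric upper tail, sketched below under that assumption. The gene names, the term membership, and the `enrich` helper are all hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) when drawing n genes from N, of which K carry the term."""
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return hits / total


def enrich(selected, term_genes, background):
    """Return (overlap size, p-value) for one term, restricted to the background."""
    background = set(background)
    selected = set(selected) & background
    term_genes = set(term_genes) & background
    k = len(selected & term_genes)
    p = hypergeom_upper_tail(k, len(selected), len(term_genes), len(background))
    return k, p


# Toy example: 20 background genes, 5 annotated to the term, 4 selected, 3 of them hits.
background = {f"g{i}" for i in range(20)}
term_genes = {"g0", "g1", "g2", "g3", "g4"}
selected = {"g0", "g1", "g2", "g10"}
k, p = enrich(selected, term_genes, background)
print(k, p)
```

In a real analysis the background should be the set of detected proteins, not the whole genome, and the p-values across terms need multiple-testing correction.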

Joint analysis of proteomics and transcriptomics

The main question is how much the results of the two technologies overlap:

Overlap of upregulated proteins (UPs) and upregulated transcripts (UTs) for genes detected at both the protein and transcript levels at E7.5 (a) and at E9.5 (b).

The key conclusion also rests on this Venn diagram:

We focused this analysis on the 6170 genes that were detected in both the proteome data set and the transcriptome data set.

1178 of these genes were upregulated at E7.5 at the protein or transcript level (Figure 2a), of which only 21.3% were upregulated at both levels.
At E9.5, 1295 genes were upregulated at the protein or transcript level (Figure 2b), of which only 24.7% were upregulated at both levels (Figure 2b).
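The percentages quoted above are relative to the union: genes upregulated at either level form the denominator, and genes upregulated at both levels form the numerator. A minimal sketch of that calculation with plain sets follows. The gene names are toy values and `overlap_summary` is a hypothetical helper, not the authors' code.

```python
def overlap_summary(up_proteins, up_transcripts, detected_in_both):
    """Venn-style counts for genes detected at both protein and transcript level."""
    detected = set(detected_in_both)
    up_p = set(up_proteins) & detected
    up_t = set(up_transcripts) & detected
    either = up_p | up_t  # upregulated at the protein OR transcript level
    both = up_p & up_t    # upregulated at both levels
    return {
        "either": len(either),
        "both": len(both),
        "protein_only": len(up_p - up_t),
        "transcript_only": len(up_t - up_p),
        "percent_both": 100.0 * len(both) / len(either) if either else 0.0,
    }


# Toy data: "z" is upregulated as a protein but was not detected as a transcript,
# so it is excluded before counting.
detected = {"a", "b", "c", "d", "e", "f"}
summary = overlap_summary({"a", "b", "c", "z"}, {"c", "d"}, detected)
print(summary)
```

Restricting both lists to the genes detected by both technologies first matters, because otherwise a gene that one platform simply missed would be counted as a disagreement.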

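This once again shows that gene expression measured at the RNA level does not faithfully reflect abundance at the protein level.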

Now for the second paper:

It studies ARID1A-mutated ovarian clear cell carcinoma (OCCC).

It therefore uses OVISE, an ARID1A-mutated OCCC cell line.

It also knocks out ARID1A in OVCA429, an ovarian clear cell carcinoma cell line that is wild-type for ARID1A.

Proteomics data analysis

The raw data are public and available through two major repositories:

Data Availability: The mass spectrometry proteomics data have been deposited to the ProteomeXchange Consortium (http://proteomecentral.proteomexchange.org) via the MassIVE partner repository (http://massive.ucsd.edu/ProteoSAFe/static/massive.jsp) with the data set identifier PXD004570.

First, obtain the expression matrix. Each of the two cell lines has 6 data sets, for 12 in total:

Then run quality control on the expression matrix: