
How Many Times Can One Public Dataset Be Mined?

生信技能树 2022-06-06

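I Was Too Naive

Someone in the student group asked about processing data from the Agilent-038314 CBC Homo sapiens lncRNA + mRNA microarray V2.0 expression chip. The main issue, of course, is mapping the chip's probe IDs to gene names. The platform page is: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL18109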

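Since everyone in the group is still a beginner, I wanted to temper expectations first. I said that a chip like this is hard to work with and that surely few people have mined it, because the platform provides only the probes' nucleotide sequences, and obtaining annotation for the chip takes a time-consuming, labor-intensive workflow.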
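Because the platform supplies only probe sequences, a re-annotation workflow usually starts by exporting those sequences to FASTA so they can be aligned against a reference transcriptome. The sketch below covers only that first step and is not the workflow of any particular paper. The column names `ID` and `SEQUENCE`, the function name `probes_to_fasta`, and the toy rows are assumptions for illustration; check the actual header of the downloaded platform table before using it.

```python
import csv
import io


def probes_to_fasta(table_text, id_col="ID", seq_col="SEQUENCE"):
    """Convert a tab-delimited probe table into FASTA text.

    Column names are assumptions; rows without a sequence
    (for example, control probes) are skipped.
    """
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    records = []
    for row in reader:
        seq = (row.get(seq_col) or "").strip().upper()
        if not seq:
            continue  # nothing to align for this probe
        records.append(f">{row[id_col]}\n{seq}")
    return "\n".join(records) + ("\n" if records else "")


# Toy table with made-up probe IDs and sequences (not real GPL18109 rows).
example = "ID\tSEQUENCE\nprobe_1\tACGTACGT\nprobe_2\t\nprobe_3\tttgacc\n"
fasta = probes_to_fasta(example)
print(fasta)
```

The resulting FASTA text can then be passed to whichever aligner you prefer. Probes that map uniquely to an annotated mRNA or lncRNA transcript are the ones that end up with a gene name.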

 

For example, take the 2019 paper "LncRNA and mRNA integration network reconstruction reveals novel key regulators in esophageal squamous-cell carcinoma": https://doi.org/10.1016/j.ygeno.2018.01.003

It put this chip through a very involved re-annotation:

 

Microarray data contained 71,584 probes. After applying the criteria for re-annotation, 39,068 of the probes were retained, among which 20,323 were corresponding to mRNAs and 18,745 to lncRNA.

Finally, these probes still need to be de-duplicated, which gives: These probes were mapped to 25,018 unique genes, including 13,490 protein coding genes (PCGs) and 11,528 lncRNAs.
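De-duplication means collapsing several probes that map to the same gene into a single row. The quoted text does not say which rule the paper used, so the sketch below applies one common convention as an assumption: for each gene, keep the probe with the highest mean expression. The names `collapse_probes`, `expr`, and `probe_to_gene`, and all values, are hypothetical.

```python
from statistics import mean


def collapse_probes(expr, probe_to_gene):
    """Keep, for each gene, the probe with the highest mean expression.

    expr: dict mapping probe ID -> list of expression values across samples
    probe_to_gene: dict mapping probe ID -> gene symbol
                   (probes without annotation are simply absent)
    Returns a dict mapping gene symbol -> (probe ID, values).
    """
    best = {}
    for probe, values in expr.items():
        gene = probe_to_gene.get(probe)
        if gene is None:
            continue  # probe was dropped during re-annotation
        if gene not in best or mean(values) > mean(best[gene][1]):
            best[gene] = (probe, values)
    return best


# Toy data: p1 and p2 both map to GENE_A, p4 has no annotation.
expr = {"p1": [5.0, 6.0], "p2": [7.0, 8.0], "p3": [2.0, 3.0], "p4": [9.0, 9.0]}
probe_to_gene = {"p1": "GENE_A", "p2": "GENE_A", "p3": "GENE_B"}
collapsed = collapse_probes(expr, probe_to_gene)
print(collapsed)
```

Other rules, such as taking the per-gene median or the probe with the highest variance, are also in use. Which one fits depends on the downstream analysis.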

What is more, the paper has already compiled all of this data and provides it in its supplementary files.

 

So are you still worried about how to do ID conversion for a chip like this?

What Surprised Me

Out of professional habit, I took a look at the page for this dataset: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE53625

To my surprise, it is linked to five publications:

  • Li J, Chen Z, Tian L, Zhou C et al. LncRNA profile study reveals a three-lncRNA signature associated with the survival of patients with oesophageal squamous cell carcinoma. Gut 2014 Nov;63(11):1700-10. PMID: 24522499
  • Li W, Zhang L, Guo B, Deng J et al. Exosomal FMR1-AS1 facilitates maintaining cancer stem-like cell dynamic equilibrium via TLR7/NFκB/c-Myc signaling in female esophageal carcinoma. Mol Cancer 2019 Feb 8;18(1):22. PMID: 30736860
  • Li Y, Lu Z, Che Y, Wang J et al. Immune signature profiling identified predictive and prognostic factors for esophageal squamous cell carcinoma. Oncoimmunology 2017;6(11):e1356147. PMID: 29147607
  • Shi X, Chen Z, Hu X, Luo M et al. AJUBA promotes the migration and invasion of esophageal squamous cell carcinoma cells through upregulation of MMP10 and MMP13 expression. Oncotarget 2016 Jun 14;7(24):36407-36418. PMID: 27172796
  • Liu J, Wang Y, Chu Y, Xu R et al. Identification of a TLR-Induced Four-lncRNA Signature as a Novel Prognostic Biomarker in Esophageal Carcinoma. Front Cell Dev Biol 2020;8:649. PMID: 32850794

Out of curiosity, I also searched Google for this dataset.

 

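A quick look turned up at least several dozen data-mining papers, all built on this one dataset and each mining it from a different angle. It is practically a living history of data mining!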
