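The complete TCGA pan-cancer data collection (use it for all your future TCGA data mining)
Alongside those 28 integrative TCGA data-mining papers, the team released a carefully curated, complete set of TCGA data for download. It is the product of the TCGA pan-cancer project, with all the omics data already organized, for example: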
gene and protein expression
copy number
DNA methylation
somatic mutation
Downloading all the files
The link is https://gdc.cancer.gov/about-data/publications/pancanatlas :
RNA (Final) - EBPlusPlusAdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.tsv
RPPA (Final) - TCGA-RPPA-pancan-clean.txt
DNA Methylation (450K Only) - jhu-usc.edu_PANCAN_HumanMethylation450.betaValue_whitelisted.tsv
DNA Methylation (Merged 27K+450K Only) - jhu-usc.edu_PANCAN_merged_HumanMethylation27_HumanMethylation450.betaValue_whitelisted.tsv
miRNA (Batch Effects Normalized miRNA data)
Sample List - PanCanAtlas_miRNA_sample_information_list.txt
Protocol Platform - pancanMiRs_EBadjOnProtocolPlatformWithoutRepsWithUnCorrectMiRs_08_04_16.csv
Copy Number - broad.mit.edu_PANCAN_Genome_Wide_SNP_6_whitelisted.seg
ABSOLUTE-annotated MAF - TCGA_consolidated.abs_mafs_truncated.fixed.txt.gz
ABSOLUTE-annotated seg file - TCGA_mastercalls.abs_segtabs.fixed.txt
ABSOLUTE purity/ploidy file - TCGA_mastercalls.abs_tables_JSedit.fixed.txt
Mutations - mc3.v0.2.8.PUBLIC.maf.gz
TCGA-Clinical Data Resource (CDR) Outcome* - TCGA-CDR-SupplementalTableS1.xlsx
The clinical information has also been re-checked. The CDR is a curated resource of the clinical annotations for TCGA data, and it provides recommendations for the use of clinical endpoints.
* It is strongly recommended that this file be used first for clinical elements and survival outcome data; for more details, see the TCGA-CDR paper.
Clinical with Follow-up - clinical_PANCAN_patient_with_followup.tsv
Merged Sample Quality Annotations - merged_sample_quality_annotations.tsv
PARADIGM Pathway Inference Matrix - merge_merged_reals.tar.gz
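Most of these tables are keyed by TCGA barcodes: the data matrices are labelled per sample or aliquot, while clinical tables such as the CDR are keyed per patient. Below is a minimal Python sketch for splitting a barcode so the two kinds of table can be joined. The barcode layout it relies on (a three-field patient prefix, then a two-digit sample type code where 01-09 means tumor and 10-19 means normal) is the usual TCGA convention and is an assumption here, not something stated in this post. The function name and example barcodes are illustrative.

```python
from typing import NamedTuple


class Barcode(NamedTuple):
    patient: str      # e.g. "TCGA-02-0001", the key used by patient-level tables
    sample_type: str  # two-digit code, e.g. "01"
    is_tumor: bool    # codes 01-09 (assumed TCGA convention)
    is_normal: bool   # codes 10-19 (assumed TCGA convention)


def parse_barcode(barcode: str) -> Barcode:
    """Split a sample- or aliquot-level TCGA barcode into patient and sample type."""
    parts = barcode.strip().split("-")
    if len(parts) < 4 or parts[0] != "TCGA":
        raise ValueError(f"not a sample-level TCGA barcode: {barcode!r}")
    patient = "-".join(parts[:3])
    code = parts[3][:2]  # drop the vial letter, e.g. "01C" -> "01"
    if not code.isdigit():
        raise ValueError(f"unexpected sample field in barcode: {barcode!r}")
    number = int(code)
    return Barcode(patient, code, 1 <= number <= 9, 10 <= number <= 19)


if __name__ == "__main__":
    # Illustrative barcodes, not taken from the files above.
    for example in ("TCGA-02-0001-01C-01R-0177-07", "TCGA-A1-A0SB-11A"):
        print(parse_barcode(example))
```

Grouping the columns of a matrix by `patient` and keeping only those with `is_tumor` set is a common first step before merging with a patient-level clinical table.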
RNA-seq data
Here is an RNA-seq expression matrix that has already been batch-corrected and normalized.
File: EB++AdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.tsv
Contains batch normalized RNASeqV2 mRNA data.
20531 genes (rows) x 11069 samples (columns). ~1.6 GB file size.
File: EB++GeneExpAnnotation.tsv
Contains annotations about exactly which samples were adjusted and which weren't
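At roughly 1.6 GB, the matrix does not have to be loaded whole when only a few genes are of interest. The sketch below streams the file with the Python standard library and keeps only the requested rows. It assumes the file is tab-separated, with sample barcodes in the header row and a gene label of the form `SYMBOL|ENTREZ_ID` (for example `TP53|7157`) in the first column; verify this against the downloaded file. The function name and the toy table are illustrative.

```python
import csv
import io
from typing import Dict, Iterable, List, Optional, TextIO, Tuple

Row = List[Optional[float]]


def read_genes(handle: TextIO, symbols: Iterable[str]) -> Tuple[List[str], Dict[str, Row]]:
    """Stream a genes-by-samples TSV and keep only the requested gene symbols."""
    wanted = set(symbols)
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[1:]  # the first header cell labels the gene column
    rows: Dict[str, Row] = {}
    for record in reader:
        if not record:
            continue
        symbol = record[0].split("|")[0]  # assumed "SYMBOL|ENTREZ_ID" row labels
        if symbol in wanted:
            # "NA" marks missing values; keep them as None
            rows[symbol] = [None if v in ("NA", "") else float(v) for v in record[1:]]
    return samples, rows


if __name__ == "__main__":
    # Toy stand-in for the real file. For the download, pass instead:
    # open("EBPlusPlusAdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.tsv", newline="")
    toy = io.StringIO(
        "gene_id\tTCGA-AA-0001-01A\tTCGA-AA-0002-01A\n"
        "TP53|7157\t1.5\tNA\n"
        "EGFR|1956\t2\t3\n"
    )
    samples, rows = read_genes(toy, ["TP53"])
    print(samples)
    print(rows)
```

Reading row by row keeps memory use proportional to the number of genes requested rather than to the full 20531 x 11069 table.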
Adjustment procedure:
All Hi-Seq data from UNC were unchanged, with the exception of PRAD (prostate)
All data from BCGSC, whether Hi-Seq or GA, were unchanged
PRAD batch IDs 312 and 320 were adjusted to remove batch effects. Remaining PRAD data were unchanged. See PCA-plus plot BEFORE correction and the justification for correction
All GA samples from UNC were adjusted to remove platform effects between UNC Hi-Seq and GA samples. The tumor types containing UNC GA samples that were adjusted are UCEC, COAD, and READ.
Genes with mostly zero reads or with residual batch effects (approx. 2-3k or 10% of genes) were removed from the adjusted samples and replaced with NAs. No genes were removed from samples with "No Change" status.
Genes were adjusted using a novel algorithm called EB++, a variant of the Empirical Bayes/ComBat algorithm with training/testing features added.
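Because genes removed from the adjusted samples are stored as NA, any analysis that mixes adjusted and unchanged samples has to decide how to treat those gaps. The sketch below shows one simple option: dropping genes whose fraction of missing values exceeds a threshold. It is not part of the EB++ procedure described above. The threshold, the function name, the toy values, and the in-memory representation (a dict mapping each gene to a list of values, with `None` for NA) are all illustrative assumptions.

```python
from typing import Dict, List, Optional

Matrix = Dict[str, List[Optional[float]]]


def drop_sparse_genes(matrix: Matrix, max_na_fraction: float = 0.0) -> Matrix:
    """Keep genes whose share of missing (None) values is at most max_na_fraction."""
    kept: Matrix = {}
    for gene, values in matrix.items():
        if not values:
            continue  # a gene with no samples carries no information
        na_fraction = sum(v is None for v in values) / len(values)
        if na_fraction <= max_na_fraction:
            kept[gene] = values
    return kept


if __name__ == "__main__":
    # Toy values; None stands for an NA left by the batch adjustment.
    toy: Matrix = {
        "GENE_A": [1.0, 2.0, 3.0, 4.0],
        "GENE_B": [1.0, None, 3.0, 4.0],
        "GENE_C": [None, None, None, 4.0],
    }
    print(sorted(drop_sparse_genes(toy)))                        # complete genes only
    print(sorted(drop_sparse_genes(toy, max_na_fraction=0.25)))  # tolerate 25% NA
```

With the default threshold only complete genes survive, which is the strictest choice; a looser threshold keeps more genes but leaves missing values for later steps to handle.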
Future adjustments:
Removal of any platform effects in GA samples vs. Hi-Seq from BCGSC. The tumor types potentially affected will be LAML, STAD, and ESCA. Analysis is pending.
Possible adjustment of all samples from BCGSC to remove center effects between BCGSC and UNC. Tumor types potentially affected will be LAML, STAD, ESCA and OV. Analysis is pending.
Addition of microarray samples for GBM and OV.
Potential adjustment of DLBC for removal of batch effects. Analysis is pending.
Web tools
If you have downloaded all these data files but cannot write code, you will have to turn to web tools:
Broad Institute FireCloud (The Broad Institute)
cBioPortal for Cancer Genomics (Memorial Sloan-Kettering Cancer Center)
Next-Generation Clustered Heat Maps (MD Anderson Cancer Center)
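Final notes
This post contains many links, so you may need to click "Read the original" to follow them.
Also, these resource introductions are too basic to qualify for my 28-part TCGA tutorial series, so just skim them as you like.
28 TCGA tutorials - Fetching TCGA data with the R package cgdsr (cBioPortal)
28 TCGA tutorials - Fetching TCGA data with the R package RTCGA (offline bundled version)
28 TCGA tutorials - Fetching TCGA data with the R package RTCGAToolbox (FireBrowse portal)
28 TCGA tutorials - Bulk-downloading all TCGA data (UCSC Xena)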