单细胞转录组高级分析六:TCGA生存分析
单细胞转录组基础分析专题
We next utilized clinical data and bulk mRNA-seq data from The Cancer Genome Atlas (TCGA) (Liu et al., 2018) to determine whether CD4+ TFH signatures in HNSCC were related to survival (STAR Methods).
本篇文章将模拟原文的过程做一个生存分析,但是我在原文中找不到TFH signatures对应的200个基因,所以只能用其中的几个基因模拟一下。本次分析的结果与原文会有一定差异,大家关注生存分析如何操作即可。
UCSC Xena简介
UCSC是加利福尼亚大学圣克鲁兹分校(University of California, Santa Cruz)的简称,UCSC Xena是他们开发的一个癌症基因组学数据分析平台。该平台整合了多个癌症公共数据库的资源,比如来自TCGA, ICGC等大型癌症研究项目的数据,不仅可以下载各个数据集,还提供在线分析功能。
##安装生存分析R包
BiocManager::install(c("survival","survminer"), update = F)
##读取HNSC数据集的表达矩阵和临床信息
library(survival)
library(survminer)
library(tidyverse)
dir.create("Survival")
TCGAexpr <- read.delim("Data/TCGA-HNSC.htseq_fpkm.tsv", check.names=F)
id.annota <- read.delim("Data/gencode.v22.annotation.gene.probeMap", check.names=F)
id.annota <- id.annota[, 1:2]
TCGAexpr <- merge(id.annota, TCGAexpr, by.x="id", by.y="Ensembl_ID")
TCGAexpr <- aggregate(TCGAexpr[,-c(1,2)], list(TCGAexpr$gene), FUN=sum)
TCGAexpr <- column_to_rownames(TCGAexpr,"Group.1")
#过滤表达量过低的基因,减少分析数据量
TCGAexpr <- TCGAexpr[rowSums(TCGAexpr)>10, ] %>% as.matrix()
TCGAclin <- read.delim("Data/TCGA-HNSC.survival.tsv", check.names = F)
TCGAclin <- TCGAclin[TCGAclin$sample %in% names(TCGAexpr),]
TCGAclin <- TCGAclin[,-3]
we first defined our TFH gene set based on the top 200 differentially upregulated genes between CD4+ Tconv cluster 1 and CD4+ Tconv cluster 7, which represent the two terminal-most differentiated states of CD4+ T cells. As a test statistic for enrichment in the bulk expression profile from each patient, we used the Kolmogorov-Smirnov (KS) test to compare genes in the gene set versus those not in the gene set.
TFHset = list(c('PDCD1', 'CXCL13', 'TIGIT', 'TOX', 'MAF', 'CXCR5', 'CD40LG', 'IL6ST'))
names(TFHset) <- 'TFHset'
ES <- gsva(TCGAexpr, TFHset)
ES <- data.frame(t(ES), check.names=F)
TCGAclin <- merge(TCGAclin, ES, by.x="sample", by.y=0)
TCGAclin$TFH.set = ifelse(TCGAclin$TFHset>median(TCGAclin$TFHset),'high','low')
png('Survival/survival.png', 800, 600)
ggsurvplot(survfit(Surv(OS.time, OS)~TFH.set, data=TCGAclin), conf.int=F, pval=TRUE)
dev.off()
write.csv(TCGAclin, 'Survival/TCGAclin.csv', row.names = F)
res.cox <- coxph(Surv(OS.time, OS)~TFHset, data=TCGAclin)
res.cox
# Call:
# coxph(formula = Surv(OS.time, OS) ~ TFHset, data = TCGAclin)
# coef exp(coef) se(coef) z p
# TFHset -0.3702 0.6906 0.1371 -2.7 0.00694
# Likelihood ratio test=7.33 on 1 df, p=0.006786
# n= 545, number of events= 252
z值:Wald统计值,它对应于每个回归系数与其标准误差的比率(z = coef / se(coef))。
coef:回归系数,正数表示危险(死亡风险)较高,负数代表危险较低。
exp(coef):在此代表危险比(HR,hazard ratio)。
本教程的目的在于把单细胞分析流程串起来,适合有一定R语言基础的朋友参考。如果您学习本教程有一定困难,可以分享此篇文章到朋友圈,截图微信发给Kinesin(文末二维码),我会抽时间组群直播讲解单细胞数据分析的全过程。本专题所用的软件、数据、代码脚本和中间结果,我打包放在了百度云上,需要的朋友添加Kinesin微信索取。