(19)一个affymetrix表达芯片实战-生信菜鸟团博客2周年精选文章集

查看原文

其他

(19)一个affymetrix表达芯片实战-生信菜鸟团博客2周年精选文章集

原创 2017-01-10 1227278128 生信技能树

这个实例上部分包括：

如何用R包下载GEO数据(只限单一平台，其余平台需要修改下面的代码)

如何对GEO的芯片数据归一化并且得到表达量矩阵，

如何用limma包做差异分析，

对找到的差异基因如何做GO和KEGG注释

首先下载两个GEO数据：

平台是：Affymetrix U133 gene chips

67 diseased triple negative breast cancer samples（GSE31519 ）and 42 control samples (GSE20437）

都是表达量数据，同一种芯片。分成两组，正好做差异表达。

数据来源的文献是：

文章title：A clinically relevant gene signature in triple negative and basal-like breast cancer

结论（We describe a ratio of high B-cell presence and low IL-8 activity as a powerful new prognostic marker for TNBC. ）

地址：

GEO数据地址:

Platform: GPL96 67 Samples

Download data: GEO (CEL, TXT)

SeriesAccession: GSE31519ID: 200031519

文章title：Histologically normal epithelium from breast cancer patients and cancer-free prophylactic mastectomy patient

GEO数据地址:

Platform: GPL96 67 Samples

Download data: GEO (CEL, TXT)

SeriesAccession: GSE31519 ID: 200031519

Platform: GPL96 Series: GSE20437 42 Samples

Download data: GEO (CEL)

DataSetAccession: GDS3716 ID: 3716

我首先用R的GEOquery包来下载。(其实你完全可以直接去GEO网站下载数据，然后解压的)

suppressMessages(library(GEOquery))

setwd("D:\\test_analysis\\TNBC")

gse31519=getGEO("GSE31519",GSEMatrix = T,destdir = "./")

getGEOSuppFiles("GSE31519",baseDir = "./")

gse31519=getGEO("GSE20437",GSEMatrix = T,destdir = "./")

getGEOSuppFiles("GSE20437",baseDir = "./")

这样下载之后的数据都存在D:\\test_analysis\\TNBC下面

接下来我们就用affy包和limma来进行差异分析：

library(affy)

library(limma)

affy.data=ReadAffy(celfile.path="./cel_files")

请先搞清楚，ReadAffy 这个函数的用法！

当前工作目录下面有没有cel_files文件夹?

如果你搞不清楚什么是工作变量，什么是函数，什么是参数，如何赋值，请不要再看下去了，我懒得跟你说！

cel_files文件夹下面有没有文件？

eset.rma=rma(affy.data)

exprSet=exprs(eset.rma)

write.table(exprSet,"expr_rma_matrix.txt",quote=F,sep="\t")

group=factor(c(rep("control",42),rep("case",67)))

design = model.matrix(~0+group)

colnames(design)=c("case","control")

rownames(design)=sampleNames(affy.data)

fit=lmFit(exprSet,design)

cont.matrix = makeContrasts(contrasts="case-control",levels=design)

fit2=contrasts.fit(fit,cont.matrix)

fit2=eBayes(fit2)

diff_dat=topTable(fit2,coef=1,n=Inf)

write.table(diff_dat,"diff_dat.txt",quote=F)

这样得到的diff_dat就是我们差异分析的结果啦

we choose the log fold cut off change to be “2” to get a manageable set of genes.
原文说：we were able to get a list of 2567 genes after removing the duplicates and the not available genes

我们仅仅根据一个标准来挑选差异基因， the log fold cut off change to be “2”，我只挑出来了782个探针

接下来对这些探针进行注释，得到基因名，我这里用biomart包来进行注释

我们的平台是：Affymetrix U133 gene chips，虽然有22283个探针，但是只有13908个基因

所以代码如下：

ensembl=useMart("ensembl",dataset="hsapiens_gene_ensembl")

gene_probe=getBM(attributes=c("hgnc_symbol","affy_hg_u133a"),filter="affy_hg_u133a",value=rownames(diff_dat),mart=ensembl)

diff_probe=rownames(diff_dat[abs(diff_dat[,1])>2,])

diff_gene=gene_probe[match(diff_probe,gene_probe[,2]),1]

diff_gene=na.omit(diff_gene)

diff_gene=unique(diff_gene)

length(diff_gene)

这样会得到604个差异基因

然后我做一下GO和KEGG的富集分析！

gene_entrez=getBM(attributes=c("hgnc_symbol","entrezgene"),filter="hgnc_symbol",value=diff_gene,mart=ensembl)

## 必须安利一下Y叔的包，希望Y叔看到了会表扬我！

## 如果需要知道Y叔这个包的详细用法，请直接联系他本人！

require(DOSE)

require(clusterProfiler)

gene_entrez=na.omit(gene_entrez)

gene=as.character(gene_entrez[,2])

ego <- enrichGO(gene=gene,organism="human",ont="CC",pvalueCutoff=0.01,readable=TRUE)

ekk <- enrichKEGG(gene=gene,organism="human",pvalueCutoff=0.01,readable=TRUE)

write.csv(summary(ekk),"KEGG-enrich.csv",row.names =F)

write.csv(summary(ego),GO-enrich.csv,row.names =F)

懒得上传图片了，大家可以用同样的代码自己实现所有的流程

“家属和记者取得联系”：记者的退场意味深长

李尚福、魏凤和双双被拿下，与美国一份报告是否有关？

这位副市长，跨省升正厅

要么空仓！要么盯紧这个！

劲爆！为了姜萍两位女CEO互揭老底！