Processing Raw Agilent Microarray Data

生信技能树, 2022-08-24

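Introduction

I have stressed many times in our apprentice assignments that, of the three major microarray vendors, Agilent arrays are the hardest to handle; see for example "Processing Agilent expression matrices (apprentice assignment)" and "Can the oligo package handle Agilent arrays?". The assignment is a difficult one, but 小洁, one of our best instructors at 生信技能树, found time amid a heavy teaching load to write up her experience with this kind of data. Over to her:

This post covers the processing of raw Agilent microarray data, following the limma User's Guide. When the expression matrix downloaded from GEO is not what you expect, for example it is empty or contains negative values, you can reprocess the raw data yourself. Agilent arrays are widely used; as a worked example we take an OSCC (oral squamous cell carcinoma) dataset, GSE23558.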

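1. Download and read the data

1.1 Get the clinical information

In the past, downloading GEO data meant GEOquery: very capable, but often defeated by a slow network connection. Then came a GEO mirror hosted in China, accessed through the AnnoProbe package. It has not been formally published yet, but it is easy to learn and quick to download, and it is already popular among our readers.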

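rm(list = ls())   # start from a clean workspace
library(stringr)
library(AnnoProbe)
library(GEOquery)
library(limma)
gse = "GSE23558"
geoChina(gse)     # downloads the series and saves GSE23558_eSet.Rdata in the working directory

## you can also use getGEO from GEOquery, by 
## getGEO("GSE23558", destdir=".", AnnotGPL = FALSE, getGPL = FALSE)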

## $GSE23558_series_matrix.txt.gz
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 41000 features, 32 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: GSM577914 GSM577915 ... GSM577945 (32 total)
##   varLabels: title geo_accession ... tissue:ch1 (42 total)
##   varMetadata: labelDescription
## featureData: none
## experimentData: use 'experimentData(object)'
##   pubMedIds: 22072328
## 28433800 
## Annotation: GPL6480

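A GSE accession is all you need for the download. The series matrix has already been processed, so we do not use it; we only extract the clinical (phenotype) table, from which we derive the group labels.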

load("GSE23558_eSet.Rdata")
pd <- pData(gset[[1]])

Reorder the rows of pd to match the order in which the raw files are read, then define the groups.

raw_dir = "rawdata/GSE23558_RAW/"
raw_datas = paste0(raw_dir,"/",dir(raw_dir))

# reorder pd so that its rows follow the order of the raw data files
raw_order = str_extract(raw_datas,"GSM\\d*")
pd = pd[match(raw_order,rownames(pd)),]

group_list <- ifelse(stringr::str_detect(pd$title,"HealthyControl"),"normal","tumor")
group_list <- factor(group_list, levels=c("normal","tumor"))
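
The reordering step is easy to get silently wrong, so here is a minimal sketch of the same idea on toy data, with sanity checks added. The file names and titles below are made up for illustration, and base R's regmatches() stands in for str_extract() so that the block runs without extra packages.

```r
# Toy stand-ins for raw_datas and pd (hypothetical values, not the real GSE23558 data)
raw_files <- c("rawdata/GSM3.txt.gz", "rawdata/GSM1.txt.gz", "rawdata/GSM2.txt.gz")
pd_toy <- data.frame(
  title = c("HealthyControl_1", "OralTumor_1", "OralTumor_2"),
  row.names = c("GSM1", "GSM2", "GSM3")
)

# Extract the GSM accession from each file name (base R equivalent of str_extract)
raw_order <- regmatches(raw_files, regexpr("GSM[0-9]+", raw_files))

# match() gives, for each file, the row of pd_toy that describes it
idx <- match(raw_order, rownames(pd_toy))
stopifnot(!anyNA(idx))                      # every raw file has a phenotype row
pd_toy <- pd_toy[idx, , drop = FALSE]
stopifnot(identical(rownames(pd_toy), raw_order))

# Group labels now follow the file order, not the original pd order
group_toy <- factor(ifelse(grepl("HealthyControl", pd_toy$title), "normal", "tumor"),
                    levels = c("normal", "tumor"))
table(group_toy)
```

The two stopifnot() calls are the point of the sketch: if a raw file had no matching row in the phenotype table, match() would return NA and the group labels would be misaligned without any error.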

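1.2 Read the raw data

The raw data are downloaded from the GEO series page, which can be somewhat demanding on your network connection; see the post "Is downloading GEO data too slow? Try axel".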

x <- read.maimages(raw_datas,
                   source="agilent", 
                   green.only=TRUE,
                   other.columns = "gIsWellAboveBG")

## Read rawdata/GSE23558_RAW//GSM577914.txt 
## Read rawdata/GSE23558_RAW//GSM577915.txt 
## Read rawdata/GSE23558_RAW//GSM577916.txt 
## Read rawdata/GSE23558_RAW//GSM577917.txt 
## Read rawdata/GSE23558_RAW//GSM577918.txt 
## Read rawdata/GSE23558_RAW//GSM577919.txt 
## Read rawdata/GSE23558_RAW//GSM577920.txt 
## Read rawdata/GSE23558_RAW//GSM577921.txt 
## Read rawdata/GSE23558_RAW//GSM577922.txt 
## Read rawdata/GSE23558_RAW//GSM577923.txt 
## Read rawdata/GSE23558_RAW//GSM577924.txt 
## Read rawdata/GSE23558_RAW//GSM577925.txt 
## Read rawdata/GSE23558_RAW//GSM577926.txt 
## Read rawdata/GSE23558_RAW//GSM577927.txt 
## Read rawdata/GSE23558_RAW//GSM577928.txt 
## Read rawdata/GSE23558_RAW//GSM577929.txt 
## Read rawdata/GSE23558_RAW//GSM577930.txt 
## Read rawdata/GSE23558_RAW//GSM577931.txt 
## Read rawdata/GSE23558_RAW//GSM577932.txt 
## Read rawdata/GSE23558_RAW//GSM577933.txt 
## Read rawdata/GSE23558_RAW//GSM577934.txt 
## Read rawdata/GSE23558_RAW//GSM577935.txt 
## Read rawdata/GSE23558_RAW//GSM577936.txt 
## Read rawdata/GSE23558_RAW//GSM577937.txt 
## Read rawdata/GSE23558_RAW//GSM577938.txt 
## Read rawdata/GSE23558_RAW//GSM577939.txt 
## Read rawdata/GSE23558_RAW//GSM577940.txt 
## Read rawdata/GSE23558_RAW//GSM577941.txt 
## Read rawdata/GSE23558_RAW//GSM577942.txt 
## Read rawdata/GSE23558_RAW//GSM577943.txt 
## Read rawdata/GSE23558_RAW//GSM577944.txt 
## Read rawdata/GSE23558_RAW//GSM577945.txt

dim(x)

## [1] 45015    32

2. Background correction and normalization

y <- backgroundCorrect(x, method="normexp")

## Array 1 corrected
## Array 2 corrected
## Array 3 corrected
## Array 4 corrected
## Array 5 corrected
## Array 6 corrected
## Array 7 corrected
## Array 8 corrected
## Array 9 corrected
## Array 10 corrected
## Array 11 corrected
## Array 12 corrected
## Array 13 corrected
## Array 14 corrected
## Array 15 corrected
## Array 16 corrected
## Array 17 corrected
## Array 18 corrected
## Array 19 corrected
## Array 20 corrected
## Array 21 corrected
## Array 22 corrected
## Array 23 corrected
## Array 24 corrected
## Array 25 corrected
## Array 26 corrected
## Array 27 corrected
## Array 28 corrected
## Array 29 corrected
## Array 30 corrected
## Array 31 corrected
## Array 32 corrected

y <- normalizeBetweenArrays(y, method="quantile")
class(y)

## [1] "EList"
## attr(,"package")
## [1] "limma"
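
To give a feel for what quantile normalization does, here is a minimal base R sketch on a toy matrix. It is only an illustration of the idea: quantile_normalize() is a hypothetical helper, it assumes no missing values, and it breaks ties by position, whereas limma's own implementation may treat ties and NAs differently.

```r
# Hypothetical helper: force every column to share the same distribution
quantile_normalize <- function(m) {
  ranks  <- apply(m, 2, rank, ties.method = "first")  # rank of each value within its column
  sorted <- apply(m, 2, sort)                          # each column sorted ascending
  target <- rowMeans(sorted)                           # shared reference distribution
  out <- apply(ranks, 2, function(r) target[r])        # map ranks back to reference values
  dimnames(out) <- dimnames(m)
  out
}

# Toy matrix: 4 probes (rows) by 3 arrays (columns)
m <- cbind(s1 = c(5, 2, 3, 4),
           s2 = c(4, 1, 6, 2),
           s3 = c(3, 4, 6, 8))
qn <- quantile_normalize(m)

# After normalization, every column contains the same set of values
apply(qn, 2, sort)
```

Within each array the ranking of the probes is unchanged; only the values are replaced so that all arrays share one distribution, which is why boxplots of the normalized data line up.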

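3. Gene filtering

Remove control probes.
Remove probes with no matching gene symbol.
Remove probes that are not expressed. The criterion is that a probe must be above background in at least half of the samples, judged from y$other$gIsWellAboveBG.
I added one rule of my own: for genes measured by several probes, keep only one probe.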

Control <- y$genes$ControlType==1L;table(Control)

## Control
## FALSE  TRUE 
## 43529  1486

NoSymbol <- is.na(y$genes$GeneName);table(NoSymbol)

## NoSymbol
## FALSE 
## 45015

IsExpr <- rowSums(y$other$gIsWellAboveBG > 0) >= 16;table(IsExpr)

## IsExpr
## FALSE  TRUE 
## 13088 31927

Isdup <- duplicated(y$genes$GeneName);table(Isdup)

## Isdup
## FALSE  TRUE 
## 30328 14687

yfilt <- y[!Control & !NoSymbol & IsExpr & !Isdup, ]
dim(yfilt)

## [1] 20650    32

As you can see, a little over 20,000 probes remain after filtering.
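
The four filters can be sketched on toy data as follows. This is a hedged variant, not the code used above: it additionally treats probe-ID-like names as missing symbols, since the NoSymbol table above shows no NA values and names such as A_32_P77178 survive the filtering, and it drops any non-zero ControlType on the assumption that Agilent files may also flag negative controls with negative values (worth checking with table(y$genes$ControlType)). The regular expression and all toy values are assumptions.

```r
# Toy annotation: one control, one duplicated gene, one probe-ID-like name
genes <- data.frame(
  ControlType = c(0L, 1L, 0L, 0L, 0L, 0L),
  GeneName    = c("TP53", "GE_BrightCorner", "TP53", "A_32_P77178", "EGFR", "MYC"),
  stringsAsFactors = FALSE
)

# Toy "well above background" flags: 6 probes (rows) by 4 samples (columns)
above_bg <- matrix(c(1, 1, 1, 1,
                     1, 1, 1, 1,
                     1, 1, 0, 1,
                     1, 1, 1, 0,
                     0, 1, 1, 0,
                     0, 0, 1, 0), nrow = 6, byrow = TRUE)

is_control <- genes$ControlType != 0L                        # any kind of control probe
no_symbol  <- is.na(genes$GeneName) |
  grepl("^A_[0-9]+_P[0-9]+$", genes$GeneName)                # name is really a probe ID (assumed pattern)
is_expr    <- rowSums(above_bg > 0) >= ncol(above_bg) / 2    # above background in at least half the samples
is_dup     <- duplicated(genes$GeneName)                     # second and later probes of a gene

keep <- !is_control & !no_symbol & is_expr & !is_dup
genes$GeneName[keep]
```

One design point to be aware of: because duplicated() is computed before the other filters, a gene whose first probe fails the expression filter is lost even if a later probe passes. Computing the duplicate flag on the already-filtered object avoids this.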

4. Get the expression matrix

exp = yfilt$E   # log2 expression matrix (the same values as yfilt@.Data[[1]])
boxplot(exp)

exp[1:2,1:2]

##      rawdata/GSE23558_RAW//GSM577914 rawdata/GSE23558_RAW//GSM577915
## [1,]                        9.284154                       11.473334
## [2,]                        7.341236                        7.474406

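The values in the matrix are fine, but both the row names and the column names need fixing: rows should be named by probe and columns by sample.

4.1 Get the sample names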

colnames(exp) = str_extract(colnames(exp),"GSM\\d*")
exp[1:2,1:2]

##      GSM577914 GSM577915
## [1,]  9.284154 11.473334
## [2,]  7.341236  7.474406

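4.2 Get the gene names

The limma documentation uses an annotation R package for this step, but in this example the raw files already contain the probe annotation, so we use it directly.

The row names of exp can be converted directly to gene names. Row names must be unique, so duplicates are removed first.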

anno = yfilt$genes
nrow(anno);nrow(exp)

## [1] 20650

## [1] 20650

rownames(exp)=rownames(anno)
ids = unique(anno$GeneName)
exp = exp[!duplicated(anno$GeneName),]
rownames(exp) = ids
exp[1:4,1:4]

##             GSM577914 GSM577915 GSM577916 GSM577917
## APOBEC3B     9.284154 11.473334 10.439071 11.661000
## A_32_P77178  7.341236  7.474406  7.310818  7.397149
## ATP11B       9.963452  8.915621 10.193873  9.321954
## DNAJA1      13.469790 13.201078 12.827357 13.389431
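
The code above keeps whichever probe of a gene appears first. A common alternative, shown here only as a sketch on made-up numbers, is to keep the probe with the highest mean expression for each gene. All object names and values below are hypothetical.

```r
# Toy expression matrix: 3 probes (rows) by 2 samples (columns)
expr <- matrix(c(5, 6,
                 9, 8,
                 7, 7), nrow = 3, byrow = TRUE,
               dimnames = list(c("p1", "p2", "p3"), c("s1", "s2")))
symbol <- c("TP53", "TP53", "EGFR")          # p1 and p2 measure the same gene

# Sort probes by mean expression, highest first
ord           <- order(rowMeans(expr), decreasing = TRUE)
expr_sorted   <- expr[ord, , drop = FALSE]
symbol_sorted <- symbol[ord]

# After sorting, the first occurrence of each symbol is its most highly expressed probe
keep      <- !duplicated(symbol_sorted)
expr_gene <- expr_sorted[keep, , drop = FALSE]
rownames(expr_gene) <- symbol_sorted[keep]
expr_gene
```

Which rule is preferable depends on the array and the question; the sketch only shows that the choice can be made explicit rather than left to probe order in the file.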

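At this point we have a standard expression matrix. What you do next is up to you; in effect we have repaired the already-normalized expression matrix provided by the database.

5. Differential expression analysis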

design <- model.matrix(~group_list)
fit <- lmFit(exp,design)
fit <- eBayes(fit,trend=TRUE,robust=TRUE)
summary(decideTests(fit))

##        (Intercept) group_listtumor
## Down             0            2102
## NotSig           0           16928
## Up           20650            1620

deg = topTable(fit,coef=2,number=Inf)   # coef=2 is the tumor vs normal column; number=Inf returns all genes
boxplot(exp[rownames(deg)[1],]~group_list)

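Here we simply follow limma's basic workflow. Plotting the expression of the most significant gene shows a very clear difference between the two groups.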
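
To see why coef=2 picks out the tumor versus normal comparison, the design matrix can be inspected on a toy example. The sketch below uses made-up values for a single gene and an ordinary least squares fit from base R; limma fits the same kind of linear model for every gene, and eBayes then moderates the variances, which this sketch does not attempt.

```r
# Toy grouping: 2 normal and 3 tumor samples (hypothetical)
group_list <- factor(c("normal", "normal", "tumor", "tumor", "tumor"),
                     levels = c("normal", "tumor"))
design <- model.matrix(~group_list)
colnames(design)   # "(Intercept)" "group_listtumor"

# Made-up log2 expression values for one gene
y_gene <- c(5.0, 5.2, 7.1, 6.9, 7.0)

# Ordinary least squares on the same design matrix
ols <- lm.fit(design, y_gene)
ols$coefficients
# (Intercept) is the mean of the normal group;
# group_listtumor is mean(tumor) - mean(normal), i.e. the log2 fold change
```

Because "normal" is the first factor level, it becomes the reference group, so a positive value in the second column means higher expression in tumor.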

save(exp,group_list,deg,gse,file = paste0(gse,"deg.Rdata"))
