RNA-seq入门实战(十一):WGCNA加权基因共表达网络分析——关联基因模块与表型
连续两次求贤令:曾经我给你带来了十万用户,但现在祝你倒闭,以及 生信技能树知识整理实习生招募,让我走大运结识了几位优秀小伙伴!大家开始根据我的ngs组学视频进行一系列公共数据集分析实战,其中几个小伙伴让我非常惊喜,不需要怎么沟通和指导,就默默的完成了一个实战!
他前面的分享是:
下面是他对我们b站转录组视频课程的详细笔记
承接上节:RNA-seq入门实战(四):差异分析前的准备——数据检查,以及 RNA-seq入门实战(五):差异分析——DESeq2 edgeR limma的使用与比较
我们《生信技能树》这些年有很多关于WGCNA的实战细节建议分享,见:
一文学会WGCNA分析 一文看懂WGCNA 分析(2019更新版) (点击阅读原文即可拿到测序数据) 明码标价之公共数据集的WGCNA 通过WGCNA作者的测试数据来学习 重复一篇WGCNA分析的文章(代码版) 重复一篇WGCNA分析的文章(解读版)(逆向收费读文献2019-19)
其实WGCNA本身是对基因进行合理(加权共表达)的分组。
如果一个表达量矩阵, 里面的样品是两个分组,比如正常和对照,那么简单的差异分析就可以拿到上下调基因,各自可以去富集生物学通路,就是基因分组了,并没有太多的进行WGCNA分析的必要性,而且绝大部分的两个分组的表达量矩阵里面的样品数量通常是小于15个的,官方也并不推荐WGCNA分析。 如果一个表达量矩阵,里面的样品是时间序列这样的多分组,比如处理前后以及处理过程的不同时间梯度或者不同浓度,那么我们就需要每个分组都去跟对照组进行差异分析,上下调的组合非常多,结果也很难精炼出生物学结论,这个时候就可以选择WGCNA或者mfuzz这样的时间序列分析。
也就是说,只要是多分组,就涉及到多次差异分析,而且多分组意味着样品数量肯定不少,这样的话,在这个表达量矩阵里面,不同基因之间可以计算合理的相关性, 就可以根据基因之间的相似性进行基因划分为不同的模块了。
本节概览
1. WGCNA基本概念:定义、关键术语、基本流程、一些注意事项
2. WGCNA运行:
⓪输入数据准备
①判断数据质量,绘制样品的系统聚类树
②挑选最佳阈值power
③ 构建加权共表达网络( 一步法和分步法),识别基因模块
④ 关联基因模块与表型:模块与表型相关性热图、模块与表型相关性boxplot图、基因与模块、表型相关性散点图
⑤ WGCNA的标配热图 ,模块相关性展示
⑥ 对感兴趣模块的基因进行批量GO分析
⑦ 感兴趣模块绘制热图
⑧ 提取感兴趣模块的基因名, 导出基因至 VisANT 或 cytoscape作图
简单来说,WGCNA其实相当于是对多个复杂分组进行的差异分析,用于找寻不同分组/表型的特征基因模块,从而进行下一步分析(如可以对模块内的基因进行GO富集、PPI分析等等).其最最核心之处就在于能将基因模块与样本表型进行关联。
1. WGCNA基本概念
1.1 定义
WGCNA(Weighted Gene Co-Expression Network Analysis ),即加权基因共表达网络分析,用于寻找高度相关的基因构成的基因模块module,利用模块特征基因eigengene(模块内第一主成分)或模块内的关键基因Hub gene来总结这些模块,将模块与样本性状进行关联。
1.2 其他关键术语
关键术语(Connectivity、Adjacency matrix、TOM等)见:
Simulated-00-Background.pdf (ucla.edu)一文看懂WGCNA 分析WGCNA分析,简单全面的最新教程
1.3 基本流程
构建基因共表达网络 >> 识别基因模块 >> 关联基因模块与表型 >> 研究基因模块间关系 >> 从感兴趣的基因模块中寻找关键驱动基因
1.4 一些注意事项
WGCNA package: Frequently Asked Questions (ucla.edu)
样本数 >= 15
基因过滤方法采用均值、方差、中位数、绝对中位差MAD等方法,过滤低表达或样本间变化小的基因,但不建议用差异分析的方法进行过滤
输入数据形式如果有批次效应,需要先进行去除;
处理RNAseq数据,需要采用DESeq2的varianceStabilizingTransformation方法,或将基因标准化后的数据(如FPKM、CPM等)进行log2(x+1)转化经验软阈值power当无向网络在power小于15或有向网络power小于30内,计算出的power无法达到要求时(即没有一个power值可以使无标度网络图谱结构R^2达到0.8且平均连接度降到100以下),可采用经验power值:
2. WGCNA运行
主要参考修改自以下代码:
GEO/task4-NPC at master · jmzeng1314/GEO · GitHub手把手10分文章WGCNA复现:小胶质细胞亚群在脑发育时髓鞘形成的作用
以下实战数据来自于https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE154290,Supplementary file中的GSE154290_RNA.FPKM.txt.gz文件
⓪ 输入数据准备
读取fpkm表达矩阵,进行log2(x+1)转化;最后对矩阵进行转置,变成行为样品、列为基因的形式
rm(list = ls())
options(stringsAsFactors = F)
Sys.setenv(LANGUAGE = "en")
library(WGCNA)
library(FactoMineR)
library(factoextra)
library(tidyverse) # ggplot2 stringer dplyr tidyr readr purrr tibble forcats
library(data.table) #多核读取文件
dir.create('8.WGCNA')
setwd('8.WGCNA')
### 启用WGCNA多核计算
enableWGCNAThreads(nThreads = 0.75*parallel::detectCores())
################################# 0.输入数据准备 ################################
### 读取表达矩阵
if (T) {
fpkm00 <- fread("../GSE154290_RNA.FPKM.txt.gz",data.table = F)
### 合并具有相同基因名的行
table(duplicated(fpkm00$gene)) #统计重复基因名的基因
gene <- fpkm00$gene ; fpkm00 <- fpkm00[,-1]
fpkm0 <- aggregate(fpkm00, by=list(gene), FUN=sum)
fpkm <- column_to_rownames(fpkm0,"Group.1")
}
data <- log2(fpkm+1)
### 筛选MAD前5000的基因
keep_data <- data[order(apply(data,1,mad), decreasing = T)[1:5000],]
### 创建datTraits,包含分组、表型等信息
datTraits <- data.frame(row.names = colnames(data),group=colnames(data))
fix(datTraits)
### 给分组加上编号
grouptype <- data.frame(group=sort(unique(datTraits$group)),
groupNo=1:length(unique(datTraits$group)))
# fix(grouptype)
datTraits$groupNo = "NA"
for(i in 1:nrow(grouptype)){
datTraits[which(datTraits$group == grouptype$group[i]),'groupNo'] <- grouptype$groupNo[i]}
datTraits
### 转置
datExpr0 <- as.data.frame(t(keep_data))
在这里根据细胞的多能性状态,将这些细胞粗略分为3组,并按其发育顺序编号为123,添加到了其表型数据中:
datTraits
① 判断数据质量,绘制样品的系统聚类树
查看样本和基因的数据质量,去除低质量数据;绘制样品的系统聚类树,若存在一个显著离群点则需要剔除掉;还可以做PCA图查看样品分布情况
############################## 1.判断数据质量 ################################
### 判断数据质量--缺失值
gsg <- goodSamplesGenes(datExpr0,verbose = 3)
gsg$allOK
if (!gsg$allOK){
# Optionally, print the gene and sample names that were removed:
if (sum(!gsg$goodGenes)>0)
printFlush(paste("Removing genes:", paste(names(datExpr0)[!gsg$goodGenes],
collapse = ", ")));
if (sum(!gsg$goodSamples)>0)
printFlush(paste("Removing samples:",
paste(rownames(datExpr0)[!gsg$goodSamples], collapse = ", ")));
# Remove the offending genes and samples from the data:
datExpr0 = datExpr0[gsg$goodSamples, gsg$goodGenes]
}
gsg <- goodSamplesGenes(datExpr0,verbose = 3)
gsg$allOK
### 绘制样品的系统聚类树
if(T){
#针对样本做聚类树
sampleTree <- hclust(dist(datExpr0), method = "average")
par(mar = c(0,5,2,0))
plot(sampleTree, main = "Sample clustering", sub="", xlab="", cex.lab = 2,
cex.axis = 1, cex.main = 1,cex.lab=1)
## 若样本有性状、表型,可以添加对应颜色,查看是否聚类合理
sample_colors <- numbers2colors(as.numeric(factor(datTraits$group)),
colors = rainbow(length(table(datTraits$group))),
signed = FALSE)
## 绘制样品的系统聚类树及对应性状
par(mar = c(1,4,3,1),cex=0.8)
pdf("step1_Sample dendrogram and trait.pdf",width = 8,height = 6)
plotDendroAndColors(sampleTree, sample_colors,
groupLabels = "trait",
cex.dendroLabels = 0.8,
marAll = c(1, 4, 3, 1),
cex.rowText = 0.01,
main = "Sample dendrogram and trait" )
## Plot a line to show the cut
# abline(h = 23500, col = "red") #根据实际情况而定
dev.off()
}
##若存在显著离群点;剔除掉
if(F){
clust <- cutreeStatic(sampleTree, cutHeight = 23500, minSize = 10) # cutHeight根据实际情况而定
table(clust)
keepSamples <- (clust==1)
datExpr0 <- datExpr0[keepSamples, ]
datTraits <- datTraits[keepSamples,]
dim(datExpr0)
}
### 判断数据质量 : PCA进行分组查看
rm(list = ls())
load("step1_input.Rdata")
group_list <- datTraits$group
dat.pca <- PCA(datExpr, graph = F)
pca <- fviz_pca_ind(dat.pca,
title = "Principal Component Analysis",
legend.title = "Groups",
geom.ind = c("point","text"), #"point","text"
pointsize = 2,
labelsize = 4,
repel = TRUE, #标签不重叠
col.ind = group_list, # 分组上色
axes.linetype=NA, # remove axeslines
mean.point=F#去除分组中心点
) +
theme(legend.position = "none")+ # "none" REMOVE legend
coord_fixed(ratio = 1) #坐标轴的纵横比
pca
ggsave(pca,filename= "step1_Sample PCA analysis.pdf", width = 8, height = 8)
##保存数据
datExpr <- datExpr0
nGenes <- ncol(datExpr)
nSamples <- nrow(datExpr)
save(nGenes,nSamples,datExpr,datTraits,file="step1_input.Rdata")
下图可看出此次聚类效果还是比较好的,没有离群值,因此不用进行剔除样本
step1_Sample dendrogram and trait.pdf
step1_Sample PCA analysis.pdf
② 挑选最佳阈值power
挑选标准:R^2 > 0.8 , slope ≈ -1,选择R^2 与 power关系图的拐点处的power值
https://horvath.genetics.ucla.edu/html/GeneralFramework/YeastTutorialHorvath.pdf
sft$powerEstimate可以给出所设定R^2 cut-off(默认为0.85)处的power值
############################### 2.挑选最佳阈值power ###################################
rm(list = ls())
load("step1_input.Rdata")
R.sq_cutoff = 0.8 #设置R^2 cut-off,默认为0.85
if(T){
# Call the network topology analysis function
#设置power参数选择范围
powers <- c(seq(1,20,by = 1), seq(22,30,by = 2))
sft <- pickSoftThreshold(datExpr,
networkType = "unsigned",
powerVector = powers,
RsquaredCut = R.sq_cutoff,
verbose = 5)
#SFT.R.sq > 0.8 , slope ≈ -1
pdf("step2_power-value.pdf",width = 16,height = 12)
# Plot the results: 寻找拐点,确认最终power取值
par(mfrow = c(1,2));
cex1 = 0.9;
# Scale-free topology fit index as a function of the soft-thresholding power
plot(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
xlab="Soft Threshold (power)",ylab="Scale Free Topology Model Fit,signed R^2",type="n")
text(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
labels=powers,cex=cex1,col="red")
# this line corresponds to using an R^2 cut-off of h
abline(h=R.sq_cutoff ,col="red")
# Mean connectivity as a function of the soft-thresholding power
plot(sft$fitIndices[,1], sft$fitIndices[,5],
xlab="Soft Threshold (power)",ylab="Mean Connectivity", type="n")
text(sft$fitIndices[,1], sft$fitIndices[,5], labels=powers, cex=cex1,col="red")
abline(h=100,col="red")
dev.off()
}
sft$powerEstimate #查看估计的最佳power
# power = sft$powerEstimate
power = 15
# 若无向网络在power小于15或有向网络power小于30内,没有一个power值使
# 无标度网络图谱结构R^2达到0.8且平均连接度在100以下,可能是由于
# 部分样品与其他样品差别太大。这可能由批次效应、样品异质性或实验条件对
# 表达影响太大等造成。可以通过绘制样品聚类查看分组信息和有无异常样品。
# 如果这确实是由有意义的生物变化引起的,也可以使用下面的经验power值。
if(is.na(power)){
# 官方推荐 "signed" 或 "signed hybrid"
# 为与原文档一致,故未修改
type = "unsigned"
nSamples=nrow(datExpr)
power = ifelse(nSamples<20, ifelse(type == "unsigned", 9, 18),
ifelse(nSamples<30, ifelse(type == "unsigned", 8, 16),
ifelse(nSamples<40, ifelse(type == "unsigned", 7, 14),
ifelse(type == "unsigned", 6, 12))
)
)
}
save(sft, power, file = "step2_power_value.Rdata")
从下图可看出拐点大致在power为16时出现,且各项数值基本满足挑选标准,因此设定power=16
step2_power-value.pdf
③ 构建加权共表达网络,识别基因模块
构建加权共表达网络可选择一步法(one step)或 分步法(step by step)进行。
一般情况下优先选择采用简便的一步法,当想要调整得到的基因模块数目时,采用分步法更灵活
一步法构建加权共表达网络,识别基因模块
主要调整参数为minModuleSize和mergeCutHeight , 数值越大模块越少
##################### 3.一步法构建加权共表达网络,识别基因模块 ####################
load(file = "step1_input.Rdata")
load(file = "step2_power_value.Rdata")
if(T){
net <- blockwiseModules(
datExpr,
power = power,
maxBlockSize = ncol(datExpr),
corType = "pearson", #默认为"pearson","bicor"则更能考虑离群点的影响
networkType = "unsigned",
TOMType = "unsigned",
minModuleSize = 30, ##越大模块越少
mergeCutHeight = 0.25, ##越大模块越少
numericLabels = TRUE,
saveTOMs = F,
verbose = 3
)
table(net$colors)
# power: 上一步计算的软阈值
# maxBlockSize:计算机能处理的最大模块的基因数量(默认5000),16G内存可以处理2万个,
# 计算资源允许的情况下最好放在一个block里面。
# corType:计算相关性的方法;可选pearson(默认),bicor。后者更能考虑离群点的影响。
# networkType:计算邻接矩阵时,是否考虑正负相关性;默认为"unsigned",可选"signed", "signed hybrid"
# TOMType:计算TOM矩阵时,是否考虑正负相关性;默认为"signed",可选"unsigned"。但是根据幂律转换的邻接矩阵(权重)的非负性,所以认为这里选择"signed"也没有太多的意义。
# numericLabels: 返回数字而不是颜色作为模块的名字,后面可以再转换为颜色
# saveTOMs:最耗费时间的计算,可存储起来供后续使用,
# mergeCutHeight: 合并模块的阈值,越大模块越少,一般为0.25
# minModuleSize: 每个模块里最少放多少个基因,设定越大模块越少
# 输出结果根据模块中基因数目的多少,降序排列,依次编号为 `1-最大模块数`。
# **0 (grey)**表示**未**分入任何模块的基因。
}
## 模块可视化,层级聚类树展示各个模块
if(T){
# Convert labels to colors for plotting
moduleColors <- labels2colors(net$colors)
table(moduleColors)
# Plot the dendrogram and the module colors underneath
pdf("step3_genes-modules_ClusterDendrogram.pdf",width = 16,height = 12)
plotDendroAndColors(net$dendrograms[[1]], moduleColors[net$blockGenes[[1]]],
"Module colors",
dendroLabels = FALSE, hang = 0.03,
addGuide = TRUE, guideHang = 0.05)
dev.off()
}
save(net, moduleColors, file = "step3_genes_modules.Rdata")
step3_genes-modules_ClusterDendrogram.pdf
分步法构建加权共表达网络,识别基因模块
与一步法类似,主要调整参数为minModuleSize和MEDissThres , 数值越大模块越少
##################### 分布法完成网络构建,一般不用 #################################
if(F){
## 构建加权共表达网络分为两步:
## 1. 计算邻近值,也是就是两个基因在不同样品中表达量的表达相关系数(pearson correlation rho),
## 2. 计算topology overlap similarity (TOM)。用TOM表示两个基因在网络结构上的相似性,即两个基因如果具有相似的邻近基因,这两个基因更倾向于有相互作用。
###(1)网络构建 Co-expression similarity and adjacency
adjacency = adjacency(datExpr, power = power)
###(2) 邻近矩阵到拓扑矩阵的转换,Turn adjacency into topological overlap
TOM = TOMsimilarity(adjacency)
dissTOM = 1-TOM
###(3) 聚类拓扑矩阵 Clustering using TOM
# Call the hierarchical clustering function
geneTree = hclust(as.dist(dissTOM), method = "average");
# Plot the resulting clustering tree (dendrogram)
## 这个时候的geneTree与一步法的 net$dendrograms[[1]] 性质类似,但是还需要进行进一步处理
plot(geneTree, xlab="", sub="", main = "Gene clustering on TOM-based dissimilarity",
labels = FALSE, hang = 0.04)
###(4) 聚类分支的修整 dynamicTreeCut
################# set the minimum module size ##############################
minModuleSize = 30
####
# Module identification using dynamic tree cut:
dynamicMods = cutreeDynamic(dendro = geneTree, distM = dissTOM,
deepSplit = 2, pamRespectsDendro = FALSE,
minClusterSize = minModuleSize)
table(dynamicMods)
# Convert numeric lables into colors
dynamicColors = labels2colors(dynamicMods)
table(dynamicColors)
# Plot the dendrogram and colors underneath
plotDendroAndColors(geneTree, dynamicColors, "Dynamic Tree Cut",
dendroLabels = FALSE, hang = 0.03,
addGuide = TRUE, guideHang = 0.05,
main = "Gene dendrogram and module colors")
###(5) 聚类结果相似模块的融合 Merging of modules whose expression profiles are very similar
# Calculate eigengenes
MEList = moduleEigengenes(datExpr, colors = dynamicColors)
MEs = MEList$eigengenes
# Calculate dissimilarity of module eigengenes
MEDiss = 1-cor(MEs)
# Cluster module eigengenes
METree = hclust(as.dist(MEDiss), method = "average")
#一般选择 height cut 为0.25,对应于有75%相关性,进行融合
###################### set Merging height cut ################################
MEDissThres = 0.5
####
# Plot the result
plot(METree, main = "Clustering of module eigengenes",
xlab = "", sub = "")
# Plot the cut line into the dendrogram
abline(h=MEDissThres, col = "red")
# Call an automatic merging function
merge = mergeCloseModules(datExpr, dynamicColors, cutHeight = MEDissThres, verbose = 3)
# The merged module colors
mergedColors = merge$colors
# Eigengenes of the new merged modules:
mergedMEs = merge$newMEs
# 统计mergedmodule
table(mergedColors)
### (6) plot the gene dendrogram
pdf(file = "step3_stepbystep_DynamicTreeCut_genes-modules.pdf", width = 16,height = 12)
plotDendroAndColors(geneTree, cbind(dynamicColors, mergedColors),
c("Dynamic Tree Cut", "Merged dynamic"),
dendroLabels = FALSE, hang = 0.03,
addGuide = TRUE, guideHang = 0.05)
dev.off()
### 保存数据
# Rename to moduleColors
moduleColors = mergedColors
# Construct numerical labels corresponding to the colors
colorOrder = c("grey", standardColors(50))
moduleLabels = match(moduleColors, colorOrder)-1
MEs = mergedMEs
# Save module colors and labels for use in subsequent parts
save(MEs, moduleLabels, moduleColors, geneTree,
file = "step3_stepByStep_genes_modules.Rdata")
}
step3_stepbystep_DynamicTreeCut_genes-modules.pdf
④ 关联基因模块与表型
绘制模块与表型的相关性热图、模块与表型的相关性boxplot图、基因与模块、表型的相关性散点图;
结合所得各类相关性结果,判断出哪些模块与表型是高度相关的,从而关联基因模块与表型
####################### 4.关联基因模块与表型 #####################################
rm(list = ls())
load(file = "step1_input.Rdata")
load(file = "step2_power_value.Rdata")
load(file = "step3_genes_modules.Rdata")
## 模块与表型的相关性热图
if(T){
datTraits$group <- as.factor(datTraits$group)
design <- model.matrix(~0+datTraits$group)
colnames(design) <- levels(datTraits$group) #get the group
MES0 <- moduleEigengenes(datExpr,moduleColors)$eigengenes #Calculate module eigengenes.
MEs <- orderMEs(MES0) #Put close eigenvectors next to each other
moduleTraitCor <- cor(MEs,design,use = "p")
moduleTraitPvalue <- corPvalueStudent(moduleTraitCor,nSamples)
textMatrix <- paste0(signif(moduleTraitCor,2),"\n(",
signif(moduleTraitPvalue,1),")")
dim(textMatrix) <- dim(moduleTraitCor)
pdf("step4_Module-trait-relationship_heatmap.pdf",
width = 2*length(colnames(design)),
height = 0.6*length(names(MEs)) )
par(mar=c(5, 9, 3, 3)) #留白:下、左、上、右
labeledHeatmap(Matrix = moduleTraitCor,
xLabels = colnames(design),
yLabels = names(MEs),
ySymbols = names(MEs),
colorLabels = F,
colors = blueWhiteRed(50),
textMatrix = textMatrix,
setStdMargins = F,
cex.text = 0.5,
zlim = c(-1,1),
main = "Module-trait relationships")
dev.off()
save(design, file = "step4_design.Rdata")
}
### 模块与表型的相关性boxplot图
if(T){
mes_group <- merge(MEs,datTraits,by="row.names")
library(gplots)
library(ggpubr)
library(grid)
library(gridExtra)
draw_ggboxplot <- function(data,Module="Module",group="group"){
ggboxplot(data,x=group, y=Module,
ylab = paste0(Module),
xlab = group,
fill = group,
palette = "jco",
#add="jitter",
legend = "") +stat_compare_means()
}
# 批量画boxplot
colorNames <- names(MEs)
pdf("step4_Module-trait-relationship_boxplot.pdf", width = 7.5,height = 1.6*ncol(MEs))
p <- lapply(colorNames,function(x) {
draw_ggboxplot(mes_group, Module = x, group = "group")
})
do.call(grid.arrange,c(p,ncol=2)) #排布为每行2个
dev.off()
}
### 基因与模块、表型的相关性散点图
#所有的模块都可以跟基因算出相关系数,所有的连续型性状也可以跟基因算出相关系数,
#如果跟性状显著相关的基因也跟某个模块显著相关,那么这些基因可能就非常重要。
# 选择离散性状的表型
levels(datTraits$group)
choose_group <- "primed"
if(T){
modNames <- substring(names(MEs), 3)
### 计算模块与基因的相关性矩阵
## Module Membership: 模块内基因表达与模块特征值的相关性
geneModuleMembership <- as.data.frame(cor(datExpr, MEs, use = "p"))
MMPvalue <- as.data.frame(corPvalueStudent(as.matrix(geneModuleMembership), nSamples))
names(geneModuleMembership) <- paste0("MM", modNames)
names(MMPvalue) <- paste0("p.MM", modNames)
### 计算性状与基因的相关性矩阵
## Gene significance,GS:比较样本某个基因与对应表型的相关性
## 连续型性状
# trait <- datTraits$groupNo
## 非连续型性状,需转为0-1矩阵, 已存于design中
trait <- as.data.frame(design[,choose_group])
geneTraitSignificance <- as.data.frame(cor(datExpr,trait,use = "p"))
GSPvalue <- as.data.frame(corPvalueStudent(as.matrix(geneTraitSignificance),nSamples))
names(geneTraitSignificance) <- paste0("GS")
names(GSPvalue) <- paste0("GS")
### 可视化基因与模块、表型的相关性.
#selectModule<-c("blue","green","purple","grey") ##可以选择自己想要的模块
selectModule <- modNames ## 全部模块批量作图
pdf("step4_gene-Module-trait-significance.pdf",width=7, height=1.5*ncol(MEs))
par(mfrow=c(ceiling(length(selectModule)/2),2)) #批量作图开始
for(module in selectModule){
column <- match(module,selectModule)
print(module)
moduleGenes <- moduleColors==module
verboseScatterplot(abs(geneModuleMembership[moduleGenes, column]),
abs(geneTraitSignificance[moduleGenes, 1]),
xlab = paste("Module Membership in", module, "module"),
ylab = "Gene significance for trait",
main = paste("Module membership vs. gene significance\n"),
cex.main = 1.2, cex.lab = 1.2, cex.axis = 1.2, col = module)
}
dev.off()
}
step4_Module-trait-relationship_heatmap.pdf
step4_Module-trait-relationship_boxplot.pdf
step4_gene-Module-trait-significance.pdf
⑤ WGCNA可视化:TOMplot 、Eigengene-adjacency-heatmap
WGCNA的标配热图TOMplot / Network heapmap plot,描绘了分析中所有基因之间的拓扑重叠矩阵(TOM),颜色越深表示基因之间的相关性越大;
Eigengene-adjacency-heatmap 展示基因模块之间的相关性,还可以加入表型信息,查看表型与哪些模块相关性高
######################### 5. WGCNA可视化:TOMplot Eigengene-adjacency-heatmap ##################################
rm(list = ls())
load(file = 'step1_input.Rdata')
load(file = "step2_power_value.Rdata")
load(file = "step3_genes_modules.Rdata")
load(file = "step4_design.Rdata")
if(T){
TOM=TOMsimilarityFromExpr(datExpr,power=power)
dissTOM=1-TOM
## draw all genes
if(T){
geneTree = net$dendrograms[[1]]
plotTOM = dissTOM^7
diag(plotTOM)=NA
png("step5_TOMplot_Network-heatmap.png",width = 800, height=600)
TOMplot(plotTOM,geneTree,moduleColors,
col=gplots::colorpanel(250,'red',"orange",'lemonchiffon'),
main="Network heapmap plot")
dev.off()
}
### draw selected genes to save time...just for test...
if(F){
nSelect =0.1*nGenes
set.seed(123)
select=sample(nGenes,size = nSelect)
selectTOM = dissTOM[select,select]
selectTree = hclust(as.dist(selectTOM),method = "average")
selectColors = moduleColors[select]
plotDiss=selectTOM^7
diag(plotDiss)=NA
pdf("step5_select_TOMplot_Network-heatmap.pdf",width=8, height=6)
TOMplot(plotDiss,selectTree,selectColors,
col=gplots::colorpanel(250,'red',"orange",'lemonchiffon'),
main="Network heapmap plot of selected gene")
dev.off()
}
}
### 模块相关性展示 Eigengene-adjacency-heatmap
if(T){
MEs = moduleEigengenes(datExpr,moduleColors)$eigengenes
MET = orderMEs(MEs)
# 若添加表型数据
if(T){
## 连续型性状
# MET = orderMEs(cbind(MEs,datTraits$groupNo))
## 非连续型性状,需将是否属于这个表型进行0,1数值化,已存于design中
design
primed = as.data.frame(design[,3])
names(primed) = "primed"
# Add the weight to existing module eigengenes
MET = orderMEs(cbind(MEs, primed))
}
pdf("step5_module_cor_Eigengene-dendrogram.pdf",width = 8,height = 10)
plotEigengeneNetworks(MET, setLabels="",
marDendro = c(0,4,1,4), # 留白:下右上左
marHeatmap = c(5,5,1,2), # 留白:下右上左
cex.lab = 0.8,
xLabelsAngle = 90)
dev.off()
}
step5_select_TOMplot_Network-heatmap.pdf
step5_module_cor_Eigengene-dendrogram.pdf
⑥ 对感兴趣模块的基因进行GO分析
这里选取了与表型相关度较高的"black","pink","royalblue"模块进行GO分析
#################### 6. 选择感兴趣基因模块进行GO分析 ####################
rm(list = ls())
load(file = 'step1_input.Rdata')
load(file = "step2_power_value.Rdata")
load(file = "step3_genes_modules.Rdata")
load(file = "step4_design.Rdata")
### 条件设置
OrgDb = "org.Mm.eg.db" # "org.Mm.eg.db" "org.Hs.eg.db"
genetype = "SYMBOL" # "SYMBOL" "ENSEMBL"
table(moduleColors)
choose_module <- c("black","pink","royalblue")
if(T){
library(clusterProfiler)
library(org.Mm.eg.db)
library(org.Hs.eg.db)
gene_module <- data.frame(gene=colnames(datExpr),
module=moduleColors)
write.csv(gene_module,file = "step6_gene_moduleColors.csv",row.names = F, quote = F)
tmp <- bitr(gene_module$gene,fromType = genetype, # "SYMBOL" "ENSEMBL"
toType = "ENTREZID",
OrgDb = OrgDb )
gene_module_entrz <- merge(tmp,gene_module, by.x=genetype, by.y="gene")
choose_gene_module_entrz <- gene_module_entrz[gene_module_entrz$module %in% choose_module,]
###run go analysis
formula_res <- compareCluster(
ENTREZID~module,
data = choose_gene_module_entrz,
fun = "enrichGO",
OrgDb = OrgDb,
ont = "BP", #One of "BP", "MF", and "CC" or "ALL"
pAdjustMethod = "BH",
pvalueCutoff = 0.25,
qvalueCutoff = 0.25
)
###精简GO富集的结果,去冗余
lineage1_ego <- simplify(
formula_res,
cutoff=0.5,
by="p.adjust",
select_fun=min
)
save(gene_module, formula_res, lineage1_ego, file="step6_module_GO_term.Rdata")
write.csv(lineage1_ego@compareClusterResult,
file="step6_module_GO_term.csv")
### 绘制dotplot图
dotp <- dotplot(lineage1_ego,
showCategory=10,
includeAll = TRUE, #将有overlap的结果也展示出来
label_format=90)
ggsave(dotp,filename= "step6_module_GO_term.pdf", #device = cairo_pdf,
width = 12,
height = 15)
}
step6_module_GO_term.pdf
⑦ 感兴趣基因模块绘制热图
这里选取了与primed表型相关度很高的royalblue模块绘制热图
############################### 7.感兴趣基因模块绘制热图 ######################################
rm(list = ls())
load(file = 'step1_input.Rdata')
load(file = "step3_genes_modules.Rdata")
table(moduleColors)
module = "royalblue"
### 感兴趣模块画热图
if(T){
dat=datExpr[,moduleColors==module]
library(pheatmap)
n=t(scale(dat)) #对基因做scale,并转置表达矩阵为行为基因、列为样本形式
# n[n>2]=2
# n[n< -2]= -2
# n[1:4,1:4]
group_list=datTraits$group
ac=data.frame(g=group_list)
rownames(ac)=colnames(n)
pheatmap::pheatmap(n,
fontsize = 8,
show_colnames =T,
show_rownames = F,
cluster_cols = T,
annotation_col =ac,
width = 8,
height = 6,
angle_col=45,
main = paste0("module_",module,"-gene heatmap"),
filename = paste0("step7_module_",module,"_Gene-heatmap.pdf"))
}
step7_module_royalblue_Gene-heatmap.pdf
⑧ 导出基因至 VisANT 或 cytoscape
将感兴趣模块 "royalblue"的基因导出 VisANT或cytoscape,cytoscape作图请参阅RNA-seq入门实战(十):PPI蛋白互作网络构建(下)——Cytoscape软件的使用
################### 8.感兴趣模块基因导出 VisANT or cytoscape ######################
rm(list = ls())
load(file = 'step1_input.Rdata')
load(file = "step2_power_value.Rdata")
load(file = "step3_genes_modules.Rdata")
module = "royalblue"
if(T){
### 提取感兴趣模块基因名
gene <- colnames(datExpr)
inModule <- moduleColors==module
modgene <- gene[inModule]
### 模块对应的基因关系矩阵
TOM <- TOMsimilarityFromExpr(datExpr,power=power)
modTOM <- TOM[inModule,inModule]
dimnames(modTOM) <- list(modgene,modgene)
### 筛选连接度最大的top100基因
nTop = 100
IMConn = softConnectivity(datExpr[, modgene]) #计算连接度
top = (rank(-IMConn) <= nTop) #选取连接度最大的top100
filter_modTOM <- modTOM[top, top]
# for visANT
vis <- exportNetworkToVisANT(filter_modTOM,
file = paste("step8_visANTinput-",module,".txt",sep = ""),
weighted = T,threshold = 0)
# for cytoscape
cyt <- exportNetworkToCytoscape(filter_modTOM,
edgeFile = paste("step8_CytoscapeInput-edges-", paste(module, collapse="-"), ".txt", sep=""),
nodeFile = paste("step8_CytoscapeInput-nodes-", paste(module, collapse="-"), ".txt", sep=""),
weighted = TRUE,
threshold = 0.15, #weighted权重筛选阈值,可调整
nodeNames = modgene[top],
nodeAttr = moduleColors[inModule][top])
}
将文件导入cytoscape后,可作图如下:
network connection in royalblue module
参考资料
WGCNA官方说明文档:WGCNA: R package for performing Weighted Gene Co-expression Network Analysis (ucla.edu)
WGCNA文章链接:
算法—— A general framework for weighted gene co-expression network analysis - PubMed (nih.gov)
R包——WGCNA: an R package for weighted correlation network analysis - PMC (nih.gov)
官方英文教程:
GBMTutorialHorvath.doc (ucla.edu)
YeastTutorialHorvath.doc (ucla.edu)
FemaleLiver-02-networkConstr-man.pdf (ucla.edu)(含分步法构建加权共表达网络)
注意事项:
WGCNA package: Frequently Asked Questions (ucla.edu)
优秀的中文教程:GEO/task4-NPC at master · jmzeng1314/GEO · GitHub
一文看懂WGCNA 分析(2019更新版)手把手10分文章WGCNA复现:小胶质细胞亚群在脑发育时髓鞘形成的作用
GitHub - jmzeng1314/my_WGCNA
WGCNA分析,简单全面的最新教程