基因表达聚类分析之初探SOM
之前的培训有老师提出要做SOM分析,最后卡在code plot
只能出segment plot
却出不来line plot
。查了下,没看到解决方案。今天看了下源码,设置了一个参数,得到趋势图。也顺便学习了SOM分析的整个过程,整理下来,以备以后用到。
更多聚类相关见:
SOM分析基本理论
SOM (Self-Organizing Feature Map,自组织特征图)是基于神经网络方式的数据矩阵和可视化方式。与其它类型的中心点聚类算法如K-means等相似,SOM也是找到一组中心点 (又称为codebook vector),然后根据最相似原则把数据集的每个对象映射到对应的中心点。在神经网络术语中,每个神经元对应于一个中心点。
与K-means类似,数据集中的每个对象每次处理一个,判断最近的中心点,然后更新中心点。与K-means不同的是,SOM中中心点之间存在拓扑形状顺序,在更新一个中心点的同时,邻近的中心点也会随着更新,直到达到设定的阈值或中心点不再有显著变化。最终获得一系列的中心点 (codes)隐式地定义多个簇,与这个中心点最近的对象归为同一个簇。
SOM强调簇中心点之间的邻近关系,相邻的簇之间相关性更强,更有利于解释结果,常用于可视化网络数据或基因表达数据。
2: repeat
3: Select the next object.
4: Determine the closest centroid to the object.
5: Update this centroid and the centroids that are close, i.e., in a specified neighborhood.
6: until The centroids don't change much or a threshold is exceeded.
7: Assign each object to its closest centroid and return the centroids and clusters.
SOM分析实战
下面是R中用kohonen
包进行基因表达数据的SOM分析。
加载或安装包
### LOAD LIBRARIES - install with:
#install.packages(c("kohonen")
library(kohonen)
读入数据并进行标准化
data <- read.table("ehbio_trans.Count_matrix.xls", row.names=1, header=T, sep="\t")
# now train the SOM using the Kohonen method
# 标准化数据
data_train_matrix <- as.matrix(t(scale(t(data))))
names(data_train_matrix) <- names(data)
head(data_train_matrix)
untrt_N61311 untrt_N052611 untrt_N080611 untrt_N061011 trt_N61311
ENSG00000223972 1.6201852 -0.5400617 -0.5400617 -0.5400617 -0.5400617
ENSG00000227232 -1.0711639 1.0274429 0.6776751 0.8525590 -1.2460478
ENSG00000278267 -1.6476479 1.3480756 0.1497862 0.7489309 -0.4493585
ENSG00000237613 2.4748737 -0.3535534 -0.3535534 -0.3535534 -0.3535534
ENSG00000238009 -0.3535534 -0.3535534 -0.3535534 -0.3535534 2.4748737
ENSG00000268903 -0.7020086 0.9025825 -0.7020086 -0.7020086 -0.7020086
trt_N052611 trt_N080611 trt_N061011
ENSG00000223972 1.6201852 -0.5400617 -0.5400617
ENSG00000227232 -1.2460478 0.5027912 0.5027912
ENSG00000278267 0.7489309 0.1497862 -1.0485032
ENSG00000237613 -0.3535534 -0.3535534 -0.3535534
ENSG00000238009 -0.3535534 -0.3535534 -0.3535534
ENSG00000268903 0.9025825 -0.7020086 1.7048781
训练SOM模型
# 定义网络的大小和形状
som_grid <- somgrid(xdim = 10, ydim=10, topo="hexagonal")
# Train the SOM model!
som_model <- supersom(data_train_matrix, grid=som_grid, keep.data = TRUE)
可视化SOM结果
# 展示训练过程,距离随着迭代减少的趋势,判断迭代是否足够;最后趋于平稳比较好
plot(som_model, type = "changes")
计量每个SOM中心点包含的基因的数目
## custom palette as per kohonen package (not compulsory)
coolBlueHotRed <- function(n, alpha = 0.7) {
rainbow(n, end=4/6, alpha=alpha)[n:1]
}
# shows the number of objects mapped to the individual units.
# Empty units are depicted in gray.
plot(som_model, type = "counts", main="Node Counts", palette.name=coolBlueHotRed)
计量SOM中心点的内敛性和质量
# map quality
# shows the mean distance of objects mapped to a unit to
# the codebook vector of that unit.
# The smaller the distances, the better the objects are
# represented by the codebook vectors.
plot(som_model, type = "quality", main="Node Quality/Distance", palette.name=coolBlueHotRed)
邻居距离-查看潜在边界点
# 颜色越深表示与周边点差别越大,越是分界点
# neighbour distances
# shows the sum of the distances to all immediate neighbours.
# This kind of visualization is also known as a U-matrix plot.
# Units near a class boundary can be expected to have higher average distances to their neighbours.
# Only available for the "som" and "supersom" maps, for the moment.
plot(som_model, type="dist.neighbours", main = "SOM neighbour distances", palette.name=grey.colors)
查看SOM中心点的变化趋势
#code spread
plot(som_model, type = "codes", codeRendering="lines")
获取每个SOM中心点相关的基因
table(som_model$unit.classif)
# 只显示一部分
1 2 3 4 5 6
197 172 434 187 582 249
95 96 97 98 99 100
168 919 226 419 193 241
# code是从左至右,从下至上进行编号的
som_model_code_class = data.frame(name=rownames(data_train_matrix), code_class=som_model$unit.classif)
head(som_model_code_class)
name code_class
1 ENSG00000223972 81
2 ENSG00000227232 37
3 ENSG00000278267 93
4 ENSG00000237613 51
5 ENSG00000238009 11
6 ENSG00000268903 4
SOM结果进一步聚类
# 选择合适的聚类数目
# show the WCSS metric for kmeans for different clustering sizes.
# Can be used as a "rough" indicator of the ideal number of clusters
mydata <- as.matrix(as.data.frame(som_model$codes))
wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var))
for (i in 2:15) wss[i] <- sum(kmeans(mydata, centers=i)$withinss)
par(mar=c(5.1,4.1,4.1,2.1))
plot(1:15, wss, type="b", xlab="Number of Clusters",
ylab="Within groups sum of squares", main="Within cluster sum of squares (WCSS)")
# Form clusters on grid
## use hierarchical clustering to cluster the codebook vectors
som_cluster <- cutree(hclust(dist(mydata)), 6)
# Colour palette definition
cluster_palette <- function(x, alpha = 0.6) {
n = length(unique(x)) * 2
rainbow(n, start=2/6, end=6/6, alpha=alpha)[seq(n,0,-2)]
}
cluster_palette_init = cluster_palette(som_cluster)
bgcol = cluster_palette_init[som_cluster]
#show the same plot with the codes instead of just colours
plot(som_model, type="codes", bgcol = bgcol, main = "Clusters", codeRendering="lines")
add.cluster.boundaries(som_model, som_cluster)
有一些类的模式不太明显,以后再看怎么优化。
SOM获取基因所在的新类
som_model_code_class_cluster = som_model_code_class
som_model_code_class_cluster$cluster = som_cluster[som_model_code_class$code_class]
head(som_model_code_class_cluster)
name code_class cluster
1 ENSG00000223972 81 2
2 ENSG00000227232 37 8
3 ENSG00000278267 93 8
4 ENSG00000237613 51 7
5 ENSG00000238009 11 4
6 ENSG00000268903 4 3
映射某个属性到SOM图
# 此处选择一个样本作为示例,可以关联很多信息,
# 比如基因通路,只要在矩阵后增加新的属性就可以。
color_by_var = names(data_train_matrix)[1]
color_by = data_train_matrix[,color_by_var]
unit_colors <- aggregate(color_by, by=list(som_model$unit.classif), FUN=mean, simplify=TRUE)
plot(som_model, type = "property", property=unit_colors[,2], main=color_by_var, palette.name=coolBlueHotRed)
R统计和作图
随机森林randomForest 分类Classification 回归Regression
随机森林randomForest 分类Classification 回归Regression
R语言可视化学习笔记之ggridges包
万能转换:R图和统计表转成发表级的Word、PPT、Excel、HTML、Latex、矢量图等
那天空飘过的梅花月饼,是今年最好的中秋礼物
高颜值免费在线绘图
往期精品
后台回复“生信宝典福利第一波”或点击阅读原文获取教程合集