R 语言中绘制桑基图、和弦图及冲积图的方法汇总
该推文在平台上有视频讲解:https://rstata.duanshu.com/#/brief/course/18344c7675fd4450b260da5398896436 结合视频讲解学习效果更加!
培训班有个小伙伴非常想学习桑基图的画法,所以我今天就搜罗下 R 语言里面绘制桑基图的一些方法。最推荐的方法是使用 ggalluvial 绘制。其他的方法仅供参考,大家根据自己的喜好学习即可。当然绘制桑基图的方法还有很多,本文介绍的几种是我觉得比较好用的。
构造示例数据
首先我们需要构造一个示例数据集用于接下来的演示,这里我使用的是我的微信好友数据里面的省份、城市、性别变量。这个数据可以用下面的 Python 脚本获取:
import itchat
import pandas as pd
itchat.auto_login(hotReload = True)
friends = itchat.get_friends(update = True)
friends = pd.DataFrame(friends)
friends.to_csv("friends.csv")
如果你的微信无法通过这种方式导出好友数据,可以直接使用我的 friends.csv 数据集(已经被处理过了):
library(hrbrthemes)
library(tidyverse)
library(ggplot2)
library(magrittr) # 管道操作符
# 使用 friends.csv 演示
read_csv("friends.csv") -> df
df 是这样的:
是否记得这是我之前介绍的一个小技巧,为 datatable 表格控件添加下载按钮。
df %>%
DT::datatable(
extensions = 'Buttons',
options = list(dom = 'Blfrtip',
buttons = c('copy', 'csv', 'excel',
'pdf', 'print'),
lengthMenu = list(c(10, 25, 50 ,-1),
c(10, 25, 50, "All"))))
使用 sankeywheel 绘制
需要注意,不要在同一个 Rmd 文档中使用 sankeywheel 包和 highcharter 包,否则会出现图表不显示的问题。
sankeywheel 包可以用来绘制桑基图,使用起来非常简单。这个包是基于 highcharts 和 htmlwidgets 构建的,不过后来由于 highcharter 包也支持桑基图和和弦图的绘制(后面也有介绍)了,所以就不再推荐使用这个包了。
# 不要安装 CRAN 上的,使用下面的方式安装
# devtools::install_local("sankeywheel_0.1.0.tar.gz")
library(sankeywheel)
df %>%
group_by(prov, gender) %>%
count() %>%
ungroup() -> df_count
df_count
#> # A tibble: 61 × 3
#> prov gender n
#> <chr> <chr> <int>
#> 1 上海 女孩 24
#> 2 上海 未知 1
#> 3 上海 男孩 38
#> 4 云南 女孩 2
#> 5 云南 男孩 6
#> 6 内蒙古 女孩 4
#> 7 北京 女孩 31
#> 8 北京 男孩 48
#> 9 台湾 男孩 2
#> 10 吉林 女孩 2
#> # ℹ 51 more rows
sankeywheel(
from = df_count$prov, to = df_count$gender,
weight = df_count$n, type = "sankey",
title = "我的微信好友分布",
subtitle = "微信公众号 RStata",
seriesName = "",
width = "800px",
height = "600px"
)
但是这样是不是有点太“长”了?我们可以把省份分开成左右两部分:
df_count <- rbind(
df_count %>%
slice(1:25) %>%
`colnames<-`(c("from", "to", "n")),
df_count %>%
slice(26:61) %>%
select(gender, prov, n) %>%
`colnames<-`(c("from", "to", "n"))
)
sankeywheel(
from = df_count$from, to = df_count$to,
weight = df_count$n, type = "sankey",
title = "我的微信好友分布",
subtitle = "微信公众号 RStata",
seriesName = ""
)
这个包还有另外一个功能,就是它也可以绘制和弦图。是绘制桑基图还是和弦图是有 type 参数决定的,type 参数的默认值是 “dependencywheel”,也就是说默认绘制的就是和弦图,之所以这样设置,是因为我觉得这个单词不好写:
sankeywheel(
from = df_count$from, to = df_count$to,
weight = df_count$n,
title = "我的微信好友分布",
subtitle = "微信公众号 RStata",
seriesName = ""
)
组合多个 HTML 控件可以使用 manipulateWidget 包:
library(manipulateWidget)
combineWidgets(
sankeywheel(
from = df_count$from, to = df_count$to,
weight = df_count$n, type = "sankey",
title = "我的微信好友分布",
subtitle = "微信公众号 RStata",
seriesName = ""
),
sankeywheel(
from = df_count$from, to = df_count$to,
weight = df_count$n,
title = "我的微信好友分布",
subtitle = "微信公众号 RStata",
seriesName = ""
),
byrow = TRUE, ncol = 2, width = "100%", height = "400px"
)
使用 ggalluvial 绘制
这个方法就非常重要了,大家一定要掌握。
首先设置 ggplot2 绘图字体,song.otf 是附件中的字体文件:
library(ggplot2)
library(ggalluvial)
library(hrbrthemes)
# 设置字体(大家可以把字体换成自己电脑上的,详情可以参考 R 语言数据科学第一课)
library(showtext)
showtext_auto(enable = TRUE)
font_add("songti",
regular = "song.otf")
# 设置 ggplot2 绘图主题
theme_set(theme_ipsum(base_family = "songti"))
导入 ggalluvial 包,对 df 变量进行分组计数并把返回的结果保存到 pg 数据框里面:
pg <- df %>%
count(prov, city, gender)
然后就可以绘制一幅基于 ggplot2 的桑基图了:
ggplot(pg, aes(
axis1 = prov, axis2 = gender,
axis3 = city, y = n
), size = 0.001) +
geom_stratum(width = 0.5, alpha = 0.2, size = 0.1) +
geom_alluvium(aes(fill = gender), width = 0.5) +
scale_fill_manual(values = c(
"男孩" = "#019875",
"女孩" = "#E84A5F",
"未知" = "#2A363B"
)) +
geom_text(
stat = "stratum",
infer.label = TRUE,
family = "songti", size = 3.5,
color = "#2A2A2A"
) +
scale_x_continuous(
breaks = 1:3,
labels = c("省份", "性别", "城市")
) +
labs(
y = "人数", title = "我的微信好友分布",
subtitle = "微信公众号 RStata"
) +
theme(
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
legend.position = "none",
panel.grid.minor = element_blank()
) -> p
p
我们可以通过下面的方式自定义 y 轴的标签:
# 修改 y 轴的标签
df %>%
count(prov) %>%
group_by(prov) %>%
summarise(value = sum(n)) %>%
pull(value) %>%
rev() %>%
cumsum() -> breaks
for (i in 2:length(breaks)) {
if (breaks[i] - breaks[i - 1] < 15) {
breaks[i - 1] <- NA
}
}
breaks <- breaks[!is.na(breaks)]
p +
scale_y_continuous(breaks = breaks)
大家一定注意到这个图存在很严重的标签重叠问题,有两个解决办法:
解决文本标签重合的方法 1: ggrepel::geom_text_repel
# 解决文本标签重合的方法 1: ggrepel::geom_text_repel
ggplot(pg, aes(
axis1 = prov, axis2 = gender,
axis3 = city, y = n
), size = 0.001) +
geom_stratum(width = 0.5, size = 0.1) +
geom_alluvium(aes(fill = gender), width = 0.5) +
scale_fill_manual(values = c(
"男孩" = "#019875",
"女孩" = "#E84A5F",
"未知" = "#2A363B"
)) +
ggrepel::geom_text_repel(
stat = "stratum",
infer.label = TRUE,
family = "songti", size = 3.5,
color = "#2A2A2A"
) +
scale_x_continuous(
breaks = 1:3,
labels = c("省份", "性别", "城市")
) +
labs(
y = "人数", title = "我的微信好友分布",
subtitle = "微信公众号 RStata"
) +
theme(
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
legend.position = "none",
panel.grid.minor = element_blank()
) +
scale_y_continuous(breaks = breaks)
解决文本标签重合的方法 2: ggfittext::geom_fit_text
# 解决文本标签重合的方法 2: ggfittext::geom_fit_text
ggplot(pg, aes(
axis1 = prov, axis2 = gender,
axis3 = city, y = n
), size = 0.001) +
geom_stratum(width = 0.5) +
geom_alluvium(aes(fill = gender), width = 0.5) +
scale_fill_manual(values = c(
"男孩" = "#019875",
"女孩" = "#E84A5F",
"未知" = "#2A363B"
)) +
ggfittext::geom_fit_text(
stat = "stratum",
infer.label = TRUE,
family = "songti", min.size = 0.1,
color = "#2A2A2A"
) +
scale_x_continuous(
breaks = 1:3,
labels = c("省份", "性别", "城市")
) +
labs(
y = "人数", title = "我的微信好友分布",
subtitle = "微信公众号 RStata"
) +
theme(
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
legend.position = "none",
panel.grid.minor = element_blank()
) +
scale_y_continuous(breaks = breaks)
ggalluvial 包的详细用法可以参考作者给出的参考文档:
vignette("ggalluvial")
使用 alluvial 绘制
alluvial 包是基于基础绘图系统封装的,似乎不容易解决文本标签相互重叠的问题:
# 使用 alluvial 绘制
library(alluvial)
library(basetheme)
pars <- basetheme("default")
pars$family <- "songti"
basetheme(pars)
pg %>%
`colnames<-`(c("省份", "城市", "性别", "数量")) %>%
arrange(数量) -> pg
pg
#> # A tibble: 248 × 4
#> 省份 城市 性别 数量
#> <chr> <chr> <chr> <int>
#> 1 上海 嘉定 男孩 1
#> 2 上海 松江 女孩 1
#> 3 上海 虹口 女孩 1
#> 4 上海 虹口 男孩 1
#> 5 上海 长宁 男孩 1
#> 6 上海 闸北 男孩 1
#> 7 上海 青浦 女孩 1
#> 8 上海 青浦 男孩 1
#> 9 上海 静安 男孩 1
#> 10 上海 黄浦 未知 1
#> # ℹ 238 more rows
alluvial(pg[,1:3], freq = pg$`数量`,
col = ifelse(pg$`性别` == "男孩",
"#019875",
ifelse(pg$`性别` == "女孩",
"#E84A5F", "#2A363B")),
border = "grey",
alpha = 0.7)
# 导出成 pdf 文件
cairo_pdf(filename = "pic1_10.pdf", width = 10, height = 8)
alluvial(pg[,1:3], freq = pg$`数量`,
col = ifelse(pg$`性别` == "男孩",
"#019875",
ifelse(pg$`性别` == "女孩",
"#E84A5F", "#2A363B")),
border = "grey",
alpha = 0.7)
dev.off()
使用 echarts4r 绘制
这个也蛮好用的:
library(echarts4r)
df_count
#> # A tibble: 61 × 3
#> from to n
#> <chr> <chr> <int>
#> 1 上海 女孩 24
#> 2 上海 未知 1
#> 3 上海 男孩 38
#> 4 云南 女孩 2
#> 5 云南 男孩 6
#> 6 内蒙古 女孩 4
#> 7 北京 女孩 31
#> 8 北京 男孩 48
#> 9 台湾 男孩 2
#> 10 吉林 女孩 2
#> # ℹ 51 more rows
df_count %>%
e_charts(width = "100%", height = "400px") %>%
e_sankey(from, to, n) %>%
e_title("我的微信好友分布",
textStyle = list("fontSize" = 20,
"fontFamily" = "STSong"),
textAlign = "middle", left = "50%") %>%
e_title("微信公众号 RStata",
textStyle = list("fontSize" = 18,
"fontFamily" = "STSong"),
textAlign = "middle",
left = "50%", top = "8%") %>%
e_theme("infographic")
使用 highcharter 绘制
由于 highcharter 包支持绘制桑基图和和弦图了,所以不再建议大家使用上面的 sankeywheel 包了,这个包绘制桑基图和和弦图也很简单:
library(highcharter)
df %>%
select(gender, prov) %>%
data_to_sankey() -> df2
highchart() %>%
hc_chart(type = "sankey") %>%
hc_add_series(data = df2, name = "") %>%
hc_title(text = "我的微信好友分布") %>%
hc_subtitle(text = "微信公众号 RStata") %>%
hc_credits(text = "微信公众号 RStata",
enabled = T) %>%
hc_add_theme(hc_theme_sandsignika(chart = list(
divBackgroundImage = NULL,
style = list(background = "url(https://www.highcharts.com/samples/graphics/sand.png)",
fontFamily = "Source Han Serif")
)))
highchart() %>%
hc_chart(type = "dependencywheel") %>%
hc_add_series(data = df2, name = "") %>%
hc_title(text = "我的微信好友分布") %>%
hc_subtitle(text = "微信公众号 RStata") %>%
hc_credits(text = "微信公众号 RStata",
enabled = T) %>%
hc_add_theme(hc_theme_sandsignika(chart = list(
divBackgroundImage = NULL,
style = list(background = "url(https://www.highcharts.com/samples/graphics/sand.png)",
fontFamily = "Source Han Serif")
)))
使用 ggsankey 包绘制
ggsankey 是 GitHub 上的一个 R 包,安装方法如下:
devtools::install_github("davidsjoberg/ggsankey")
首先加载所需的 R 包和准备数据:
library(ggsankey)
library(dplyr)
library(ggplot2)
library(tidyverse)
read_csv("friends.csv") %>%
make_long(gender, prov) -> df
然后就可以绘制桑基图了:
ggplot(df, aes(x = x,
next_x = next_x,
node = node,
next_node = next_node)) +
geom_sankey(aes(fill = factor(node)),
flow.alpha = 0.8,
node.color = "gray30") +
scale_fill_manual(values = c("#5050ff", "#ce3d32", "#749b58",
"#f0e685", "#466983", "#ba6338",
"#5db1dd", "#802268", "#6bd76b",
"#d595a7", "#924822", "#837b8d",
"#c75127", "#d58f5c", "#7a65a5",
"#e4af69", "#3b1b53", "#cddeb7",
"#612a79", "#ae1f63", "#e7c76f",
"#5a655e", "#cc9900", "#99cc00",
"#a9a9a9", "#cc9900", "#99cc00",
"#33cc00", "#00cc33", "#00cc99",
"#0099cc", "#0a47ff", "#4775ff",
"#ffc20a", "#ffd147")) +
geom_sankey_label(size = 3, color = "white",
family = "songti",
aes(label = factor(node),
fill = factor(node))) +
theme_sankey(base_size = 18, base_family = "songti") +
theme(legend.position = "none",
plot.title = element_text(hjust = 0, size = 18,
margin = margin(b = 10),
family = "songti"),
plot.subtitle = element_text(hjust = 0,
size = 12,
margin = margin(b = 15),
family = "songti"),
plot.margin = margin(30, 30, 30, 30)) +
labs(x = "", title = "我的微信好友分布",
subtitle = "微信公众号 RStata") +
scale_x_discrete(labels = c("性别", "省份"),
expand = c(0, 0))
另外也可以绘制冲积图:
# geom_alluvial
read_csv("friends.csv") %>%
make_long(prov, gender, city) -> df
ggplot(df, aes(x = x,
next_x = next_x,
node = node,
next_node = next_node)) +
geom_alluvial(aes(fill = factor(node)),
flow.alpha = 0.8,
node.color = "gray30") +
scale_fill_viridis_d() +
geom_alluvial_label(size = 3, color = "white",
family = "songti",
aes(label = factor(node),
fill = factor(node))) +
theme_sankey(base_size = 18, base_family = "songti") +
theme(legend.position = "none",
plot.title = element_text(hjust = 0, size = 18,
margin = margin(b = 10),
family = "songti"),
plot.subtitle = element_text(hjust = 0,
size = 12,
margin = margin(b = 15),
family = "songti"),
plot.margin = margin(30, 30, 30, 30)) +
labs(x = "", title = "我的微信好友分布",
subtitle = "微信公众号 RStata") +
scale_x_discrete(labels = c("性别", "省份"),
expand = c(0, 0))
还有一种特殊的桑基图:
# install.packages("gapminder")
library(gapminder)
gapminder %>%
group_by(continent, year) %>%
summarise(gdp = (sum(pop * gdpPercap)/1e9) %>% round(0), .groups = "keep") %>%
ungroup() -> df
ggplot(df, aes(x = year,
node = continent,
fill = continent,
value = gdp)) +
geom_sankey_bump(space = 0, type = "alluvial", color = "transparent", smooth = 6) +
scale_fill_viridis_d(option = "A", alpha = .8) +
labs(x = NULL,
y = "GDP (百万美元)",
fill = NULL,
color = NULL) +
theme(legend.position = "bottom") +
labs(title = "每个大洲的 GDP 增长情况",
subtitle = "微信公众号 RStata")
使用 ggSankeyGrad 包绘制带渐变色的桑基图
ggSankeyGrad 包也是 GitHub 上的,安装方法如下:
devtools::install_github("ssp3nc3r/ggSankeyGrad", ref = "master")
ggSankeyGrad() 函数需要至少四个参数,c1,c2, col1, col2,所以我们先把数据准备好:
library(ggSankeyGrad)
read_csv("friends.csv") %>%
count(prov, gender) -> df
# 生成颜色变量
tibble(
prov = unique(df$prov),
col1 = c("#5050ff", "#ce3d32", "#749b58", "#f0e685", "#466983", "#ba6338",
"#5db1dd", "#802268", "#6bd76b", "#d595a7", "#924822", "#837b8d",
"#c75127", "#d58f5c", "#7a65a5", "#e4af69", "#3b1b53", "#cddeb7",
"#612a79", "#ae1f63", "#e7c76f", "#5a655e", "#cc9900", "#99cc00",
"#a9a9a9", "#cc9900", "#99cc00", "#33cc00", "#00cc33", "#00cc99",
"#0099cc", "#0a47ff")
) -> df1
tibble(
gender = unique(df$gender),
col2 = c("#f6cf71", "#019868", "#ec0b88")
) -> df2
df %>%
left_join(df1) %>%
left_join(df2) -> df
df
#> # A tibble: 61 × 5
#> prov gender n col1 col2
#> <chr> <chr> <int> <chr> <chr>
#> 1 上海 女孩 24 #5050ff #f6cf71
#> 2 上海 未知 1 #5050ff #019868
#> 3 上海 男孩 38 #5050ff #ec0b88
#> 4 云南 女孩 2 #ce3d32 #f6cf71
#> 5 云南 男孩 6 #ce3d32 #ec0b88
#> 6 内蒙古 女孩 4 #749b58 #f6cf71
#> 7 北京 女孩 31 #f0e685 #f6cf71
#> 8 北京 男孩 48 #f0e685 #ec0b88
#> 9 台湾 男孩 2 #466983 #ec0b88
#> 10 吉林 女孩 2 #ba6338 #f6cf71
#> # ℹ 51 more rows
然后就可以绘图了:
with(df, ggSankeyGrad(c1 = prov,
c2 = gender,
col1 = col1,
col2 = col2,
values = n,
label = TRUE)) +
theme_ipsum(base_family = "songti", grid = F) -> p
p
由于作者在编写这个函数的时候没有提供设置标签字体的函数,所以我们需要深入 p 的内部修改 label 的字体:
for (i in 2:36) {
p$layers[[i]]$aes_params$family <- "songti"
}
p +
scale_x_continuous(breaks = c(0, 1),
limits = c(-0.2, 1.2),
labels = c("省份", "性别")) +
theme(axis.text.y = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank()) +
labs(title = "我的微信好友分布",
subtitle = "微信公众号 RStata")
这样我们就把标签的字体更正成宋体了。
获取数据
附件下载链接(点击文末的阅读原文即可跳转):https://rstata.duanshu.com/#/brief/course/18344c7675fd4450b260da5398896436
是不是感觉很硬核!欢迎报名 RStata 培训班获取全部课程和以会员价获取数据资料(10元/份)详情可阅读这篇推文:数据处理、图表绘制、效率分析与计量经济学如何学习~
详情可点击阅读原文进入 RStata 学院了解(从首页的会员卡专区即可查看和购买会员卡)。
更多关于 RStata 培训班的信息可添加微信号 r_stata 咨询: