R from Beginner to Proficient in 21 Days | Day 4: Data Visualization with ggplot2

Original article by RStata, 2023-10-24

To help you get more out of this article, training-class members are welcome to join the live session at 9 pm tomorrow: "Data Visualization with ggplot2".

The session is lesson 3 of the "R for Data Science" course. Course home page:
https://rstata.duanshu.com/#/brief/course/229b770183e44fbbb64133df929818ec

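This is the third lecture in the "R for Data Science" course series: data visualization. It gives a brief introduction to the basics of ggplot2. By the end you should be able to draw the common statistical charts with ggplot2. Let's get started!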

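The reason to learn ggplot2 is that it is generally easier to use than R's base graphics system and its output tends to look better. ggplot2 builds plots on the grammar of graphics. Roughly speaking, a ggplot2 chart is drawn layer by layer, and the benefit of this approach is that you can extend a chart simply by adding more layers.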
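The layer-by-layer idea can be sketched in a few lines. This is only an illustration, not part of the original lesson: it assumes R's built-in mtcars data rather than the auto.dta file used below, and the names p0, p1, and p2 are made up for the example.

```r
library(ggplot2)

# Assumption: built-in mtcars stands in for the lesson's data.
p0 <- ggplot(mtcars, aes(x = wt, y = mpg))   # data + mappings, no layers yet
p1 <- p0 + geom_point()                      # layer 1: points
p2 <- p1 + geom_smooth(method = "lm")        # layer 2: a fitted line on top

# Each `+` with a geom appends one layer to the plot object.
length(p0$layers)
length(p1$layers)
length(p2$layers)

p2  # printing draws the layers in the order they were added
```

Because each intermediate object is itself a complete plot, you can print p1 and p2 side by side to see what a single layer contributes.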

Preparation

library(tidyverse)  # attaches ggplot2 along with dplyr, tidyr, purrr, etc.

To make things easier for Stata users, I will use the auto.dta dataset to demonstrate ggplot2:

haven::read_dta('auto.dta') -> auto
auto
#> # A tibble: 74 × 12
#>    make      price   mpg rep78 headr…¹ trunk weight length  turn displ…² gear_…³
#>    <chr>     <dbl> <dbl> <dbl>   <dbl> <dbl>  <dbl>  <dbl> <dbl>   <dbl>   <dbl>
#>  1 AMC Conc…  4099    22     3     2.5    11   2930    186    40     121    3.58
#>  2 AMC Pacer  4749    17     3     3      11   3350    173    40     258    2.53
#>  3 AMC Spir…  3799    22    NA     3      12   2640    168    35     121    3.08
#>  4 Buick Ce…  4816    20     3     4.5    16   3250    196    40     196    2.93
#>  5 Buick El…  7827    15     4     4      20   4080    222    43     350    2.41
#>  6 Buick Le…  5788    18     3     4      21   3670    218    43     231    2.73
#>  7 Buick Op…  4453    26    NA     3      10   2230    170    34     304    2.87
#>  8 Buick Re…  5189    20     3     2      16   3280    200    42     196    2.93
#>  9 Buick Ri… 10372    16     3     3.5    17   3880    207    43     231    2.93
#> 10 Buick Sk…  4082    19     3     3.5    13   3400    200    42     231    3.08
#> # … with 64 more rows, 1 more variable: foreign <dbl+lbl>, and abbreviated
#> #   variable names ¹headroom, ²displacement, ³gear_ratio


attributes(auto$make)
#> $label
#> [1] "Make and Model"
#> 
#> $format.stata
#> [1] "%-18s"

Creating a ggplot2 plot

ggplot(data = auto) + 
  geom_point(mapping = aes(x = weight, y = price))

Aesthetic mappings

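The example above only maps x and y, but many other mappings are available. For instance, we can use point size to represent another variable, rep78, which is the car's 1978 repair record (a rating from 1 to 5):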

ggplot(data = auto) + 
  geom_point(mapping = aes(x = weight, y = price, size = rep78))

Point color works as well:

ggplot(data = auto) + 
  geom_point(mapping = aes(x = weight, y = price, color = rep78))

See ?geom_point for the full list of aesthetics that can be mapped.

Besides mapping a variable, we can also set a property of a layer to a fixed value, for example setting the point color to "blue":

ggplot(data = auto) + 
  geom_point(mapping = aes(x = weight, y = price), color = "blue")

ggplot2 supports a wide range of point shapes. We can draw a chart to show them:

1:5 %>% 
  crossing(1:5) %>% 
  set_names(c("x", "y")) %>% 
  mutate(z = as.numeric(row.names(.))) %>% 
  ggplot(aes(x, y, shape = I(z))) + 
  geom_point(size = 5) + 
  geom_label(aes(label = z, y = y + 0.5)) + 
  theme(axis.text = element_blank(),
        axis.title = element_blank())

Questions to think about:

Why are the points drawn by the following code not blue?

ggplot(data = auto) + 
  geom_point(mapping = aes(x = weight, y = price, color = "blue"))
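One way to investigate this question is to look at the colour values ggplot2 actually draws with. The sketch below is an illustration under assumptions: it uses the built-in mtcars data instead of auto.dta, and the names p_mapped and p_set are made up for the example.

```r
library(ggplot2)

# Assumption: built-in mtcars stands in for the auto data.
# Inside aes(): "blue" is treated as data, a one-level variable that is
# sent through the colour scale like any other variable.
p_mapped <- ggplot(mtcars) +
  geom_point(aes(x = wt, y = mpg, color = "blue"))

# Outside aes(): "blue" is used literally as the colour of every point.
p_set <- ggplot(mtcars) +
  geom_point(aes(x = wt, y = mpg), color = "blue")

# layer_data() returns the values ggplot2 draws with.
unique(layer_data(p_mapped)$colour)  # a colour picked by the scale
unique(layer_data(p_set)$colour)     # "blue"
```

The mapped version also produces a legend with a single entry labelled "blue", which is a quick visual hint that the string was treated as data.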

Which property of the points does stroke control?

ggplot(data = subset(auto, !is.na(rep78))) + 
  geom_point(mapping = aes(x = weight, y = price, stroke = rep78), shape = 1)

Compare it with the size aesthetic:

ggplot(data = subset(auto, !is.na(rep78))) + 
  geom_point(mapping = aes(x = weight, y = price, size = rep78), shape = 1)

Faceting

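Faceting means splitting one dataset by some variable, drawing a small plot for each subset, and then combining the small plots. It is usually done with facet_wrap() and facet_grid().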

facet_wrap:

ggplot(auto) + 
  geom_point(aes(weight, price), shape = 1) +
  facet_wrap(~foreign, nrow = 1)

auto %>% 
  mutate(foreign = factor(foreign, 
                          levels = 0:1,
                          labels = c("Domestic", "Foreign"))) %>% 
  ggplot() + 
  geom_point(aes(weight, price), shape = 1) +
  facet_wrap(~foreign, nrow = 1)

Several variables can also be combined:

auto %>% 
  mutate(foreign = factor(foreign, 
                          levels = 0:1,
                          labels = c("Domestic", "Foreign"))) %>% 
  ggplot() + 
  geom_point(aes(weight, price), shape = 1) +
  facet_wrap(foreign ~ rep78, nrow = 4)

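facet_wrap() draws every combination as a separate panel with its own strip label and spacing, which is why the figure above looks so crowded. When two variables are crossed like this, facet_grid() is usually the better fit: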

facet_grid:

auto %>% 
  mutate(foreign = factor(foreign, 
                          levels = 0:1,
                          labels = c("Domestic", "Foreign"))) %>% 
  ggplot() + 
  geom_point(aes(weight, price), shape = 1) +
  facet_grid(foreign ~ rep78)
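The difference between the two facet functions can be sketched by counting panels. This is an illustration under assumptions: the toy data frame and the names toy, p_base, p_wrap, and p_grid are hypothetical, and the layout table read at the end is an internal structure of the built plot, so treat it as a peek under the hood rather than a stable interface.

```r
library(ggplot2)

# Hypothetical toy data: the combination g1 = "b", g2 = 1 never occurs.
toy <- data.frame(
  x  = 1:6,
  y  = c(2, 4, 3, 5, 1, 6),
  g1 = c("a", "a", "a", "b", "b", "b"),
  g2 = c(1, 2, 3, 2, 3, 3)
)

p_base <- ggplot(toy, aes(x, y)) + geom_point()
p_wrap <- p_base + facet_wrap(g1 ~ g2)  # one panel per combination found in the data
p_grid <- p_base + facet_grid(g1 ~ g2)  # full rows-by-columns grid, empty cells included

# The layout table has one row per panel.
nrow(ggplot_build(p_wrap)$layout$layout)
nrow(ggplot_build(p_grid)$layout$layout)
```

In the grid version each level of g1 gets one row and each level of g2 one column, so the strip labels appear once per row and column instead of once per panel.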

Multiple layers

ggplot(auto) + 
  geom_point(aes(weight, price), shape = 1) + 
  geom_smooth(aes(weight, price))

If several layers share the same mappings, you can move them into ggplot() so that every layer inherits them:

ggplot(auto, aes(weight, price)) + 
  geom_point(shape = 1) + 
  geom_smooth()

Individual layers can still add mappings of their own:

ggplot(auto, aes(weight, price)) + 
  geom_point(shape = 1) + 
  geom_smooth(aes(linetype = factor(foreign),
                  color = factor(foreign)))

Compare the following three plots:

library(patchwork)
ggplot(auto, aes(weight, price)) + 
  geom_point(shape = 1) + 
  geom_smooth() -> p1
ggplot(auto, aes(weight, price)) + 
  geom_point(shape = 1) + 
  geom_smooth(aes(group = factor(foreign))) -> p2
ggplot(auto, aes(weight, price)) + 
  geom_point(shape = 1) + 
  geom_smooth(aes(color = factor(foreign)),
              show.legend = FALSE) -> p3
p1 + p2 + p3

Different layers can also use different data:

ggplot(auto, aes(weight, price)) + 
  geom_point(shape = 1) + 
  geom_smooth(data = subset(auto, foreign == 1), se = FALSE)

Questions to think about:

How do you draw line charts, box plots, histograms, and area charts?

ggplot(auto, aes(weight, price)) + 
  geom_line()

ggplot(auto, aes(factor(rep78), price)) + 
  geom_boxplot()

ggplot(auto, aes(price)) + 
  geom_histogram()

ggplot(auto, aes(weight, price)) + 
  geom_area()

Look back at the plots earlier in this tutorial and try drawing similar ones with other variables from the auto dataset.

Statistical transformations

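I find statistical transformations the most puzzling part of ggplot2. Learn as much of it as you can, but it is not a problem if it does not click, because you can always compute the statistics first and then plot them. Take this bar chart, for example: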

ggplot(auto) + 
  geom_bar(aes(x = rep78))

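A statistical transformation is at work here. The idea is simple: the values shown in the plot are not the raw values in the dataset but values computed from them. If we have not mastered statistical transformations, what can we do instead?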

auto %>% 
  count(rep78) %>% 
  ggplot() + 
  geom_bar(aes(x = rep78, y = n), stat = 'identity')

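The result is the same. The advantage of the statistical transformation is also clear, though: it is concise and quick, almost like magic.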
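The claim that both routes give the same bars can be checked numerically. The sketch below is an illustration under assumptions: it uses the built-in mtcars data instead of auto.dta, and the names p_stat, manual, and p_manual are made up for the example.

```r
library(ggplot2)
library(dplyr)

# Assumption: built-in mtcars stands in for the auto data.
p_stat <- ggplot(mtcars) +
  geom_bar(aes(x = cyl))            # stat_count() tallies the rows

manual <- count(mtcars, cyl)        # the same tally, done by hand
p_manual <- ggplot(manual) +
  geom_col(aes(x = cyl, y = n))     # geom_col() is geom_bar(stat = "identity")

# Bar heights computed by the stat vs. the precomputed counts
layer_data(p_stat)$count
manual$n
```

geom_col() is used here as the shorter spelling of geom_bar(stat = "identity"); either form draws the precomputed values unchanged.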

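Written out in full, the geom_bar() call above is equivalent to the following. Older code spells after_stat(count) as ..count.., a notation that has been deprecated since ggplot2 3.4.0:

ggplot(auto) + 
  geom_bar(aes(x = rep78, y = after_stat(count)), stat = "count")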

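To show the proportion of each group instead:

ggplot(auto) + 
  geom_bar(aes(x = rep78, y = after_stat(prop)))

Older code writes the same mapping as ..prop.. or stat(prop). Both forms still run, but they have been deprecated since ggplot2 3.4.0 in favor of after_stat():

ggplot(auto) + 
  geom_bar(aes(x = rep78, y = stat(prop)))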
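The proportion examples above work because rep78 is stored as a number, so all bars fall into a single group. A commonly encountered pitfall appears once x is discrete, for example factor(rep78). The sketch below illustrates it under assumptions: it uses the built-in mtcars data instead of auto.dta, and the names p_each and p_all are made up for the example.

```r
library(ggplot2)

# Assumption: built-in mtcars stands in for the auto data.
# Discrete x: each bar forms its own group, so every proportion is 1.
p_each <- ggplot(mtcars) +
  geom_bar(aes(x = factor(cyl), y = after_stat(prop)))

# group = 1 puts all bars in one group, so the proportions sum to 1.
p_all <- ggplot(mtcars) +
  geom_bar(aes(x = factor(cyl), y = after_stat(prop), group = 1))

layer_data(p_each)$prop
layer_data(p_all)$prop
```

If a proportion bar chart shows every bar at full height, a missing group = 1 is a likely cause.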

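The code above starts from a geom and lets a statistical transformation assist it. When the statistical transformation is the main point, the following style may read better: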

ggplot(auto) + 
  stat_summary(
    mapping = aes(rep78, price),
    fun.min = min,
    fun.max = max,
    fun = median,
    geom = "pointrange"
  )
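To see what stat_summary() computes here, the same chart can be built from a hand-made summary table. This is a sketch under assumptions: it uses the built-in mtcars data instead of auto.dta, and the names p_stat, manual, p_manual, lo, mid, and hi are made up for the example.

```r
library(ggplot2)
library(dplyr)

# Assumption: built-in mtcars stands in for the auto data.
p_stat <- ggplot(mtcars) +
  stat_summary(
    mapping = aes(cyl, mpg),
    fun.min = min, fun.max = max, fun = median,
    geom = "pointrange"
  )

# The same three numbers per group, computed by hand.
manual <- mtcars %>%
  group_by(cyl) %>%
  summarise(lo = min(mpg), mid = median(mpg), hi = max(mpg))

p_manual <- ggplot(manual) +
  geom_pointrange(aes(x = cyl, y = mid, ymin = lo, ymax = hi))

# Values drawn by the stat vs. the hand-made table
layer_data(p_stat)[, c("x", "ymin", "y", "ymax")]
manual
```

The point marks the median and the line runs from the minimum to the maximum within each x value.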

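Position adjustments

The bar charts we usually see come in three kinds: stacked (the default), dodged (bars side by side), and filled (stacked and scaled to 100%). The examples below start from a plain bar chart and also show position_identity(), which draws the bars of each group on top of one another: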

ggplot(auto) + 
  geom_bar(aes(x = rep78, color = factor(rep78), fill = factor(rep78)))

ggplot(auto) + 
  geom_bar(aes(x = rep78, color = factor(foreign), fill = factor(foreign)))

ggplot(auto) + 
  geom_bar(aes(x = rep78, color = factor(foreign), fill = factor(foreign)), 
           position = position_identity())

ggplot(auto) + 
  geom_bar(aes(x = rep78, color = factor(foreign), fill = factor(foreign)), 
           position = position_dodge())

ggplot(auto) + 
  geom_bar(aes(x = rep78, color = factor(foreign), fill = factor(foreign)), 
           position = position_fill())

There are many other position adjustments. For example, position_jitter() can ease overplotting in a scatter plot:

ggplot(auto) + 
  geom_point(aes(x = weight, y = price), position = position_jitter())
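Jittering matters most when many observations share exactly the same coordinates. The sketch below illustrates that case under assumptions: it uses the built-in mtcars data instead of auto.dta, and the names p_raw and p_jit as well as the width, height, and seed values are chosen only for the example.

```r
library(ggplot2)

# Assumption: built-in mtcars stands in for the auto data. cyl and gear take
# only a few distinct values, so many points sit exactly on top of each other.
p_raw <- ggplot(mtcars) +
  geom_point(aes(x = cyl, y = gear))

# width/height cap the random shift in each direction (in data units);
# seed makes the random shift repeatable.
p_jit <- ggplot(mtcars) +
  geom_point(aes(x = cyl, y = gear),
             position = position_jitter(width = 0.2, height = 0.1, seed = 1))

nrow(unique(layer_data(p_raw)[, c("x", "y")]))  # distinct positions before jittering
nrow(unique(layer_data(p_jit)[, c("x", "y")]))  # distinct positions after jittering
```

Keeping width and height small relative to the spacing between values separates the points without suggesting a spread that is not in the data.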

Coordinate systems

The Cartesian coordinate system is the most common one. A few others are also used often:

coord_flip(): swaps the x and y axes.

ggplot(auto, aes(factor(rep78), price)) + 
  geom_boxplot() + 
  coord_flip()

coord_fixed(): fixes the ratio between the units on the y axis and the units on the x axis.

ggplot(auto, aes(mpg, trunk)) + 
  geom_point() + 
  geom_abline() + 
  coord_fixed(ratio = 1)

coord_sf(): geographic coordinate systems.

For example, the following draws a map of confirmed COVID-19 cases worldwide on June 10, 2020:

library(sf)
library(ggplot2)
library(tidyverse)
wdmp <- read_sf('world_high_resolution_mill.geo.json')
df <- haven::read_dta('world-covid19.dta')
wdmp %>% 
  left_join(df, by = c("code" = "iso")) %>% 
  subset(date == "2020-06-10") %>% 
  ggplot() + 
  geom_sf(aes(fill = confirmed), color = "white", size = 0.01) + 
  coord_sf() + 
  guides(fill = guide_legend())

The map can also be made to look nicer (this will be covered in detail in a later lesson):

wdmp %>% 
  left_join(df, by = c("code" = "iso")) %>% 
  subset(date == "2020-06-10") %>% 
  ggplot() + 
  geom_sf(aes(fill = confirmed), color = "white", size = 0.01) + 
  scale_fill_viridis_c(option = 'A', direction = -1,
                       trans = 'log',
                       breaks = c(50, 1000, 20000, 500000)) + 
  # theme_ipsum(base_family = cnfont) + 
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.text.x = element_blank(),
        axis.text.y = element_blank()) + 
  guides(fill = guide_legend()) + 
  labs(fill = "Confirmed cases",
       title = "Global COVID-19 pandemic: 2020-06-10",
       caption = "Data source: Johns Hopkins University")

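coord_polar(): polar coordinates.

Careful readers will have noticed that nothing above explains how to draw a pie chart, even though it is a very common chart type. In fact ggplot2 has no dedicated pie chart layer: a pie chart is simply a bar chart drawn in polar coordinates: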

count(auto, rep78) %>% 
  ggplot() + 
  geom_col(aes(x = 1, y = n,
               fill = rep78),
           position = position_fill()) + 
  coord_polar(theta = 'y')

Polar coordinates also let us draw a rose chart:

ggplot(auto) + 
  geom_bar(aes(x = rep78,
               fill = factor(rep78), 
               color = factor(rep78)),
           width = 1) + 
  coord_polar()

Understanding how ggplot2 builds a plot

This part will be demonstrated directly during the live session.

That is all for today.

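Live session information

To help you get more out of the material above, training-class members are welcome to join the live session at 9 pm tomorrow: "Data Visualization with ggplot2".

Venue: Tencent Meeting (requires enrollment in the RStata training class)
Lecture materials: require enrollment in the RStata training class. For details, see the post 「一起来学习 R 语言和 Stata 啦！」 ("Let's learn R and Stata together!"). You can also ask questions at any time while you study.

For more information about RStata membership, add the WeChat ID r_stata.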
