看代码学画图 - Nobel化学奖的获得者多半投胎在那些国家?
这是Antoine(一个博士生)做的一个可视化,用了我的ggimage
包(多次在本公众号介绍过),并且在twitter上AT了我,我看到小国旗都严重变形了,就给他来个PR,修复一下。讲真ggimage
用图片来当画图元素,可以说图片是不会变形的,只要你用coord_equal
或者coord_fixed
,但在实际中,我们不会用这两个,并且图片长宽比会根据我们的需要而设定,这样子用于画图的图片等同于就被拉扯了,于是长宽比就不是原来的,而图片就变形了,如同上面的小国旗。所以我在ggimage
里提供了一个参数,asp
,顾名思义用来设定aspect ratio,于是你可以拯救变形的小图片。最后我还给了它另外一个参数,我们拭目以待,看他最后将给我们带来什么样的应用。
这里我给出他的代码的解读,帮助大家理解。
library(LaCroixColoR)
library(tidyverse)
library(ggthemes)
library(ggimage)
LaCroixColoR
这个包是用来配色,library(ggthemes)
包用来设主题,用了华尔街日报的主题theme_wsj
,我的ggimage
包用来画小国旗。tidyverse
当然是用来处理数据。
数据
数据在 https://github.com/abichat/tidytuesday/.
nobel_winners <- read_csv("data/data_2019-05-14.csv",
col_types = "dccccdccDccccccDcc")
大概长这样子:
> head(nobel_winners)
# A tibble: 6 x 18
prize_year category prize motivation prize_share laureate_id laureate_type
<dbl> <chr> <chr> <chr> <chr> <dbl> <chr>
1 1901 Chemist~ The ~ "\"in rec~ 1/1 160 Individual
2 1901 Literat~ The ~ "\"in spe~ 1/1 569 Individual
3 1901 Medicine The ~ "\"for hi~ 1/1 293 Individual
4 1901 Peace The ~ <NA> 1/2 462 Individual
5 1901 Peace The ~ <NA> 1/2 463 Individual
6 1901 Physics The ~ "\"in rec~ 1/1 1 Individual
# ... with 11 more variables: full_name <chr>, birth_date <date>,
# birth_city <chr>, birth_country <chr>, gender <chr>,
# organization_name <chr>, organization_city <chr>,
# organization_country <chr>, death_date <date>, death_city <chr>,
# death_country <chr>
画小国旗用的不是国家的全称,而是代号,所以这里生成一个对应的代号,用了tribble
这个可以很方便地按照row来输入一个tibble
,前面用~
开头的是column name,想想以前要按照row来输入的话,还挺麻烦,因为data.frame是column based的。
df_countrycode <-
tribble(~country, ~code,
"United States of America", "US",
"Germany", "DE",
"United Kingdom", "GB",
"France", "FR",
"Japan", "JP",
"Netherlands", "NL",
"Sweden", "SE",
"Russia", "RU",
"Canada", "CA",
"Austria", "AT")
表格
countries <-
nobel_winners %>%
filter(category == "Chemistry") %>%
count(birth_country, sort = TRUE) %>%
head(n = 10) %>%
pull(birth_country)
这里只看化学奖,于是过滤一下,按照出生的国家进行计数并排序,只拿人数最多的10个国家,用pull
获取出生国家,而不管计数,出来结果如下。PS,pull
相当于[[
去获取某个column。
> countries
[ ] "United States of America" "Germany"
[ ] "United Kingdom" "France"
[ ] "Japan" "Netherlands"
[ ] "Sweden" "Austria"
[ ] "Canada" "Russia"
获奖者数目:
nobel_counts <-
nobel_winners %>%
filter(birth_country %in% countries, category == "Chemistry") %>%
select(prize_year, birth_country) %>%
arrange(prize_year) %>%
group_by(birth_country) %>%
mutate(n_prize = n(),
first_prize = min(prize_year),
last_prize = max(prize_year),
cum = row_number()) %>%
ungroup()
从出生国家中过滤出上面排名前10的国家,并且只选择化学奖,只保留获奖年份和出生国家这两个信息,按照获奖年份排序,按照出生国家分组,用mutate
加一些相当的column,n_prize是每个国家获奖的次数,first_prize是第一次获奖的年份,last_prize是最后一次获奖的年份,cum是所有获奖次数的累积加和,也就是每一个分组之后排序的row number。
> nobel_counts
nobel_counts
# A tibble: 151 x 6
prize_year birth_country n_prize first_prize last_prize cum
<dbl> <fct> <int> <dbl> <dbl> <dbl>
1 1901 Netherlands 5 1901 2016 0
2 1901 Netherlands 5 1901 2016 1
3 1903 Sweden 5 1903 2015 0
4 1903 Sweden 5 1903 2015 1
5 1906 France 10 1906 2016 0
6 1906 France 10 1906 2016 1
7 1912 France 10 1906 2016 2
8 1912 France 10 1906 2016 3
9 1913 France 10 1906 2016 4
10 1914 United States of America 57 1914 2015 0
# ... with 141 more rows
对上面的数据再做处理:
nobel_counts <-
nobel_counts %>%
filter(cum == 1) %>%
mutate(cum = 0) %>%
bind_rows(nobel_counts) %>%
arrange(prize_year, cum) %>%
mutate(birth_country = fct_reorder(birth_country, n_prize, .desc = TRUE))
把cum==1,也就是首次获奖的记录,记为cum = 0,再用bind_rows
添加到原来的数据集上,按照prize_year
和cum
排序,按照获奖次数设定出生国家的factor levels。
first_last_nobel <-
nobel_counts %>%
select(birth_country, n_prize, first_prize, last_prize) %>%
unique() %>%
left_join(df_countrycode, by = c("birth_country" = "country"))
再整理一个首次和最后下次获奖的记录,这只是为了画图额外增加的,在最后的点上画小国旗,而在第一个点画上点,而这个第一个点,其实不是第一个点,而是y=0,做为所以线条的起点,这也是前面加cum=0的记录的原因,让线条把0和首次获奖1之前不中断。
画图
ggplot(nobel_counts) +
aes(x = prize_year, y = cum, group = birth_country) +
geom_line(aes(color = birth_country)) +
geom_point(data = first_last_nobel, y = 0,
aes(x = first_prize, color = birth_country)) +
geom_flag(data = first_last_nobel, size = 0.03, asp=2,
aes(x = last_prize, y = n_prize, image = code)) +
scale_color_manual(values = lacroix_palette("PeachPear", n = 10, type = "continuous")) +
scale_y_continuous(limits = c(NA, 60)) +
labs(title = "Number of chemistry Nobel prizes by birth country", color = NULL,
caption = "Source: The Nobel Prize\n@_abichat for #TidyTuesday") +
theme_wsj(color = "gray") +
theme(legend.position = "bottom",
plot.caption = element_text(size = 12),
plot.title = element_text(size = 25),
legend.text = element_text(family = "mono"))
ggsave("plots/plot_2019-05-14.png", width = 29, height = 21, units = "cm", dpi = "retina")
geom_line
画每个国家累积获奖次数的线条,geom_point
加0点,geom_flag
最高点加小国旗。scale
设置标尺,labs设标题,theme设主题,最后的图是这样子的:
轮到你了
你起码可以换一个奖项来画图吧?比如说医学奖拿来试试:
> select(nobel_winners, category) %>% unique
# A tibble: 6 x 1
category
<chr>
1 Chemistry
2 Literature
3 Medicine
4 Peace
5 Physics
6 Economics
往期精彩