dplyr强大的分组汇总
在现实生活中我们经常会遇到非常多需要分组汇总的情况,单个的汇总价值不大,只有分组之后,才能看出差异,才能表现出数据的价值。
dplyr
为我们提供了group_by()
函数,主要使用group_by()
对数据进行分组,然后再进行各种计算,通过和其他操作进行连接,发挥更加强大的作用。
group_by()
查看分组信息
增加或改变用于聚合的变量
移除聚合的变量
联合使用
summarise()
`select()`/`rename()`/`relocate()`
arrange()
`muatate()` and `transmutate()`
filter()
group_by()
先建立2个分组数据进行演示,还是使用星战数据集。
by_species <- starwars %>% group_by(species)
by_sex_gender <- starwars %>% group_by(sex, gender)
看看这两个对象有什么不同,可以看出和原数据集没什么不同,但是都被分组了!
by_species
## # A tibble: 87 x 14
## # Groups: species [38]
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke S~ 172 77 blond fair blue 19 male mascu~
## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu~
## 3 R2-D2 96 32 <NA> white, bl~ red 33 none mascu~
## 4 Darth ~ 202 136 none white yellow 41.9 male mascu~
## 5 Leia O~ 150 49 brown light brown 19 fema~ femin~
## 6 Owen L~ 178 120 brown, grey light blue 52 male mascu~
## 7 Beru W~ 165 75 brown light blue 47 fema~ femin~
## 8 R5-D4 97 32 <NA> white, red red NA none mascu~
## 9 Biggs ~ 183 84 black light brown 24 male mascu~
## 10 Obi-Wa~ 182 77 auburn, wh~ fair blue-gray 57 male mascu~
## # ... with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## # films <list>, vehicles <list>, starships <list>
by_sex_gender
## # A tibble: 87 x 14
## # Groups: sex, gender [6]
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke S~ 172 77 blond fair blue 19 male mascu~
## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu~
## 3 R2-D2 96 32 <NA> white, bl~ red 33 none mascu~
## 4 Darth ~ 202 136 none white yellow 41.9 male mascu~
## 5 Leia O~ 150 49 brown light brown 19 fema~ femin~
## 6 Owen L~ 178 120 brown, grey light blue 52 male mascu~
## 7 Beru W~ 165 75 brown light blue 47 fema~ femin~
## 8 R5-D4 97 32 <NA> white, red red NA none mascu~
## 9 Biggs ~ 183 84 black light brown 24 male mascu~
## 10 Obi-Wa~ 182 77 auburn, wh~ fair blue-gray 57 male mascu~
## # ... with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## # films <list>, vehicles <list>, starships <list>
使用tally()
函数进行计数:
by_species %>% tally(sort = T)
## # A tibble: 38 x 2
## species n
## <chr> <int>
## 1 Human 35
## 2 Droid 6
## 3 <NA> 4
## 4 Gungan 3
## 5 Kaminoan 2
## 6 Mirialan 2
## 7 Twi'lek 2
## 8 Wookiee 2
## 9 Zabrak 2
## 10 Aleena 1
## # ... with 28 more rows
和下面这个操作是一样的效果:
by_species %>% summarise(n=n())
## # A tibble: 38 x 2
## species n
## <chr> <int>
## 1 Aleena 1
## 2 Besalisk 1
## 3 Cerean 1
## 4 Chagrian 1
## 5 Clawdite 1
## 6 Droid 6
## 7 Dug 1
## 8 Ewok 1
## 9 Geonosian 1
## 10 Gungan 3
## # ... with 28 more rows
除了根据现有的变量进行分组外,还可以根据现有变量的函数进行分组,这样做类似于先mutate()
再group_by()
。
bmi_breaks <- c(0,18.5,25,30,Inf)
starwars %>%
group_by(bmi_cat = cut(mass/(height/100)^2,breaks = bmi_breaks)) %>%
tally(sort = T)
## # A tibble: 5 x 2
## bmi_cat n
## <fct> <int>
## 1 <NA> 28
## 2 (18.5,25] 24
## 3 (25,30] 13
## 4 (30,Inf] 12
## 5 (0,18.5] 10
是不是很神奇?
查看分组信息
group_keys()
查看用于分组的组内有哪些类别,可以看到species有38种:
by_species %>% group_keys()
## # A tibble: 38 x 1
## species
## <chr>
## 1 Aleena
## 2 Besalisk
## 3 Cerean
## 4 Chagrian
## 5 Clawdite
## 6 Droid
## 7 Dug
## 8 Ewok
## 9 Geonosian
## 10 Gungan
## # ... with 28 more rows
by_sex_gender %>% group_keys()
## # A tibble: 6 x 2
## sex gender
## <chr> <chr>
## 1 female feminine
## 2 hermaphroditic masculine
## 3 male masculine
## 4 none feminine
## 5 none masculine
## 6 <NA> <NA>
group_indices()
查看每一行属于哪个组:
by_species %>% group_indices() # which group each row belongs to
## [1] 11 6 6 11 11 11 11 6 11 11 11 11 34 11 24 12 11 11 36 11 11 6 31 11 11
## [26] 18 11 11 8 26 11 21 11 10 10 10 38 30 7 38 11 37 32 32 33 35 29 11 3 20
## [51] 37 27 13 23 16 4 11 11 11 9 17 17 11 11 11 11 5 2 15 15 11 1 6 25 19
## [76] 28 14 34 11 38 22 11 11 11 6 38 11
group_rows()
查看每个组包括哪些行:
by_species %>% group_rows() # which rows each group contains with
## <list_of<integer>[38]>
## [[1]]
## [1] 72
##
## [[2]]
## [1] 68
##
## [[3]]
## [1] 49
##
## [[4]]
## [1] 56
##
## [[5]]
## [1] 67
##
## [[6]]
## [1] 2 3 8 22 73 85
##
## [[7]]
## [1] 39
##
## [[8]]
## [1] 29
##
## [[9]]
## [1] 60
##
## [[10]]
## [1] 34 35 36
##
## [[11]]
## [1] 1 4 5 6 7 9 10 11 12 14 17 18 20 21 24 25 27 28 31 33 41 48 57 58 59
## [26] 63 64 65 66 71 79 82 83 84 87
##
## [[12]]
## [1] 16
##
## [[13]]
## [1] 53
##
## [[14]]
## [1] 77
##
## [[15]]
## [1] 69 70
##
## [[16]]
## [1] 55
##
## [[17]]
## [1] 61 62
##
## [[18]]
## [1] 26
##
## [[19]]
## [1] 75
##
## [[20]]
## [1] 50
##
## [[21]]
## [1] 32
##
## [[22]]
## [1] 81
##
## [[23]]
## [1] 54
##
## [[24]]
## [1] 15
##
## [[25]]
## [1] 74
##
## [[26]]
## [1] 30
##
## [[27]]
## [1] 52
##
## [[28]]
## [1] 76
##
## [[29]]
## [1] 47
##
## [[30]]
## [1] 38
##
## [[31]]
## [1] 23
##
## [[32]]
## [1] 43 44
##
## [[33]]
## [1] 45
##
## [[34]]
## [1] 13 78
##
## [[35]]
## [1] 46
##
## [[36]]
## [1] 19
##
## [[37]]
## [1] 42 51
##
## [[38]]
## [1] 37 40 80 86
group_vars()
查看用于聚合的变量名字:
by_sex_gender %>% group_vars() # the name of the grouping variable
## [1] "sex" "gender"
增加或改变用于聚合的变量
如果把group_by()
作用于已经聚合的变量,那数据会被覆盖,比如下面这个,by_species
已经被species
聚合了,再通过homeworld
聚合,那结果只是homeworld
的结果:
by_species %>%
group_by(homeworld) %>%
tally()
## # A tibble: 49 x 2
## homeworld n
## <chr> <int>
## 1 Alderaan 3
## 2 Aleen Minor 1
## 3 Bespin 1
## 4 Bestine IV 1
## 5 Cato Neimoidia 1
## 6 Cerea 1
## 7 Champala 1
## 8 Chandrila 1
## 9 Concord Dawn 1
## 10 Corellia 2
## # ... with 39 more rows
“是不是之前没注意过这些小问题?
通过使用一个参数可以避免这个问题:
by_species %>%
group_by(homeworld, .add = T) %>%
tally()
## # A tibble: 58 x 3
## # Groups: species [38]
## species homeworld n
## <chr> <chr> <int>
## 1 Aleena Aleen Minor 1
## 2 Besalisk Ojom 1
## 3 Cerean Cerea 1
## 4 Chagrian Champala 1
## 5 Clawdite Zolan 1
## 6 Droid Naboo 1
## 7 Droid Tatooine 2
## 8 Droid <NA> 3
## 9 Dug Malastare 1
## 10 Ewok Endor 1
## # ... with 48 more rows
移除聚合的变量
一个被聚合的数据如果不解除聚合,那么后面的操作都会以聚合后的结果呈现出来,所以聚合之后一定要记得解除聚合!
by_species %>%
ungroup() %>%
tally()
## # A tibble: 1 x 1
## n
## <int>
## 1 87
by_sex_gender %>%
ungroup(sex) %>%
tally()
## # A tibble: 3 x 2
## gender n
## <chr> <int>
## 1 feminine 17
## 2 masculine 66
## 3 <NA> 4
联合使用
下面这部分主要介绍group_by和其他函数的联合使用:
summarise()
by_species %>%
summarise(
n = n(),
height = mean(height, na.rm = TRUE)
)
## # A tibble: 38 x 3
## species n height
## <chr> <int> <dbl>
## 1 Aleena 1 79
## 2 Besalisk 1 198
## 3 Cerean 1 198
## 4 Chagrian 1 196
## 5 Clawdite 1 168
## 6 Droid 6 131.
## 7 Dug 1 112
## 8 Ewok 1 88
## 9 Geonosian 1 183
## 10 Gungan 3 209.
## # ... with 28 more rows
“control the grouping variables
通过.groups
参数控制聚合变量:
by_sex_gender %>%
summarise(n = n()) %>%
group_vars()
## `summarise()` has grouped output by 'sex'. You can override using the
## `.groups` argument.
## [1] "sex"
# 只通过sex进行聚合
by_sex_gender %>%
summarise(n = n(), .groups = "drop_last") %>%
group_vars()
## [1] "sex"
by_sex_gender %>%
summarise(n = n(), .groups = "keep") %>%
group_vars()
## [1] "sex" "gender"
# 不聚合了
by_sex_gender %>%
summarise(n = n(), .groups = "drop") %>%
group_vars()
## character(0)
select()
/rename()
/relocate()
by_species %>% select(mass) # grouped by species
## Adding missing grouping variables: `species`
## # A tibble: 87 x 2
## # Groups: species [38]
## species mass
## <chr> <dbl>
## 1 Human 77
## 2 Droid 75
## 3 Droid 32
## 4 Human 136
## 5 Human 49
## 6 Human 120
## 7 Human 75
## 8 Droid 32
## 9 Human 84
## 10 Human 77
## # ... with 77 more rows
by_species %>%
ungroup() %>%
select(mass)
## # A tibble: 87 x 1
## mass
## <dbl>
## 1 77
## 2 75
## 3 32
## 4 136
## 5 49
## 6 120
## 7 75
## 8 32
## 9 84
## 10 77
## # ... with 77 more rows
arrange()
by_species %>%
arrange(desc(mass)) %>%
relocate(species, mass)
## # A tibble: 87 x 14
## # Groups: species [38]
## species mass name height hair_color skin_color eye_color birth_year sex
## <chr> <dbl> <chr> <int> <chr> <chr> <chr> <dbl> <chr>
## 1 Hutt 1358 Jabba~ 175 <NA> green-tan~ orange 600 herm~
## 2 Kaleesh 159 Griev~ 216 none brown, wh~ green, y~ NA male
## 3 Droid 140 IG-88 200 none metal red 15 none
## 4 Human 136 Darth~ 202 none white yellow 41.9 male
## 5 Wookiee 136 Tarff~ 234 brown brown blue NA male
## 6 Human 120 Owen ~ 178 brown, gr~ light blue 52 male
## 7 Trandos~ 113 Bossk 190 none green red 53 male
## 8 Wookiee 112 Chewb~ 228 brown unknown blue 200 male
## 9 Human 110 Jek T~ 180 brown fair blue NA male
## 10 Besalisk 102 Dexte~ 198 none brown yellow NA male
## # ... with 77 more rows, and 5 more variables: gender <chr>, homeworld <chr>,
## # films <list>, vehicles <list>, starships <list>
通过.by_group
参数控制进行排序的先后位置,下面这个例子就是先根据species进行排序,再根据mass进行排序,和上面的不一样哦!
by_species %>%
arrange(desc(mass), .by_group = T) %>%
relocate(species, mass)
## # A tibble: 87 x 14
## # Groups: species [38]
## species mass name height hair_color skin_color eye_color birth_year sex
## <chr> <dbl> <chr> <int> <chr> <chr> <chr> <dbl> <chr>
## 1 Aleena 15 Ratts~ 79 none grey, blue unknown NA male
## 2 Besalisk 102 Dexte~ 198 none brown yellow NA male
## 3 Cerean 82 Ki-Ad~ 198 white pale yellow 92 male
## 4 Chagrian NA Mas A~ 196 none blue blue NA male
## 5 Clawdite 55 Zam W~ 168 blonde fair, gre~ yellow NA fema~
## 6 Droid 140 IG-88 200 none metal red 15 none
## 7 Droid 75 C-3PO 167 <NA> gold yellow 112 none
## 8 Droid 32 R2-D2 96 <NA> white, bl~ red 33 none
## 9 Droid 32 R5-D4 97 <NA> white, red red NA none
## 10 Droid NA R4-P17 96 none silver, r~ red, blue NA none
## # ... with 77 more rows, and 5 more variables: gender <chr>, homeworld <chr>,
## # films <list>, vehicles <list>, starships <list>
muatate()
and transmutate()
starwars %>%
select(name, homeworld, mass) %>%
group_by(homeworld) %>%
mutate(means = mean(mass, na.rm = T),
standard_mass = mass - mean(mass, na.rm = T))
## # A tibble: 87 x 5
## # Groups: homeworld [49]
## name homeworld mass means standard_mass
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 Luke Skywalker Tatooine 77 85.4 -8.38
## 2 C-3PO Tatooine 75 85.4 -10.4
## 3 R2-D2 Naboo 32 64.2 -32.2
## 4 Darth Vader Tatooine 136 85.4 50.6
## 5 Leia Organa Alderaan 49 64 -15
## 6 Owen Lars Tatooine 120 85.4 34.6
## 7 Beru Whitesun lars Tatooine 75 85.4 -10.4
## 8 R5-D4 Tatooine 32 85.4 -53.4
## 9 Biggs Darklighter Tatooine 84 85.4 -1.38
## 10 Obi-Wan Kenobi Stewjon 77 77 0
## # ... with 77 more rows
min_rank()
函数返回顺序(秩次):
# Overall rank
starwars %>%
select(name, homeworld, height) %>%
mutate(rank = min_rank(height))
## # A tibble: 87 x 4
## name homeworld height rank
## <chr> <chr> <int> <int>
## 1 Luke Skywalker Tatooine 172 29
## 2 C-3PO Tatooine 167 21
## 3 R2-D2 Naboo 96 5
## 4 Darth Vader Tatooine 202 72
## 5 Leia Organa Alderaan 150 11
## 6 Owen Lars Tatooine 178 35
## 7 Beru Whitesun lars Tatooine 165 17
## 8 R5-D4 Tatooine 97 7
## 9 Biggs Darklighter Tatooine 183 45
## 10 Obi-Wan Kenobi Stewjon 182 44
## # ... with 77 more rows
先根据homeworld进行分组,再新建列:
# Rank per homeworld
starwars %>%
select(name, homeworld, height) %>%
group_by(homeworld) %>%
mutate(rank = min_rank(height))
## # A tibble: 87 x 4
## # Groups: homeworld [49]
## name homeworld height rank
## <chr> <chr> <int> <int>
## 1 Luke Skywalker Tatooine 172 5
## 2 C-3PO Tatooine 167 4
## 3 R2-D2 Naboo 96 1
## 4 Darth Vader Tatooine 202 10
## 5 Leia Organa Alderaan 150 1
## 6 Owen Lars Tatooine 178 6
## 7 Beru Whitesun lars Tatooine 165 3
## 8 R5-D4 Tatooine 97 1
## 9 Biggs Darklighter Tatooine 183 7
## 10 Obi-Wan Kenobi Stewjon 182 1
## # ... with 77 more rows
filter()
筛选每个物种(species)中最高(height)的那一个:
by_species %>%
select(name, species, height) %>%
filter(height == max(height))
## # A tibble: 35 x 3
## # Groups: species [35]
## name species height
## <chr> <chr> <int>
## 1 Greedo Rodian 173
## 2 Jabba Desilijic Tiure Hutt 175
## 3 Yoda Yoda's species 66
## 4 Bossk Trandoshan 190
## 5 Ackbar Mon Calamari 180
## 6 Wicket Systri Warrick Ewok 88
## 7 Nien Nunb Sullustan 160
## 8 Nute Gunray Neimodian 191
## 9 Roos Tarpals Gungan 224
## 10 Watto Toydarian 137
## # ... with 25 more rows
去掉只有1个成员的物种:
by_species %>%
filter(n() != 1) %>%
tally()
## # A tibble: 9 x 2
## species n
## <chr> <int>
## 1 Droid 6
## 2 Gungan 3
## 3 Human 35
## 4 Kaminoan 2
## 5 Mirialan 2
## 6 Twi'lek 2
## 7 Wookiee 2
## 8 Zabrak 2
## 9 <NA> 4
以上就是今天的内容,希望对你有帮助哦!欢迎点赞、在看、关注、转发!
欢迎扫描二维码加 QQ群 :613637742
欢迎关注公众号:医学和生信笔记
往期回顾
R语言和医学统计学系列(11):球形检验
R语言tidy风格医学统计学
R语言多个变量同时进行t检验、方差分析等
1万个基因批量wilcoxon检验?
R语言机器学习R包:mlr3(合辑)
使用zotero和obsidian管理和阅读文献