Handling Factors in R: An Introduction to the forcats Package (3)
Today we continue with the forcats package. This is the third post in the forcats series.
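Changing the values of factor levels
These functions change the values of the levels while preserving the original order as far as possible.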
2.1 fct_anon()
Replaces the factor levels with arbitrary numeric identifiers. Neither the values nor the order of the levels are preserved.
library(forcats)
library(magrittr)  # provides the %>% pipe
gss_cat$relig %>% fct_count()
## # A tibble: 16 x 2
## f n
## <fct> <int>
## 1 No answer 93
## 2 Don't know 15
## 3 Inter-nondenominational 109
## 4 Native american 23
## 5 Christian 689
## 6 Orthodox-christian 95
## 7 Moslem/islam 104
## 8 Other eastern 32
## 9 Hinduism 71
## 10 Buddhism 147
## 11 Other 224
## 12 None 3523
## 13 Jewish 388
## 14 Catholic 5124
## 15 Protestant 10846
## 16 Not applicable 0
gss_cat$relig %>% fct_anon() %>% fct_count()
## # A tibble: 16 x 2
## f n
## <fct> <int>
## 1 01 32
## 2 02 224
## 3 03 93
## 4 04 3523
## 5 05 689
## 6 06 5124
## 7 07 10846
## 8 08 104
## 9 09 109
## 10 10 147
## 11 11 23
## 12 12 71
## 13 13 388
## 14 14 0
## 15 15 15
## 16 16 95
gss_cat$relig %>% fct_anon("X") %>% fct_count()
## # A tibble: 16 x 2
## f n
## <fct> <int>
## 1 X01 109
## 2 X02 5124
## 3 X03 224
## 4 X04 3523
## 5 X05 95
## 6 X06 0
## 7 X07 689
## 8 X08 93
## 9 X09 32
## 10 X10 147
## 11 X11 15
## 12 X12 71
## 13 X13 388
## 14 X14 104
## 15 X15 23
## 16 X16 10846
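Because the identifiers are assigned at random, the anonymised outputs above will differ from run to run. The sketch below assumes that fct_anon() draws its permutation from R's random number generator, which is my understanding of current forcats versions, so fixing the seed should make the result repeatable. The factor f is a made-up example, not data from gss_cat.

```r
library(forcats)

# Hypothetical small factor: "a" once, "b" twice, "c" three times
f <- factor(c("a", "b", "b", "c", "c", "c"))

# Same seed before each call, so the random relabelling should match
set.seed(1)
a1 <- fct_anon(f)
set.seed(1)
a2 <- fct_anon(f)

identical(a1, a2)

# The labels change, but the set of counts (1, 2, 3) is untouched
sort(as.integer(table(a1)))
```

Only the labels are scrambled; the counts per level stay the same, which is why the n column above is a permutation of the original one.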
2.2 fct_collapse()
Put simply, it groups factor levels into manually defined groups.
fct_count(gss_cat$partyid)
## # A tibble: 10 x 2
## f n
## <fct> <int>
## 1 No answer 154
## 2 Don't know 1
## 3 Other party 393
## 4 Strong republican 2314
## 5 Not str republican 3032
## 6 Ind,near rep 1791
## 7 Independent 4119
## 8 Ind,near dem 2499
## 9 Not str democrat 3690
## 10 Strong democrat 3490
There are 10 rows, that is, 10 levels. We can now collapse these 10 levels into new, manually defined groups:
partyid2 <- fct_collapse(gss_cat$partyid,
missing = c("No answer", "Don't know"),
rep = c("Strong republican", "Not str republican"),
other = "Other party",
ind = c("Ind,near rep", "Independent", "Ind,near dem"),
dem = c("Not str democrat", "Strong democrat")
)
fct_count(partyid2)
## # A tibble: 5 x 2
## f n
## <fct> <int>
## 1 missing 155
## 2 other 393
## 3 rep 5346
## 4 ind 8409
## 5 dem 7180
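When only a few groups matter, it can be tedious to list every remaining level by hand. As a sketch, assuming a forcats version in which fct_collapse() has the other_level argument, the leftover levels can be gathered automatically. The factor party below is a small made-up example that reuses some of the partyid labels.

```r
library(forcats)

# Hypothetical small factor reusing some of the partyid labels
party <- factor(c("Strong republican", "Independent", "Ind,near dem",
                  "Strong democrat", "No answer", "Strong democrat"))

# Name only the groups of interest; other_level collects every level not listed
party2 <- fct_collapse(party,
  rep = "Strong republican",
  dem = "Strong democrat",
  other_level = "other"
)

table(party2)
```

Here "Independent", "Ind,near dem" and "No answer" all end up in "other" without being spelled out.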
2.3 fct_lump()
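This is a family of functions that lump levels meeting some condition into a single group. If you often do machine learning or statistical modelling, you will frequently need to fold the low-share groups into an "Other" group. pandas in Python makes this easy, and R can do it too.
fct_lump_min(): lumps levels that appear fewer than min times into the other level.
fct_lump_prop(): lumps levels whose share of the observations is at most prop into the other level.
fct_lump_n(): keeps the n most frequent levels and lumps the rest together (if n < 0, it keeps the -n least frequent levels instead).
fct_lump_lowfreq(): lumps the least frequent levels together, ensuring that the other level remains the smallest level.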
x <- factor(rep(LETTERS[1:9], times = c(40, 10, 5, 27, 1, 1, 1, 1, 1)))
x %>% table()
## .
## A B C D E F G H I
## 40 10 5 27 1 1 1 1 1
Keep the 3 most frequent levels and lump the rest together:
x %>% fct_lump_n(3) %>% table() # ties are handled by ties.method, one of "min", "average", "first", "last", "random", "max"
## .
## A B D Other
## 40 10 27 10
Keep the 3 least frequent levels (E to I are tied at one observation each, so all five are kept):
x %>% fct_lump_n(-3) %>% table()
## .
## E F G H I Other
## 1 1 1 1 1 82
Lump levels whose share is below 0.1 into one group:
x %>% fct_lump_prop(0.1) %>% table()
## .
## A B D Other
## 40 10 27 10
Lump levels that appear fewer than 2 times into the other level (labelled "其他" here, Chinese for "other"):
x %>% fct_lump_min(2, other_level = "其他") %>% table()
## .
## A B C D 其他
## 40 10 5 27 5
Lump the low-frequency levels into the other level while making sure the other level is still the smallest one:
x %>% fct_lump_lowfreq() %>% table()
## .
## A D Other
## 40 27 20
2.4 fct_other()
Lumps the specified levels into an other level by hand, similar to fct_lump.
x <- factor(rep(LETTERS[1:9], times = c(40, 10, 5, 27, 1, 1, 1, 1, 1)))
# Keep A and B; lump everything else into one level
fct_other(x, keep = c("A", "B"), other_level = "other")
## [1] A A A A A A A A A A A A
## [13] A A A A A A A A A A A A
## [25] A A A A A A A A A A A A
## [37] A A A A B B B B B B B B
## [49] B B other other other other other other other other other other
## [61] other other other other other other other other other other other other
## [73] other other other other other other other other other other other other
## [85] other other other
## Levels: A B other
# Lump A and B into one level; keep everything else
fct_other(x, drop = c("A", "B"), other_level = "hhahah")
## [1] hhahah hhahah hhahah hhahah hhahah hhahah hhahah hhahah hhahah hhahah
## [11] hhahah hhahah hhahah hhahah hhahah hhahah hhahah hhahah hhahah hhahah
## [21] hhahah hhahah hhahah hhahah hhahah hhahah hhahah hhahah hhahah hhahah
## [31] hhahah hhahah hhahah hhahah hhahah hhahah hhahah hhahah hhahah hhahah
## [41] hhahah hhahah hhahah hhahah hhahah hhahah hhahah hhahah hhahah hhahah
## [51] C C C C C D D D D D
## [61] D D D D D D D D D D
## [71] D D D D D D D D D D
## [81] D D E F G H I
## Levels: C D E F G H I hhahah
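The printed vectors above are long, so a count table can be an easier way to confirm what keep and drop did. A small sketch on the same x, using the default other_level of "Other"; the names kept and dropped are illustrative.

```r
library(forcats)

x <- factor(rep(LETTERS[1:9], times = c(40, 10, 5, 27, 1, 1, 1, 1, 1)))

# keep: A and B survive, the remaining 37 observations become "Other"
kept <- fct_other(x, keep = c("A", "B"))
table(kept)

# drop: A and B (50 observations) become "Other", the rest survive
dropped <- fct_other(x, drop = c("A", "B"))
table(dropped)
```

In both cases the other level is placed last among the levels.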
2.5 fct_recode()
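Changes factor levels by hand. Each argument has the form new = "old"; setting the new name to NULL turns that level into NA.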
x <- factor(c("apple", "bear", "banana", "dear"))
x
## [1] apple bear banana dear
## Levels: apple banana bear dear
fct_recode(x, fruit = "apple", fruit = "banana")
## [1] fruit bear fruit dear
## Levels: fruit bear dear
fct_recode(x, NULL = "apple", fruit = "banana")
## [1] <NA> bear fruit dear
## Levels: fruit bear dear
fct_recode(x, "an apple" = "apple", "a bear" = "bear")
## [1] an apple a bear banana dear
## Levels: an apple banana a bear dear
x <- factor(c("apple", "bear", "banana", "dear"))
levels <- c(fruit = "apple", fruit = "banana")
fct_recode(x, !!!levels)
## [1] fruit bear fruit dear
## Levels: fruit bear dear
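The direction of the mapping (names are the new levels, values are the old ones) is easy to get backwards when the mapping comes from a lookup table. The sketch below builds the named vector with setNames() and splices it in with !!!. The lookup data frame and its column names are hypothetical.

```r
library(forcats)

x <- factor(c("apple", "bear", "banana", "dear"))

# Hypothetical lookup table: one row per old level to be renamed
lookup <- data.frame(
  old = c("apple", "banana", "bear"),
  new = c("fruit", "fruit", "animal")
)

# fct_recode() expects new = "old", so the new labels go into the names
mapping <- setNames(lookup$old, lookup$new)

y <- fct_recode(x, !!!mapping)
as.character(y)
levels(y)
```

Levels that do not appear in the mapping, such as "dear" here, are left unchanged.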
2.6 fct_relabel()
Relabels the factor levels by applying a function to them.
gss_cat$partyid %>% fct_count()
## # A tibble: 10 x 2
## f n
## <fct> <int>
## 1 No answer 154
## 2 Don't know 1
## 3 Other party 393
## 4 Strong republican 2314
## 5 Not str republican 3032
## 6 Ind,near rep 1791
## 7 Independent 4119
## 8 Ind,near dem 2499
## 9 Not str democrat 3690
## 10 Strong democrat 3490
gss_cat$partyid %>% fct_relabel(~ gsub(",", ", ", .x)) %>% fct_count()
## # A tibble: 10 x 2
## f n
## <fct> <int>
## 1 No answer 154
## 2 Don't know 1
## 3 Other party 393
## 4 Strong republican 2314
## 5 Not str republican 3032
## 6 Ind, near rep 1791
## 7 Independent 4119
## 8 Ind, near dem 2499
## 9 Not str democrat 3690
## 10 Strong democrat 3490
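Any function that takes a character vector of levels and returns a character vector of the same length can be passed in, not only a formula. As a small sketch with a made-up factor, note that levels which become identical after relabelling are merged:

```r
library(forcats)

# Hypothetical factor with levels differing only in case
f <- factor(c("a", "A", "b"))

# toupper() maps "a" and "A" to the same label, so they collapse into one level
g <- fct_relabel(f, toupper)

levels(g)
table(g)
```

This makes fct_relabel() handy for cleaning up inconsistent capitalisation or stray whitespace in labels.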
That is all for today.