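Handling factors in R: an introduction to the forcats package (2)
Today we continue with the forcats package. The previous post gave an overview of what the package contains; from here on we go through each function in detail.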
Changing the order of factor levels
1.1 fct_relevel()
library(forcats)
## Create a factor
f <- factor(c("a", "b", "c", "d"), levels = c("b", "c", "d", "a"))
f
## [1] a b c d
## Levels: b c d a
## Move c and d to the 1st and 2nd positions
fct_relevel(f, c("c", "d"))
## [1] a b c d
## Levels: c d b a
## Move `a` to the 3rd position
fct_relevel(f, "a", after = 2)
## [1] a b c d
## Levels: b c a d
## Move `a` to the last position
fct_relevel(f, "a", after = Inf)
## [1] a b c d
## Levels: b c d a
## Reorder the levels with a function
fct_relevel(f, sort)
## [1] a b c d
## Levels: a b c d
## Note: the function is applied to the levels, i.e. `sort(levels(f))`, not to `sort(f)`
## Random order (the result differs from run to run unless a seed is set)
fct_relevel(f, sample)
## [1] a b c d
## Levels: a b c d
## Reverse the order
fct_relevel(f, rev)
## [1] a b c d
## Levels: a d c b
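The `after` argument is perhaps easiest to read as "take the named levels out, then put them back after position `after` of whatever remains". The base R sketch below reproduces the level orders shown above under that reading. `relevel_sketch()` is a hypothetical helper written for this post, not the forcats implementation, and it does no checking for unknown levels.

```r
# Hypothetical helper (not part of forcats): move `lvls` so that they sit
# after position `after` among the remaining levels (0 means "put them first").
relevel_sketch <- function(f, lvls, after = 0L) {
  rest <- setdiff(levels(f), lvls)   # levels that keep their relative order
  factor(f, levels = append(rest, lvls, after = after))
}

f <- factor(c("a", "b", "c", "d"), levels = c("b", "c", "d", "a"))

levels(relevel_sketch(f, c("c", "d")))       # "c" "d" "b" "a"
levels(relevel_sketch(f, "a", after = 2))    # "b" "c" "a" "d"
levels(relevel_sketch(f, "a", after = Inf))  # "b" "c" "d" "a"
```

Note that the position is counted among the levels that are left after the moved ones have been taken out, which is why `after = 2` puts `a` third.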
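Below is an example that looks complicated but really is not. It uses the built-in data set gss_cat, keeping only 2 of its columns; the goal is to move the level Don't know to the end in each column.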
## First look at the original factor levels
df <- forcats::gss_cat[, c("rincome", "denom")]
lapply(df, levels) # apply `levels()` to every column of df
## $rincome
## [1] "No answer" "Don't know" "Refused" "$25000 or more"
## [5] "$20000 - 24999" "$15000 - 19999" "$10000 - 14999" "$8000 to 9999"
## [9] "$7000 to 7999" "$6000 to 6999" "$5000 to 5999" "$4000 to 4999"
## [13] "$3000 to 3999" "$1000 to 2999" "Lt $1000" "Not applicable"
##
## $denom
## [1] "No answer" "Don't know" "No denomination"
## [4] "Other" "Episcopal" "Presbyterian-dk wh"
## [7] "Presbyterian, merged" "Other presbyterian" "United pres ch in us"
## [10] "Presbyterian c in us" "Lutheran-dk which" "Evangelical luth"
## [13] "Other lutheran" "Wi evan luth synod" "Lutheran-mo synod"
## [16] "Luth ch in america" "Am lutheran" "Methodist-dk which"
## [19] "Other methodist" "United methodist" "Afr meth ep zion"
## [22] "Afr meth episcopal" "Baptist-dk which" "Other baptists"
## [25] "Southern baptist" "Nat bapt conv usa" "Nat bapt conv of am"
## [28] "Am bapt ch in usa" "Am baptist asso" "Not applicable"
Each column contains a Don't know level. We want to move it to the end, and we get to practise lapply along the way.
# Apply `fct_relevel(..., "Don't know", after = Inf)` to every column of df
df2 <- lapply(df, fct_relevel, "Don't know", after = Inf)
lapply(df2, levels) # "Don't know" is now the last level in both columns
## $rincome
## [1] "No answer" "Refused" "$25000 or more" "$20000 - 24999"
## [5] "$15000 - 19999" "$10000 - 14999" "$8000 to 9999" "$7000 to 7999"
## [9] "$6000 to 6999" "$5000 to 5999" "$4000 to 4999" "$3000 to 3999"
## [13] "$1000 to 2999" "Lt $1000" "Not applicable" "Don't know"
##
## $denom
## [1] "No answer" "No denomination" "Other"
## [4] "Episcopal" "Presbyterian-dk wh" "Presbyterian, merged"
## [7] "Other presbyterian" "United pres ch in us" "Presbyterian c in us"
## [10] "Lutheran-dk which" "Evangelical luth" "Other lutheran"
## [13] "Wi evan luth synod" "Lutheran-mo synod" "Luth ch in america"
## [16] "Am lutheran" "Methodist-dk which" "Other methodist"
## [19] "United methodist" "Afr meth ep zion" "Afr meth episcopal"
## [22] "Baptist-dk which" "Other baptists" "Southern baptist"
## [25] "Nat bapt conv usa" "Nat bapt conv of am" "Am bapt ch in usa"
## [28] "Am baptist asso" "Not applicable" "Don't know"
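One detail worth knowing about the `lapply()` call above: it returns a plain list, not a data frame. If you want to keep the data frame, a common base R idiom is to assign into `df[]`. The sketch below uses a small made-up data frame (`toy`, invented for illustration) so that it runs quickly.

```r
library(forcats)

# Made-up data for illustration: both columns contain a "Don't know" level
toy <- data.frame(
  income = factor(c("Don't know", "Low", "High"),
                  levels = c("Don't know", "Low", "High")),
  church = factor(c("Other", "Don't know", "None"),
                  levels = c("Don't know", "None", "Other"))
)

# lapply() on its own gives back a list ...
as_list <- lapply(toy, fct_relevel, "Don't know", after = Inf)
class(as_list)

# ... whereas assigning into toy[] keeps the data.frame structure
toy[] <- lapply(toy, fct_relevel, "Don't know", after = Inf)
class(toy)
lapply(toy, levels)
```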
If you name a level that does not exist, you get a warning (not an error) and the levels are left unchanged:
fct_relevel(f, "e")
## Warning: Unknown levels in `f`: e
## [1] a b c d
## Levels: b c d a
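Because this is a warning rather than an error, a script keeps running and a typo in a level name can go unnoticed. A minimal sketch of how one might detect that situation programmatically; the flag `got_warning` is just a name chosen for this example.

```r
library(forcats)

f <- factor(c("a", "b", "c", "d"), levels = c("b", "c", "d", "a"))

got_warning <- FALSE
res <- withCallingHandlers(
  fct_relevel(f, "e"),                 # "e" is not a level of f
  warning = function(w) {
    got_warning <<- TRUE               # remember that a warning was raised
    invokeRestart("muffleWarning")     # and keep it out of the console
  }
)

got_warning                        # TRUE
identical(levels(res), levels(f))  # TRUE: the levels were not changed
```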
1.2 fct_inorder()/fct_infreq()/fct_inseq()
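These three functions belong to the same family: each reorders the levels, and they differ only in the ordering rule.
fct_inorder(): by the order in which the levels first appear
fct_infreq(): by how often each level occurs (most frequent first)
fct_inseq(): by the numeric value of the levels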
f <- factor(c("b", "b", "a", "c", "c", "c"))
f # levels are in alphabetical order by default
## [1] b b a c c c
## Levels: a b c
fct_inorder(f) # by order of first appearance
## [1] b b a c c c
## Levels: b a c
fct_infreq(f) # by frequency, most frequent first
## [1] b b a c c c
## Levels: c b a
f <- factor(1:3, levels = c("3", "2", "1"))
f
## [1] 1 2 3
## Levels: 3 2 1
fct_inseq(f) # by numeric value, even though the levels were defined as "3", "2", "1"
## [1] 1 2 3
## Levels: 1 2 3
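To make the three ordering rules concrete, here is a rough base R equivalent of each one. This is only a sketch of the idea (variable names such as `lv_inorder` are made up here), not how forcats computes the result, and it ignores edge cases such as missing values.

```r
f <- factor(c("b", "b", "a", "c", "c", "c"))

# fct_inorder(): levels in order of first appearance
lv_inorder <- unique(as.character(f))

# fct_infreq(): levels by decreasing frequency
counts <- table(f)
lv_infreq <- names(counts)[order(as.vector(counts), decreasing = TRUE)]

# fct_inseq(): levels by their numeric value
g <- factor(1:3, levels = c("3", "2", "1"))
lv_inseq <- levels(g)[order(as.numeric(levels(g)))]

lv_inorder  # "b" "a" "c"
lv_infreq   # "c" "b" "a"
lv_inseq    # "1" "2" "3"
```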
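An example that is very handy when plotting.
Suppose you drew the following plot:
library(ggplot2)
library(dplyr) # provides the starwars data set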
ggplot(starwars, aes(x = hair_color)) +
geom_bar() +
coord_flip()
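But this is not quite what you wanted: you would like the bars ordered by their counts. You can either fix the order before plotting, or do it inline like this: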
ggplot(starwars, aes(x = fct_infreq(hair_color))) +
geom_bar() +
coord_flip()
Problem solved!
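1.3 fct_reorder()/fct_reorder2()/last2()/first2()
fct_reorder() is useful for 1d displays where the factor is mapped to position. fct_reorder2() is meant for 2d displays where the factor is mapped to a non-position aesthetic such as colour. last2() and first2() are helper functions for fct_reorder2(): last2() finds the last value of y when sorted by x, and first2() finds the first value.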
## Build a simple tibble
df <- tibble::tribble(
~color, ~a, ~b,
"blue", 1, 2,
"green", 6, 2,
"purple", 3, 3,
"red", 2, 3,
"yellow", 5, 1
)
## Convert the color column to a factor and look at its level order
df$color <- factor(df$color)
df$color
## [1] blue green purple red yellow
## Levels: blue green purple red yellow
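Reorder color by the values in column a, from smallest to largest. The values are unchanged, but the levels of color are now in a different order: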
fct_reorder(df$color, df$a, min)
## [1] blue green purple red yellow
## Levels: blue red purple yellow green
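Conceptually, `fct_reorder()` summarises the second vector once per level and then sorts the levels by that summary. A base R sketch of the same idea, using the values from the tibble above (the variable names are made up for this example, and this is not the forcats source code):

```r
color <- factor(c("blue", "green", "purple", "red", "yellow"))
a     <- c(1, 6, 3, 2, 5)

# Step 1: one summary value per level (here the minimum, as in the call above)
per_level <- tapply(a, color, min)

# Step 2: sort the level names by that summary, smallest first
new_levels <- names(per_level)[order(per_level)]
new_levels   # "blue" "red" "purple" "yellow" "green"

# Step 3: rebuild the factor with the new level order; the values do not change
reordered <- factor(color, levels = new_levels)
reordered
```

With several observations per level, the choice of summary function matters: `min`, `median` and `max` can each give a different level order.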
A small plotting example with fct_reorder():
boxplot(Sepal.Width ~ Species, data = iris)
boxplot(Sepal.Width ~ fct_reorder(Species, Sepal.Width), data = iris)
boxplot(Sepal.Width ~ fct_reorder(Species, Sepal.Width, .desc = TRUE), data = iris)
fct_reorder2(df$color, df$a, df$b)
## [1] blue green purple red yellow
## Levels: purple red blue green yellow
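To see what "the last value of y when sorted by x" means, here is a small base R sketch. `last2_sketch()` and `first2_sketch()` are hypothetical re-implementations written for this post, and the data frame `d` is made up; the second half mimics the idea of ordering levels by where each series ends, largest first.

```r
# Hypothetical re-implementations, for illustration only
last2_sketch  <- function(x, y) y[order(x)][length(y)]  # y at the largest x
first2_sketch <- function(x, y) y[order(x)][1]          # y at the smallest x

# Two made-up series: "A" ends at y = 5, "B" ends at y = 9
d <- data.frame(
  g = factor(c("A", "A", "A", "B", "B", "B")),
  x = c(3, 1, 2, 2, 3, 1),
  y = c(5, 1, 4, 6, 9, 2)
)

# Summarise each level by its last y value ...
ends <- sapply(split(d, d$g), function(s) last2_sketch(s$x, s$y))
ends

# ... then order the levels by that summary, largest first
lv <- names(ends)[order(ends, decreasing = TRUE)]
lv   # "B" "A"
```

In a line chart this puts the series that ends highest first in the legend, which is why the legend lines up with the right-hand end of the lines.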
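fct_reorder2() looks complicated, but all you need to remember is that it can come in handy when plotting: it makes the order of the legend match the order of the lines at the right-hand end of the plot. Here is a small example: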
chks <- subset(ChickWeight, as.integer(Chick) < 10)
chks <- transform(chks, Chick = fct_shuffle(Chick))
chks
## weight Time Chick Diet
## 85 42 0 8 1
## 86 50 2 8 1
## 87 61 4 8 1
## 88 71 6 8 1
## 89 84 8 8 1
## 90 93 10 8 1
## 91 110 12 8 1
## 92 116 14 8 1
## 93 126 16 8 1
## 94 134 18 8 1
## 95 125 20 8 1
## 96 42 0 9 1
## 97 51 2 9 1
## 98 59 4 9 1
## 99 68 6 9 1
## 100 85 8 9 1
## 101 96 10 9 1
## 102 90 12 9 1
## 103 92 14 9 1
## 104 93 16 9 1
## 105 100 18 9 1
## 106 100 20 9 1
## 107 98 21 9 1
## 108 41 0 10 1
## 109 44 2 10 1
## 110 52 4 10 1
## 111 63 6 10 1
## 112 74 8 10 1
## 113 81 10 10 1
## 114 89 12 10 1
## 115 96 14 10 1
## 116 101 16 10 1
## 117 112 18 10 1
## 118 120 20 10 1
## 119 124 21 10 1
## 144 41 0 13 1
## 145 48 2 13 1
## 146 53 4 13 1
## 147 60 6 13 1
## 148 65 8 13 1
## 149 67 10 13 1
## 150 71 12 13 1
## 151 70 14 13 1
## 152 71 16 13 1
## 153 81 18 13 1
## 154 91 20 13 1
## 155 96 21 13 1
## 168 41 0 15 1
## 169 49 2 15 1
## 170 56 4 15 1
## 171 64 6 15 1
## 172 68 8 15 1
## 173 68 10 15 1
## 174 67 12 15 1
## 175 68 14 15 1
## 176 41 0 16 1
## 177 45 2 16 1
## 178 49 4 16 1
## 179 51 6 16 1
## 180 57 8 16 1
## 181 51 10 16 1
## 182 54 12 16 1
## 183 42 0 17 1
## 184 51 2 17 1
## 185 61 4 17 1
## 186 72 6 17 1
## 187 83 8 17 1
## 188 89 10 17 1
## 189 98 12 17 1
## 190 103 14 17 1
## 191 113 16 17 1
## 192 123 18 17 1
## 193 133 20 17 1
## 194 142 21 17 1
## 195 39 0 18 1
## 196 35 2 18 1
## 209 41 0 20 1
## 210 47 2 20 1
## 211 54 4 20 1
## 212 58 6 20 1
## 213 65 8 20 1
## 214 73 10 20 1
## 215 77 12 20 1
## 216 89 14 20 1
## 217 98 16 20 1
## 218 107 18 20 1
## 219 115 20 20 1
## 220 117 21 20 1
ggplot(chks, aes(Time, weight, colour = Chick)) +
geom_point() +
geom_line()
# The legend order now matches the order of the lines
ggplot(chks, aes(Time, weight, colour = fct_reorder2(Chick, Time, weight))) +
geom_point() +
geom_line() +
labs(colour = "Chick")
1.4 fct_shuffle()
Randomly shuffle the order of the levels.
f <- factor(c("a", "b", "c"))
f
## [1] a b c
## Levels: a b c
set.seed(111)
fct_shuffle(f) # the order differs on every run unless you set a seed
## [1] a b c
## Levels: b a c
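A quick way to convince yourself of the role of the seed: resetting the same seed before each call gives the same level order, and only the levels are shuffled while the values stay put. A minimal sketch (the names `first_run` and `second_run` are just for this example):

```r
library(forcats)

f <- factor(c("a", "b", "c"))

set.seed(111)
first_run <- fct_shuffle(f)

set.seed(111)                 # same seed again
second_run <- fct_shuffle(f)

identical(levels(first_run), levels(second_run))     # TRUE: same shuffled order
identical(as.character(first_run), as.character(f))  # TRUE: values are untouched
```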
1.5 fct_rev()
Reverse the order of the levels.
f <- factor(c("a", "b", "c"))
f
## [1] a b c
## Levels: a b c
fct_rev(f)
## [1] a b c
## Levels: c b a
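One place where `fct_rev()` tends to be useful is a flipped bar chart like the one in section 1.2: after `coord_flip()` the first level is drawn at the bottom, so combining `fct_infreq()` with `fct_rev()` puts the most frequent category at the top. The sketch below uses a tiny made-up data frame (`hair`) instead of starwars.

```r
library(forcats)
library(ggplot2)

# Made-up data for illustration
hair <- data.frame(
  colour = c("brown", "brown", "brown", "black", "black", "blond")
)

# fct_infreq() puts the most frequent level first; fct_rev() reverses that
hair$colour <- fct_rev(fct_infreq(hair$colour))
levels(hair$colour)   # "blond" "black" "brown"

# After coord_flip() the last level ("brown") ends up at the top
p <- ggplot(hair, aes(x = colour)) +
  geom_bar() +
  coord_flip()
p
```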
1.6 fct_shift()
Shift the factor levels to the left or right, wrapping around at the ends; the default is one step to the left.
x <- factor(
c("Mon", "Tue", "Wed"),
levels = c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"),
ordered = TRUE
)
x
## [1] Mon Tue Wed
## Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat
fct_shift(x)
## [1] Mon Tue Wed
## Levels: Mon < Tue < Wed < Thu < Fri < Sat < Sun
fct_shift(x, 2)
## [1] Mon Tue Wed
## Levels: Tue < Wed < Thu < Fri < Sat < Sun < Mon
fct_shift(x, -1)
## [1] Mon Tue Wed
## Levels: Sat < Sun < Mon < Tue < Wed < Thu < Fri
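The shifting is a rotation of the level vector: levels that fall off one end come back in at the other. A base R sketch that reproduces the level orders above; `rotate_levels()` is a hypothetical helper written for this post, not the forcats implementation.

```r
# Hypothetical helper: rotate a character vector n steps to the left
# (negative n rotates to the right)
rotate_levels <- function(lv, n = 1L) {
  k <- n %% length(lv)   # %% gives a non-negative result for a positive modulus
  if (k == 0) return(lv)
  c(lv[-seq_len(k)], lv[seq_len(k)])
}

days <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")

rotate_levels(days)      # starts at "Mon", ends with "Sun"
rotate_levels(days, 2)   # starts at "Tue", ends with "Sun" "Mon"
rotate_levels(days, -1)  # starts at "Sat" "Sun", ends with "Fri"
```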
That's all for today.