数据的分组汇总-基于R的reshape包
reshape包中的函数提供了解决汇总问题的统一办法,该包的核心思想是创造一个“熔化”的数据集版本(通过melt函数),然后将其投入(cast函数)到一个所希望的目标对象中。
通过melt函数“熔化”一个数据框、列表或数组,首先需要将变量分成编号变量和分析变量。默认情况下,该函数将因子和整数值变量设为编号变量,其余变量为分析变量。
metl函数语法:
1)熔化一个数据框:id.vars指定编号变量,measure.vars指定分析变量
melt(data, id.vars, measure.vars,
variable_name = "variable", na.rm = !preserve.na, preserve.na = TRUE, ...)
2)熔化一个数组
melt(data, varnames = names(dimnames(data)), ...)
3)熔化一个列表
melt(data, ..., level=1)
应用:本文所采用的数据为R自带的数据集state.x77、iris及随机数生成的数据框
#生成美国50个州的人口、收入的数据
states <- data.frame(state = row.names(state.x77),region = state.region,
state.x77,row.names = 1:50)
#查看数据集
> head(states)
state region Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area
1 Alabama South 3615 3624 2.1 69.05 15.1 41.3 20 50708
2 Alaska West 365 6315 1.5 69.31 11.3 66.7 152 566432
3 Arizona West 2212 4530 1.8 70.55 7.8 58.1 15 113417
4 Arkansas South 2110 3378 1.9 70.66 10.1 39.9 65 51945
5 California West 21198 5114 1.1 71.71 10.3 62.6 20 156361
6 Colorado West 2541 4884 0.7 72.06 6.8 63.9 166 103766
#采用melt函数熔化数据框(states)
m_states <- melt(states)
#查看熔化后的数据结构
> head(m_states) #在不指定编号变量时,melt会显示被自动转为编号变量的变量名称
state region variable value
1 Alabama South Population 3615
2 Alaska West Population 365
3 Arizona West Population 2212
4 Arkansas South Population 2110
5 California West Population 21198
6 Colorado West Population 2541
#可以通过id.vars和measure.vars参数指定感兴趣的分析变量和分组变量
> head(melt(states,id.vars = 'state',measure.vars = 'Income'))
state variable value
1 Alabama Income 3624
2 Alaska Income 6315
3 Arizona Income 4530
4 Arkansas Income 3378
5 California Income 5114
6 Colorado Income 4884
#我们发现上面的“熔化”数据,除了指定的或默认的id变量,还会额外产生variable变量和value变量,这两个变量分别存放感兴趣的分析变量名称和实际的数值
通过melt函数,将数据框“熔化”后放入cast函数进行统计汇总。cast函数语法如下:
cast(data, formula = ... ~ variable, fun.aggregate=NULL, ...,
margins=FALSE, subset=TRUE, df=FALSE, fill=NULL, add.missing=FALSE,
value = guess_value(data))
其中data为一个“熔化”后的对象;formula为显式公式,公式左边代表输出结果的行变量,右边则代表输出结果的列变量;fun.aggregate为汇总函数,默认情况下使用length,最关键的是该参数可以指定一个自编函数。
应用:计算按地区分组的每个变量的均值
reg_mean1 <- cast(m_states,region~variable,mean)
reg_mean2 <- cast(m_states,variable~region,mean)
#查看以上两个结果,并比较formula参数颠倒的差异(前者以地区作为结果的行变量,后者以兴趣变量的名称作为行变量)
> head(reg_mean1)
region Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area
1 Northeast 5495.111 4570.222 1.000000 71.26444 4.722222 53.96667 132.7778 18141.00
2 South 4208.125 4011.938 1.737500 69.70625 10.581250 44.34375 64.6250 54605.12
3 North Central 4803.000 4611.083 0.700000 71.76667 5.275000 54.51667 138.8333 62652.00
4 West 2915.308 4702.615 1.023077 71.23462 7.215385 62.00000 102.1538 134463.00
> head(reg_mean2)
variable Northeast South North Central West
1 Population 5495.111111 4208.12500 4803.00000 2915.307692
2 Income 4570.222222 4011.93750 4611.08333 4702.615385
3 Illiteracy 1.000000 1.73750 0.70000 1.023077
4 Life.Exp 71.264444 69.70625 71.76667 71.234615
5 Murder 4.722222 10.58125 5.27500 7.215385
6 HS.Grad 53.966667 44.34375 54.51667 62.000000
#接下来,创建一个自编函数应用到cast函数中,本次使用到的数据集为iris
fun <- function(x) {
require('fBasics')
n = sum(!is.na(x))
nmiss = sum(is.na(x))
m = mean(x,na.rm = TRUE)
s = sd(x,na.rm = TRUE)
max = max(x,na.rm = TRUE)
min = min(x,na.rm = TRUE)
range = max-min
skew = skewness(x,na.rm = TRUE)
kurto = kurtosis(x,na.rm = TRUE)
return(c(n = n,nmiss = nmiss,mean = m,sd = s,max = max,min = min,range = range,skewness = skew,kurtosis = kurto))
}
#数据“熔化”与汇总
m_iris <- melt(iris)
#为了方便版面显示,这里将输出结果设置为列表格式(注意,我在variable前面加了.|)
summary_result <- cast(m_iris,variable~.|Species,fun)
> summary_result
$setosa
variable n nmiss mean sd max min range skewness kurtosis
1 Sepal.Length 50 0 5.006 0.3524897 5.8 4.3 1.5 0.11297784 -0.4508724
2 Sepal.Width 50 0 3.428 0.3790644 4.4 2.3 2.1 0.03872946 0.5959507
3 Petal.Length 50 0 1.462 0.1736640 1.9 1.0 0.9 0.10009538 0.6539303
4 Petal.Width 50 0 0.246 0.1053856 0.6 0.1 0.5 1.17963278 1.2587179
$versicolor
variable n nmiss mean sd max min range skewness kurtosis
1 Sepal.Length 50 0 5.936 0.5161711 7.0 4.9 2.1 0.09913926 -0.6939138
2 Sepal.Width 50 0 2.770 0.3137983 3.4 2.0 1.4 -0.34136443 -0.5493203
3 Petal.Length 50 0 4.260 0.4699110 5.1 3.0 2.1 -0.57060243 -0.1902555
4 Petal.Width 50 0 1.326 0.1977527 1.8 1.0 0.8 -0.02933377 -0.5873144
$virginica
variable n nmiss mean sd max min range skewness kurtosis
1 Sepal.Length 50 0 6.588 0.6358796 7.9 4.9 3.0 0.1110286 -0.2032597
2 Sepal.Width 50 0 2.974 0.3224966 3.8 2.2 1.6 0.3442849 0.3803832
3 Petal.Length 50 0 5.552 0.5518947 6.9 4.5 2.4 0.5169175 -0.3651161
4 Petal.Width 50 0 2.026 0.2746501 2.5 1.4 1.1 -0.1218119 -0.7539586
如果想选择特定的分析变量,可以通过subset参数实现,一般与%in%联合使用。a%in%b 表示a的元素是否为b的子集
#分析Sepal.Length变量的汇总信息
sl_summary <- cast(m_iris,Species~variable,fun,subset = variable %in% 'Sepal.Length')
> sl_summary
Species Sepal.Length_n Sepal.Length_nmiss Sepal.Length_mean Sepal.Length_sd Sepal.Length_max Sepal.Length_min
1 setosa 50 0 5.006 0.3524897 5.8 4.3
2 versicolor 50 0 5.936 0.5161711 7.0 4.9
3 virginica 50 0 6.588 0.6358796 7.9 4.9
Sepal.Length_range Sepal.Length_skewness Sepal.Length_kurtosis
1 1.5 0.11297784 -0.4508724
2 2.1 0.09913926 -0.6939138
3 3.0 0.11102862 -0.2032597
对于多个变量的分组统计,cast函数中显式公式的左边或右边用+连接多个分组变量
应用:使用随机数函数生成泊松分布的离散变量
#模拟数据的生成
data <- data.frame(x = rpois(100,2),y = rpois(100,3),z = runif(100,10,20))
> head(data)
x y z
1 1 2 14.48573
2 2 3 16.50317
3 2 2 19.27258
4 1 4 18.86011
5 2 4 15.97173
6 3 4 10.88769
#数据“熔化”
m_data <- melt(data,measure.vars = 'z')
> head(m_data)
x y variable value
1 1 2 z 14.48573
2 2 3 z 16.50317
3 2 2 z 19.27258
4 1 4 z 18.86011
5 2 4 z 15.97173
6 3 4 z 10.88769
#多变量分组统计
summary_data <- cast(m_data,x+y~variable,c(mean,min,max,median))
> head(summary_data)
x y z_mean z_min z_max z_median
1 0 0 17.90826 17.90826 17.90826 17.90826
2 0 2 17.68077 17.68077 17.68077 17.68077
3 0 3 14.43726 12.43668 16.96619 13.90891
4 0 4 16.54500 13.90112 19.18888 16.54500
5 0 5 12.87987 12.87987 12.87987 12.87987
6 0 6 16.51866 16.51866 16.51866 16.51866
#需要注意的是,表达式左边的最后一个变量是变化最快的
最后再介绍cast函数中显式公式的几种变形:
#以y的每一个值单独成一个列,统计x分组下的汇总值
reshape1 <- cast(m_data,x~y+variable,mean)
> reshape1
x 0_z 1_z 2_z 3_z 4_z 5_z 6_z 7_z
1 0 17.90826 NaN 17.68077 14.43726 16.54500 12.87987 16.51866 NaN
2 1 15.66487 15.67371 15.74976 14.99201 15.58618 17.01104 11.25401 15.05334
3 2 NaN 14.34325 14.31375 15.82231 16.09901 15.45799 NaN NaN
4 3 13.60504 16.51666 16.03237 17.17504 15.41220 11.25861 11.23605 NaN
5 4 16.86313 14.29701 12.59724 NaN 10.60965 14.90316 NaN NaN
6 5 NaN 18.48920 19.61777 NaN NaN NaN NaN 14.27881
7 7 15.55771 NaN NaN NaN NaN NaN NaN NaN
8 8 NaN NaN 12.95656 NaN NaN NaN NaN NaN
#这里的NaN表示x和y的组合下没有z的值
#用竖线(|)隔开variable和y,返回列表形式,y值为列表的元素,元素内容又以x分组统计
reshape2 <- cast(m_data,x~variable|y,mean)
> reshape2
$`0`
x z
1 0 17.90826
2 1 15.66487
3 3 13.60504
4 4 16.86313
5 7 15.55771
$`1`
x z
1 1 15.67371
2 2 14.34325
3 3 16.51666
4 4 14.29701
5 5 18.48920
...
...
...