查看原文
其他

数据的分组汇总-基于R的reshape包

2015-09-14 刘顺祥 每天进步一点点2015

reshape包中的函数提供了解决汇总问题的统一办法,该包的核心思想是创造一个“熔化”的数据集版本(通过melt函数),然后将其投入(cast函数)到一个所希望的目标对象中。

通过melt函数“熔化”一个数据框、列表或数组,首先需要将变量分成编号变量分析变量默认情况下,该函数将因子和整数值变量设为编号变量,其余变量为分析变量。

metl函数语法:

1)熔化一个数据框:id.vars指定编号变量,measure.vars指定分析变量

melt(data, id.vars, measure.vars,

variable_name = "variable", na.rm = !preserve.na, preserve.na = TRUE, ...)

2)熔化一个数组

melt(data, varnames = names(dimnames(data)), ...)

3)熔化一个列表

melt(data, ..., level=1)


应用:本文所采用的数据为R自带的数据集state.x77、iris及随机数生成的数据框


#生成美国50个州的人口、收入的数据

states <- data.frame(state = row.names(state.x77),region = state.region,

state.x77,row.names = 1:50)

#查看数据集

> head(states)

state region Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area

1 Alabama South 3615 3624 2.1 69.05 15.1 41.3 20 50708

2 Alaska West 365 6315 1.5 69.31 11.3 66.7 152 566432

3 Arizona West 2212 4530 1.8 70.55 7.8 58.1 15 113417

4 Arkansas South 2110 3378 1.9 70.66 10.1 39.9 65 51945

5 California West 21198 5114 1.1 71.71 10.3 62.6 20 156361

6 Colorado West 2541 4884 0.7 72.06 6.8 63.9 166 103766

#采用melt函数熔化数据框(states)

m_states <- melt(states)

#查看熔化后的数据结构

> head(m_states) #在不指定编号变量时,melt会显示被自动转为编号变量的变量名称

state region variable value

1 Alabama South Population 3615

2 Alaska West Population 365

3 Arizona West Population 2212

4 Arkansas South Population 2110

5 California West Population 21198

6 Colorado West Population 2541

#可以通过id.vars和measure.vars参数指定感兴趣的分析变量和分组变量

> head(melt(states,id.vars = 'state',measure.vars = 'Income'))

state variable value

1 Alabama Income 3624

2 Alaska Income 6315

3 Arizona Income 4530

4 Arkansas Income 3378

5 California Income 5114

6 Colorado Income 4884

#我们发现上面的“熔化”数据,除了指定的或默认的id变量,还会额外产生variable变量和value变量,这两个变量分别存放感兴趣的分析变量名称和实际的数值


通过melt函数,将数据框“熔化”后放入cast函数进行统计汇总。cast函数语法如下:

cast(data, formula = ... ~ variable, fun.aggregate=NULL, ...,

margins=FALSE, subset=TRUE, df=FALSE, fill=NULL, add.missing=FALSE,

value = guess_value(data))

其中data为一个“熔化”后的对象;formula为显式公式,公式左边代表输出结果的行变量,右边则代表输出结果的列变量;fun.aggregate为汇总函数,默认情况下使用length,最关键的是该参数可以指定一个自编函数。


应用:计算按地区分组的每个变量的均值

reg_mean1 <- cast(m_states,region~variable,mean)

reg_mean2 <- cast(m_states,variable~region,mean)

#查看以上两个结果,并比较formula参数颠倒的差异(前者以地区作为结果的行变量,后者以兴趣变量的名称作为行变量)

> head(reg_mean1)

region Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area

1 Northeast 5495.111 4570.222 1.000000 71.26444 4.722222 53.96667 132.7778 18141.00

2 South 4208.125 4011.938 1.737500 69.70625 10.581250 44.34375 64.6250 54605.12

3 North Central 4803.000 4611.083 0.700000 71.76667 5.275000 54.51667 138.8333 62652.00

4 West 2915.308 4702.615 1.023077 71.23462 7.215385 62.00000 102.1538 134463.00

> head(reg_mean2)

variable Northeast South North Central West

1 Population 5495.111111 4208.12500 4803.00000 2915.307692

2 Income 4570.222222 4011.93750 4611.08333 4702.615385

3 Illiteracy 1.000000 1.73750 0.70000 1.023077

4 Life.Exp 71.264444 69.70625 71.76667 71.234615

5 Murder 4.722222 10.58125 5.27500 7.215385

6 HS.Grad 53.966667 44.34375 54.51667 62.000000


#接下来,创建一个自编函数应用到cast函数中,本次使用到的数据集为iris

fun <- function(x) {

require('fBasics')

n = sum(!is.na(x))

nmiss = sum(is.na(x))

m = mean(x,na.rm = TRUE)

s = sd(x,na.rm = TRUE)

max = max(x,na.rm = TRUE)

min = min(x,na.rm = TRUE)

range = max-min

skew = skewness(x,na.rm = TRUE)

kurto = kurtosis(x,na.rm = TRUE)

return(c(n = n,nmiss = nmiss,mean = m,sd = s,max = max,min = min,range = range,skewness = skew,kurtosis = kurto))

}

#数据“熔化”与汇总

m_iris <- melt(iris)

#为了方便版面显示,这里将输出结果设置为列表格式(注意,我在variable前面加了.|)

summary_result <- cast(m_iris,variable~.|Species,fun)

> summary_result

$setosa

variable n nmiss mean sd max min range skewness kurtosis

1 Sepal.Length 50 0 5.006 0.3524897 5.8 4.3 1.5 0.11297784 -0.4508724

2 Sepal.Width 50 0 3.428 0.3790644 4.4 2.3 2.1 0.03872946 0.5959507

3 Petal.Length 50 0 1.462 0.1736640 1.9 1.0 0.9 0.10009538 0.6539303

4 Petal.Width 50 0 0.246 0.1053856 0.6 0.1 0.5 1.17963278 1.2587179


$versicolor

variable n nmiss mean sd max min range skewness kurtosis

1 Sepal.Length 50 0 5.936 0.5161711 7.0 4.9 2.1 0.09913926 -0.6939138

2 Sepal.Width 50 0 2.770 0.3137983 3.4 2.0 1.4 -0.34136443 -0.5493203

3 Petal.Length 50 0 4.260 0.4699110 5.1 3.0 2.1 -0.57060243 -0.1902555

4 Petal.Width 50 0 1.326 0.1977527 1.8 1.0 0.8 -0.02933377 -0.5873144


$virginica

variable n nmiss mean sd max min range skewness kurtosis

1 Sepal.Length 50 0 6.588 0.6358796 7.9 4.9 3.0 0.1110286 -0.2032597

2 Sepal.Width 50 0 2.974 0.3224966 3.8 2.2 1.6 0.3442849 0.3803832

3 Petal.Length 50 0 5.552 0.5518947 6.9 4.5 2.4 0.5169175 -0.3651161

4 Petal.Width 50 0 2.026 0.2746501 2.5 1.4 1.1 -0.1218119 -0.7539586


如果想选择特定的分析变量,可以通过subset参数实现,一般与%in%联合使用。a%in%b 表示a的元素是否为b的子集

#分析Sepal.Length变量的汇总信息

sl_summary <- cast(m_iris,Species~variable,fun,subset = variable %in% 'Sepal.Length')

> sl_summary

Species Sepal.Length_n Sepal.Length_nmiss Sepal.Length_mean Sepal.Length_sd Sepal.Length_max Sepal.Length_min

1 setosa 50 0 5.006 0.3524897 5.8 4.3

2 versicolor 50 0 5.936 0.5161711 7.0 4.9

3 virginica 50 0 6.588 0.6358796 7.9 4.9

Sepal.Length_range Sepal.Length_skewness Sepal.Length_kurtosis

1 1.5 0.11297784 -0.4508724

2 2.1 0.09913926 -0.6939138

3 3.0 0.11102862 -0.2032597


对于多个变量的分组统计,cast函数中显式公式的左边或右边用+连接多个分组变量

应用:使用随机数函数生成泊松分布的离散变量

#模拟数据的生成

data <- data.frame(x = rpois(100,2),y = rpois(100,3),z = runif(100,10,20))

> head(data)

x y z

1 1 2 14.48573

2 2 3 16.50317

3 2 2 19.27258

4 1 4 18.86011

5 2 4 15.97173

6 3 4 10.88769

#数据“熔化”

m_data <- melt(data,measure.vars = 'z')

> head(m_data)

x y variable value

1 1 2 z 14.48573

2 2 3 z 16.50317

3 2 2 z 19.27258

4 1 4 z 18.86011

5 2 4 z 15.97173

6 3 4 z 10.88769

#多变量分组统计

summary_data <- cast(m_data,x+y~variable,c(mean,min,max,median))

> head(summary_data)

x y z_mean z_min z_max z_median

1 0 0 17.90826 17.90826 17.90826 17.90826

2 0 2 17.68077 17.68077 17.68077 17.68077

3 0 3 14.43726 12.43668 16.96619 13.90891

4 0 4 16.54500 13.90112 19.18888 16.54500

5 0 5 12.87987 12.87987 12.87987 12.87987

6 0 6 16.51866 16.51866 16.51866 16.51866

#需要注意的是,表达式左边的最后一个变量是变化最快的


最后再介绍cast函数中显式公式的几种变形:

#以y的每一个值单独成一个列,统计x分组下的汇总值

reshape1 <- cast(m_data,x~y+variable,mean)

> reshape1

x 0_z 1_z 2_z 3_z 4_z 5_z 6_z 7_z

1 0 17.90826 NaN 17.68077 14.43726 16.54500 12.87987 16.51866 NaN

2 1 15.66487 15.67371 15.74976 14.99201 15.58618 17.01104 11.25401 15.05334

3 2 NaN 14.34325 14.31375 15.82231 16.09901 15.45799 NaN NaN

4 3 13.60504 16.51666 16.03237 17.17504 15.41220 11.25861 11.23605 NaN

5 4 16.86313 14.29701 12.59724 NaN 10.60965 14.90316 NaN NaN

6 5 NaN 18.48920 19.61777 NaN NaN NaN NaN 14.27881

7 7 15.55771 NaN NaN NaN NaN NaN NaN NaN

8 8 NaN NaN 12.95656 NaN NaN NaN NaN NaN

#这里的NaN表示x和y的组合下没有z的值


#用竖线(|)隔开variable和y,返回列表形式,y值为列表的元素,元素内容又以x分组统计

reshape2 <- cast(m_data,x~variable|y,mean)

> reshape2

$`0`

x z

1 0 17.90826

2 1 15.66487

3 3 13.60504

4 4 16.86313

5 7 15.55771


$`1`

x z

1 1 15.67371

2 2 14.34325

3 3 16.51666

4 4 14.29701

5 5 18.48920

...

...

...

您可能也对以下帖子感兴趣

文章有问题?点此查看未经处理的缓存