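Variable selection: stepwise regression & best subsets
When many factors are in play, the variables are often related to one another in various ways, so single-factor (univariate) results tend to be unreliable. The usual remedy is multivariable regression. That in turn raises an unavoidable question: what should be done with variables that are not statistically significant, keep them or drop them? This is the problem of variable selection.
Variable selection is not simply a matter of deleting the non-significant variables from a multivariable regression; it calls for dedicated methods. Subject-matter judgment comes first. On the statistical side, the options include stepwise regression (forward selection, backward elimination, bidirectional selection), best-subset selection, and lasso regression.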
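To make the three stepwise directions concrete before turning to the package signatures, here is a minimal sketch using base R's step(). It is only an illustration under stated assumptions: step() selects by AIC rather than by significance level, and the built-in mtcars data and its variables are stand-ins, not the data used in this post.

```r
# Sketch only: base R step() on the built-in mtcars data (a stand-in dataset).
# step() compares models by AIC, not by p-value.
full <- lm(mpg ~ wt + hp + disp + qsec + drat, data = mtcars)
null <- lm(mpg ~ 1, data = mtcars)
upper_scope <- ~ wt + hp + disp + qsec + drat

# Forward selection: start from the intercept-only model and add terms.
fwd <- step(null, scope = list(lower = ~ 1, upper = upper_scope),
            direction = "forward", trace = 0)

# Backward elimination: start from the full model and drop terms.
bwd <- step(full, direction = "backward", trace = 0)

# Bidirectional selection: terms may be added or dropped at every step.
both <- step(null, scope = list(lower = ~ 1, upper = upper_scope),
             direction = "both", trace = 0)

# Show which terms each strategy ended up with.
print(formula(fwd))
print(formula(bwd))
print(formula(both))
```

The three strategies do not have to agree on the final model, which is one reason subject-matter judgment should come before any automatic procedure.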
Stepwise Linear Model Regression:
stepwise(formula,data,include = NULL,
selection = c("forward", "backward", "bidirection", "score"),
select = c("AIC", "AICc", "BIC", "CP", "HQ", "HQc", "Rsq", "adjRsq", "SL", "SBC"),
sle = 0.15,sls = 0.15,
multivarStat = c("Pillai", "Wilks", "Hotelling-Lawley", "Roy"),
weights = NULL,best = NULL)
Stepwise Logistic Regression:
stepwiseLogit(formula,data,include = NULL,
selection = c("forward", "backward", "bidirection", "score"),
select = c("SL", "AIC", "AICc", "SBC", "HQ", "HQc", "IC(3/2)", "IC(1)"),
sle = 0.15,sls = 0.15,
sigMethod = c("Rao", "LRT"),
weights = NULL,best = NULL)
Stepwise Cox Proportional Hazards Regression:
stepwiseCox(formula,data,include= NULL,
selection = c("forward", "backward", "bidirection", "score"),
select = c("SL", "AIC", "AICc", "SBC", "HQ", "HQc", "IC(3/2)", "IC(1)"),
sle = 0.15,sls = 0.15,
method = c("efron", "breslow", "exact"),
weights = NULL,best = NULL)
Example: the same data as in the earlier post "Probit regression for a binary dependent variable".
library(haven)
fitdata<-read_sav("D:/Temp/logistic_step.sav")
fitdata<-as.data.frame(fitdata)
fitdata<-na.omit(fitdata)
fitdata$race<-factor(fitdata$race,levels=c(1,2,3),labels=c("White","Black","Others"))
fitdata$smoke<-factor(fitdata$smoke,levels=c(0,1),labels=c("No","Yes"))
fitdata$ptl <-ifelse(fitdata$ptl>0,1, 0)
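## Given the sample size, the number of previous preterm births (ptl) is collapsed into a binary variable: the new ptl is 1 when the original ptl > 0, and 0 otherwise (original ptl <= 0)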
fitdata$ptl<-factor(fitdata$ptl,levels=c(0,1),labels=c("No","Yes"))
fitdata$ht<-factor(fitdata$ht,levels=c(0,1),labels=c("No","Yes"))
fitdata$ui<-factor(fitdata$ui,levels=c(0,1),labels=c("No","Yes"))
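The recoding of ptl above can be checked on a toy vector. The counts below are hypothetical values made up for illustration; they are not taken from logistic_step.sav.

```r
# Hypothetical counts of previous preterm births (illustration only).
ptl_raw <- c(0, 1, 0, 2, 3, 0)

# Collapse to a binary indicator: 1 if any previous preterm birth, else 0.
ptl_bin <- ifelse(ptl_raw > 0, 1, 0)

# Attach readable labels, as done for the real variable.
ptl_fac <- factor(ptl_bin, levels = c(0, 1), labels = c("No", "Yes"))

# Cross-tabulate raw counts against the new factor to verify the mapping.
print(table(ptl_raw, ptl_fac))
```

A cross-tabulation of the old and new coding is a cheap way to catch a reversed or mistyped mapping before modelling.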
library(StepReg)
stepwise(bwt~age+lwt+race+smoke+ptl+ht+ui+ftv,data=fitdata,selection="bidirection",select="SL",sle=0.05,sls=0.1)
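## Bidirectional stepwise regression: entry criterion P = 0.05, removal criterion P = 0.1
For how to read the output, see the earlier posts; it is not repeated here.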
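For readers who want to see what an entry threshold and a removal threshold do, the following is a rough sketch of a p-value-driven bidirectional search written with base R's add1() and drop1(). It is not StepReg's implementation: the partial F test, the mtcars stand-in data, and the cap of 20 rounds are all assumptions made for this illustration.

```r
# Sketch only: bidirectional selection by significance level on mtcars.
sle <- 0.05   # a variable may enter if its p-value is below this
sls <- 0.10   # a variable is removed if its p-value rises above this
candidates <- c("wt", "hp", "disp", "qsec", "drat")
current <- lm(mpg ~ 1, data = mtcars)

for (iter in 1:20) {              # iteration cap guards against cycling
  changed <- FALSE

  # Entry step: test each remaining candidate with a partial F test.
  remaining <- setdiff(candidates, attr(terms(current), "term.labels"))
  if (length(remaining) > 0) {
    a <- add1(current, scope = reformulate(remaining), test = "F")
    p_in <- a[["Pr(>F)"]][-1]     # first row is "<none>"
    if (min(p_in) < sle) {
      best <- rownames(a)[-1][which.min(p_in)]
      current <- update(current, as.formula(paste(". ~ . +", best)))
      changed <- TRUE
    }
  }

  # Removal step: re-test every variable currently in the model.
  included <- attr(terms(current), "term.labels")
  if (length(included) > 0) {
    d <- drop1(current, test = "F")
    p_out <- d[["Pr(>F)"]][-1]
    if (max(p_out) > sls) {
      worst <- rownames(d)[-1][which.max(p_out)]
      current <- update(current, as.formula(paste(". ~ . -", worst)))
      changed <- TRUE
    }
  }

  if (!changed) break             # nothing entered or left: stop
}

print(formula(current))
```

Setting the entry threshold no higher than the removal threshold, as in the call above, makes it less likely that a variable enters and is thrown out again in the same round.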
stepwise(bwt~age+lwt+race+smoke+ptl+ht+ui+ftv,data=fitdata,selection="score",select="AIC",best=2)
lmfit<- lm(bwt~age+lwt+race+smoke+ptl+ht+ui,data=fitdata)
summary(lmfit)
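The selection = "score" call above is the best-subset route. As a rough picture of the idea, the sketch below enumerates every subset of a small candidate set and keeps the two lowest-AIC models of each size. It assumes that best limits the number of models reported per subset size, and it uses mtcars as a stand-in; it does not reproduce StepReg's algorithm or output.

```r
# Sketch only: exhaustive best-subset search by AIC on mtcars.
candidates <- c("wt", "hp", "disp", "qsec", "drat")

# Fit every non-empty subset of the candidates and record its AIC.
results <- do.call(rbind, lapply(seq_along(candidates), function(k) {
  combos <- combn(candidates, k, simplify = FALSE)
  data.frame(
    size  = k,
    model = vapply(combos, paste, character(1), collapse = " + "),
    AIC   = vapply(combos, function(v) {
      AIC(lm(reformulate(v, response = "mpg"), data = mtcars))
    }, numeric(1))
  )
}))

# Keep the two best (lowest-AIC) models for each subset size.
top2 <- do.call(rbind, lapply(split(results, results$size), function(d) {
  head(d[order(d$AIC), ], 2)
}))

print(top2, row.names = FALSE)
```

Exhaustive enumeration fits 2^p - 1 models for p candidates, so it is practical only when the candidate list is short.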
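If we instead take low birth weight status (low: 0 = normal, 1 = low birth weight) as the dependent variable, we can fit a binary logistic regression. Stepwise variable selection then proceeds as follows: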
stepwiseLogit(low~age+lwt+race+smoke+ptl+ht+ui+ftv,data=fitdata,selection="bidirection",select="SL",sle=0.05,sls=0.1,sigMethod="LRT")
lgrfit<-glm(low~lwt+ptl+ht, family=binomial(link = logit), data=fitdata)
summary(lgrfit)
confint(lgrfit) ## confidence intervals for the model parameters
exp(cbind(OR=coef(lgrfit),confint(lgrfit))) ## odds ratios (OR) and their confidence intervals
                   OR     2.5 %     97.5 %
(Intercept) 2.2366523 0.4530008 12.4304398
lwt         0.9846164 0.9711283  0.9968232
ptlYes      4.0705166 1.7788055  9.6045086
htYes       6.1852152 1.6008819 27.3558354
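confint() on a glm fit returns profile-likelihood intervals. For comparison, the sketch below computes odds ratios with Wald intervals by hand, exponentiating estimate ± z × standard error. The model am ~ wt on mtcars is a stand-in chosen only so the example runs without the .sav file.

```r
# Sketch only: odds ratios with Wald 95% intervals on a stand-in model.
fit <- glm(am ~ wt, family = binomial(link = "logit"), data = mtcars)

est <- coef(fit)                 # log-odds coefficients
se  <- sqrt(diag(vcov(fit)))     # their standard errors
z   <- qnorm(0.975)              # two-sided 95% normal quantile

# Exponentiate the estimate and both interval limits to get the OR scale.
or_wald <- exp(cbind(OR = est, lower = est - z * se, upper = est + z * se))
print(or_wald)

# The same limits are available directly from confint.default().
print(exp(confint.default(fit)))
```

Wald and profile-likelihood intervals are usually close in large samples and can differ noticeably in small ones, so it is worth stating which kind is reported.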