其他
基于Caret和RandomForest包进行随机森林分析的一般步骤 (1)
Caret构建机器学习流程的一般步骤
Caret依赖trainControl
函数设置交叉验证参数,train
函数具体训练和评估模型。首先是选择一系列需要评估的参数和参数值的组合,然后设置重采样评估方式,循环训练模型评估结果、计算模型的平均性能,根据设定的度量值选择最好的模型参数组合,使用全部训练集和最优参数组合完成模型的最终训练。
基于Caret和RandomForest包进行随机森林分析的一般步骤
createDataPartition
是拆分数据为训练集和测试集的函数。对于分类数据,按照每个类的大小成比例拆分;如果是回归数据,则先把响应值分为n
个区间,再成比例拆分。
# 拆分数据为测试集和训练集
seed <- 1
set.seed(seed)
train_index <- createDataPartition(metadata[[group]], p=0.75, list=F)
train_data <- expr_mat[train_index,]
train_data_group <- metadata[[group]][train_index]
test_data <- expr_mat[-train_index,]
test_data_group <- metadata[[group]][-train_index]
# Create model with default parameters
trControl <- trainControl(method="repeatedcv", number=10, repeats=3)
# train model<
span style="color: rgb(51, 51, 51);font-weight: bold;">if(file.exists('rda/rf_default.rda')){
rf_default <- readRDS("rda/rf_default.rda")
} else { # 设置随机数种子,使得结果可重复
seed <- 1
set.seed(seed)
rf_default <- train(x=train_data, y=train_data_group, method="rf",
trControl=trControl)
saveRDS(rf_default, "rda/rf_default.rda")
}
print(rf_default)
## Random Forest
##
## 59 samples
## 7070 predictors
## 2 classes: 'DLBCL', 'FL'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 53, 53, 54, 53, 53, 54, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.7707937 0.09159664
## 118 0.8914286 0.62016807
## 7069 0.9390476 0.77675070
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 7069.
精确性随默认调参的变化
plot(rf_default)
Kappa值随默认调参的变化
# ?plot.train查看更多使用信息plot(rf_default, metric="Kappa")
#str(rf_default)
Caret比较不同算法的性能
Caret包的流程统一性在这就体现出来了,我之前没有看过ranger
和parRF
包,也不知道他们的具体使用。但我可以借助Caret
很快地用他们构建一个初步模型,并与randomForest
的结果进行比较。
# Too slow
# RRF: Regularized Random Forest
if(file.exists('rda/RRF_default.rda')){
RRF_default <- readRDS("rda/RRF_default.rda")
} else {
set.seed(1)
RRF_default <- train(x=train_data, y=train_data_group, method="RRF",
trControl=trControl)
saveRDS(RRF_default, "rda/RRF_default.rda")
}
RRF_default
## Regularized Random Forest
##
## 59 samples
## 7070 predictors
## 2 classes: 'DLBCL', 'FL'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 53, 53, 54, 53, 53, 54, ...
## Resampling results across tuning parameters:
##
## mtry coefReg coefImp Accuracy Kappa
## 2 0.010 0.0 0.6612698 0.06472400
## 2 0.010 0.5 0.6300000 -0.02007385
## 2 0.010 1.0 0.7317460 0.26162083
## 2 0.505 0.0 0.6984127 0.08957347
## 2 0.505 0.5 0.7711111 0.33546603
## 2 0.505 1.0 0.7436508 0.28044267
## 2 1.000 0.0 0.8269841 0.48410091
## 2 1.000 0.5 0.8450794 0.47661766
## 2 1.000 1.0 0.7195238 0.22447669
## 118 0.010 0.0 0.6668254 0.06598972
## 118 0.010 0.5 0.7785714 0.36287284
## 118 0.010 1.0 0.8485714 0.55917306
## 118 0.505 0.0 0.8036508 0.42708196
## 118 0.505 0.5 0.8422222 0.52480669
## 118 0.505 1.0 0.8388889 0.53413165
## 118 1.000 0.0 0.9061905 0.70898014
## 118 1.000 0.5 0.8550794 0.55650040
## 118 1.000 1.0 0.8101587 0.49044569
## 7069 0.010 0.0 0.7260317 0.29956225
## 7069 0.010 0.5 0.7250794 0.24852437
## 7069 0.010 1.0 0.7390476 0.28045078
## 7069 0.505 0.0 0.7982540 0.43476213
## 7069 0.505 0.5 0.7560317 0.31467344
## 7069 0.505 1.0 0.7457143 0.28034256
## 7069 1.000 0.0 0.8950794 0.65746499
## 7069 1.000 0.5 0.8600000 0.55006709
## 7069 1.000 1.0 0.7715873 0.34952193
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 118, coefReg = 1 and coefImp
## = 0.
ranger
是random forest
的快速版本,特别适用于高维数据,支持分类、回归和生存分析。
# ranger
# ranger is a fast implementation of random forests (Breiman 2001) or recursive partitioning, particularly suited for high dimensional data. Classification, regression, and survival forests are supported. Classification and regression forests are implemented as in the original Random Forest (Breiman 2001), survival forests as in Random Survival Forests (Ishwaran et al. 2008).
if(file.exists('rda/ranger_default.rda')){
ranger_default <- readRDS("rda/ranger_default.rda")
} else {
set.seed(1)
ranger_default <- train(x=train_data, y=train_data_group, method="ranger",
trControl=trControl)
saveRDS(ranger_default, "rda/ranger_default.rda")
}
ranger_default
## Random Forest
##
## 59 samples
## 7070 predictors
## 2 classes: 'DLBCL', 'FL'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 52, 52, 54, 53, 53, 53, ...
## Resampling results across tuning parameters:
##
## mtry splitrule Accuracy Kappa
## 2 gini 0.7626984 0.05238095
## 2 extratrees 0.7504762 0.00000000
## 118 gini 0.8777778 0.58852454
## 118 extratrees 0.8441270 0.42128852
## 7069 gini 0.9434921 0.78263305
## 7069 extratrees 0.9200000 0.73072829
##
## Tuning parameter 'min.node.size' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 7069, splitrule = gini
## and min.node.size = 1.
parRF
是并行随机森林算法。
# parRF: Parallel Random Forest
if(file.exists('rda/parRF_default.rda')){
parRF_default <- readRDS("rda/parRF_default.rda")
} else { library(doParallel)
cl <- makePSOCKcluster(2)
registerDoParallel(cl)
set.seed(1)
parRF_default <- train(x=train_data, y=train_data_group, method="parRF",
trControl=trControl) ## When you are done:
stopCluster(cl)
saveRDS(parRF_default, "rda/parRF_default.rda")
}
parRF_default
## Parallel Random Forest
##
## 59 samples
## 7070 predictors
## 2 classes: 'DLBCL', 'FL'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 52, 54, 54, 53, 52, 53, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.7560317 0.01904762
## 118 0.8615873 0.53526865
## 7069 0.9161905 0.72790935
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 7069.
合并三个包的模型结果
resamps <- resamples(list(RRF = RRF_default,
rf = rf_default,
parRF = parRF_default,
ranger = ranger_default))
resamps
##
## Call:
## resamples.default(x = list(RRF = RRF_default, rf = rf_default, parRF
## = parRF_default, ranger = ranger_default))
##
## Models: RRF, rf, parRF, ranger
## Number of resamples: 30
## Performance metrics: Accuracy, Kappa
## Time estimates for: everything, final model fit
比较绘制三个包的性能差异
theme1 <- trellis.par.get()
theme1$plot.symbol$col = rgb(.2, .2, .2, .4)
theme1$plot.symbol$pch = 16
theme1$plot.line$col = rgb(1, 0, 0, .7)
theme1$plot.line$lwd <- 2
trellis.par.set(theme1)
# 这个结果时对时错,对Kappa值很高估,还没看什么原因
bwplot(resamps)
这个结果跟输出的矩阵是吻合的
dotplot(resamps)
统计检验评估模型之间差异是否显著
diff.value <- diff(resamps)
summary(diff.value)
##
## Call:
## summary.diff.resamples(object = diff.value)
##
## p-value adjustment: bonferroni
## Upper diagonal: estimates of the difference
## Lower diagonal: p-value for H0: difference = 0
##
## Accuracy
## RRF rf parRF ranger
## RRF -0.032857 -0.010000 -0.037302
## rf 0.5193 0.022857 -0.004444
## parRF 1.0000 1.0000 -0.027302
## ranger 1.0000 1.0000 1.0000
##
## Kappa
## RRF rf parRF ranger
## RRF -0.067771 -0.018929 -0.073653
## rf 1 0.048841 -0.005882
## parRF 1 1 -0.054724
## ranger 1 1 1
Caret训练最终模型
if(file.exists('rda/rf_finaltest.rda')){
rf_final <- readRDS("rda/rf_finaltest.rda")
} else { # method="none" 不再抽样,用全部数据训练模型
trControl <- trainControl(method="none", classProbs = T) # 设置随机数种子,使得结果可重复
seed <- 1
set.seed(seed)
rf_final <- train(x=train_data, y=train_data_group, method="rf",
# 用最好的模型的调参值
tuneGrid = rf_default$bestTune,
trControl=trControl)
saveRDS(rf_final, "rda/rf_finaltest.rda")
}
print(rf_final)
## Random Forest
##
## 59 samples
## 7070 predictors
## 2 classes: 'DLBCL', 'FL'
##
## No pre-processing
## Resampling: None
基于模型对测试集进行预测
# ?predict.train
# type: raw (返回预测的类或回归的数值) or prob (返回分类到每个类的概率)
predict(rf_final, newdata=head(test_data))
## [1] DLBCL DLBCL DLBCL DLBCL DLBCL DLBCL
## Levels: DLBCL FL
虽然都能被分类到DLBCL
,但分类概率是不同的,如DLBCL_16
就在边缘徘徊。
# ?predict.train
# type: raw (返回预测的类或回归的数值) or prob (返回分类到每个类的概率)
predict(rf_final, newdata=head(test_data), type="prob")
## DLBCL FL
## DLBCL_2 0.960 0.040
## DLBCL_5 0.934 0.066
## DLBCL_11 0.742 0.258
## DLBCL_13 0.892 0.108
## DLBCL_16 0.866 0.134
## DLBCL_17 0.838 0.162
获得模型结果评估矩阵(confusion matrix
)
predictions <- predict(rf_final, newdata=test_data)
confusionMatrix(predictions, test_data_group)
## Confusion Matrix and Statistics
##
## Reference
## Prediction DLBCL FL
## DLBCL 14 2
## FL 0 2
##
## Accuracy : 0.8889
## 95% CI : (0.6529, 0.9862)
## No Information Rate : 0.7778
## P-Value [Acc > NIR] : 0.2022
##
## Kappa : 0.6087
##
## Mcnemar's Test P-Value : 0.4795
##
## Sensitivity : 1.0000
## Specificity : 0.5000
## Pos Pred Value : 0.8750
## Neg Pred Value : 1.0000
## Prevalence : 0.7778
## Detection Rate : 0.7778
## Detection Prevalence : 0.8889
## Balanced Accuracy : 0.7500
##
## 'Positive' Class : DLBCL
##
绘制ROC曲线
prediction_prob <- predict(rf_final, newdata=test_data, type="prob")library(pROC)
roc <- roc(test_data_group, prediction_prob[,1])
roc
##
## Call:
## roc.default(response = test_data_group, predictor = prediction_prob[, 1])
##
## Data: prediction_prob[, 1] in 14 controls (test_data_group DLBCL) > 4 cases (test_data_group FL).
## Area under the curve: 0.9821
ROC_data <- data.frame(FPR = 1- roc$specificities, TPR=roc$sensitivities)
ROC_data <- ROC_data[order(ROC_data$FPR),]
p <- ggplot(data=ROC_data, mapping=aes(x=FPR, y=TPR)) +
geom_step(color="red", size=1, direction = "mid") +
geom_segment(aes(x=0, xend=1, y=0, yend=1)) + theme_classic() +
xlab("False positive rate") +
ylab("True positive rate") + coord_fixed(1) + xlim(0,1) + ylim(0,1) +
annotate('text', x=0.5, y=0.25, label=paste('AUC=', round(roc$auc,2)))
p
机器学习系列教程
从随机森林开始,一步步理解决策树、随机森林、ROC/AUC、数据集、交叉验证的概念和实践。
文字能说清的用文字、图片能展示的用、描述不清的用公式、公式还不清楚的写个简单代码,一步步理清各个环节和概念。
再到成熟代码应用、模型调参、模型比较、模型评估,学习整个机器学习需要用到的知识和技能。