mlr3:pipelines
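mlr3pipelines is a dataflow programming toolkit. A complete machine learning workflow is called a Graph (or pipeline) and can include data preprocessing, model fitting, and the comparison of multiple models. Different models need different preprocessing, and a workflow may also involve ensemble learning or various nonlinear models; all of this can be handled with mlr3pipelines.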
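R has many packages for data preprocessing, such as caret and recipes. mlr3pipelines takes a different approach and expresses the workflow as a graph through which the data flows.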
pipeops
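In mlr3pipelines the individual preprocessing methods are called pipeops. They currently cover most common preprocessing methods, such as one-hot encoding, sparse matrices, missing-value handling, dimensionality reduction, standardization, and collapsing factor levels.
Pipeops can connect preprocessing to a model or build more complex modeling steps, for example several different preprocessing routes feeding several different models.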
List all pipeops:
library(mlr3pipelines)
as.data.table(mlr_pipeops) # 64 pipeops at the time of writing
## key packages
## 1: boxcox mlr3pipelines,bestNormalize
## 2: branch mlr3pipelines
## 3: chunk mlr3pipelines
## 4: classbalancing mlr3pipelines
## 5: classifavg mlr3pipelines,stats
## 6: classweights mlr3pipelines
## 7: colapply mlr3pipelines
## 8: collapsefactors mlr3pipelines
## 9: colroles mlr3pipelines
## 10: copy mlr3pipelines
## 11: datefeatures mlr3pipelines
## 12: encode mlr3pipelines,stats
## 13: encodeimpact mlr3pipelines
## 14: encodelmer mlr3pipelines,lme4,nloptr
## 15: featureunion mlr3pipelines
## 16: filter mlr3pipelines
## 17: fixfactors mlr3pipelines
## 18: histbin mlr3pipelines,graphics
## 19: ica mlr3pipelines,fastICA
## 20: imputeconstant mlr3pipelines
## 21: imputehist mlr3pipelines,graphics
## 22: imputelearner mlr3pipelines
## 23: imputemean mlr3pipelines
## 24: imputemedian mlr3pipelines,stats
## 25: imputemode mlr3pipelines
## 26: imputeoor mlr3pipelines
## 27: imputesample mlr3pipelines
## 28: kernelpca mlr3pipelines,kernlab
## 29: learner mlr3pipelines
## 30: learner_cv mlr3pipelines
## 31: missind mlr3pipelines
## 32: modelmatrix mlr3pipelines,stats
## 33: multiplicityexply mlr3pipelines
## 34: multiplicityimply mlr3pipelines
## 35: mutate mlr3pipelines
## 36: nmf mlr3pipelines,MASS,NMF
## 37: nop mlr3pipelines
## 38: ovrsplit mlr3pipelines
## 39: ovrunite mlr3pipelines
## 40: pca mlr3pipelines
## 41: proxy mlr3pipelines
## 42: quantilebin mlr3pipelines,stats
## 43: randomprojection mlr3pipelines
## 44: randomresponse mlr3pipelines
## 45: regravg mlr3pipelines
## 46: removeconstants mlr3pipelines
## 47: renamecolumns mlr3pipelines
## 48: replicate mlr3pipelines
## 49: scale mlr3pipelines
## 50: scalemaxabs mlr3pipelines
## 51: scalerange mlr3pipelines
## 52: select mlr3pipelines
## 53: smote mlr3pipelines,smotefamily
## 54: spatialsign mlr3pipelines
## 55: subsample mlr3pipelines
## 56: targetinvert mlr3pipelines
## 57: targetmutate mlr3pipelines
## 58: targettrafoscalerange mlr3pipelines
## 59: textvectorizer mlr3pipelines,quanteda,stopwords
## 60: threshold mlr3pipelines
## 61: tunethreshold mlr3pipelines,bbotk
## 62: unbranch mlr3pipelines
## 63: vtreat mlr3pipelines,vtreat
## 64: yeojohnson mlr3pipelines,bestNormalize
## key packages
## tags
## 1: data transform
## 2: meta
## 3: meta
## 4: imbalanced data,data transform
## 5: ensemble
## 6: imbalanced data,data transform
## 7: data transform
## 8: data transform
## 9: data transform
## 10: meta
## 11: data transform
## 12: encode,data transform
## 13: encode,data transform
## 14: encode,data transform
## 15: ensemble
## 16: feature selection,data transform
## 17: robustify,data transform
## 18: data transform
## 19: data transform
## 20: missings
## 21: missings
## 22: missings
## 23: missings
## 24: missings
## 25: missings
## 26: missings
## 27: missings
## 28: data transform
## 29: learner
## 30: learner,ensemble,data transform
## 31: missings,data transform
## 32: data transform
## 33: multiplicity
## 34: multiplicity
## 35: data transform
## 36: data transform
## 37: meta
## 38: target transform,multiplicity
## 39: multiplicity,ensemble
## 40: data transform
## 41: meta
## 42: data transform
## 43: data transform
## 44: abstract
## 45: ensemble
## 46: robustify,data transform
## 47: data transform
## 48: multiplicity
## 49: data transform
## 50: data transform
## 51: data transform
## 52: feature selection,data transform
## 53: imbalanced data,data transform
## 54: data transform
## 55: data transform
## 56: abstract
## 57: target transform
## 58: target transform
## 59: data transform
## 60: target transform
## 61: target transform
## 62: meta
## 63: encode,missings,data transform
## 64: data transform
## tags
## feature_types input.num output.num
## 1: numeric,integer 1 1
## 2: NA 1 NA
## 3: NA 1 NA
## 4: logical,integer,numeric,character,factor,ordered,... 1 1
## 5: NA NA 1
## 6: logical,integer,numeric,character,factor,ordered,... 1 1
## 7: logical,integer,numeric,character,factor,ordered,... 1 1
## 8: factor,ordered 1 1
## 9: logical,integer,numeric,character,factor,ordered,... 1 1
## 10: NA 1 NA
## 11: POSIXct 1 1
## 12: factor,ordered 1 1
## 13: factor,ordered 1 1
## 14: factor,ordered 1 1
## 15: NA NA 1
## 16: logical,integer,numeric,character,factor,ordered,... 1 1
## 17: factor,ordered 1 1
## 18: numeric,integer 1 1
## 19: numeric,integer 1 1
## 20: logical,integer,numeric,character,factor,ordered,... 1 1
## 21: integer,numeric 1 1
## 22: logical,factor,ordered 1 1
## 23: numeric,integer 1 1
## 24: numeric,integer 1 1
## 25: factor,integer,logical,numeric,ordered 1 1
## 26: character,factor,integer,numeric,ordered 1 1
## 27: factor,integer,logical,numeric,ordered 1 1
## 28: numeric,integer 1 1
## 29: NA 1 1
## 30: logical,integer,numeric,character,factor,ordered,... 1 1
## 31: logical,integer,numeric,character,factor,ordered,... 1 1
## 32: logical,integer,numeric,character,factor,ordered,... 1 1
## 33: NA 1 NA
## 34: NA NA 1
## 35: logical,integer,numeric,character,factor,ordered,... 1 1
## 36: numeric,integer 1 1
## 37: NA 1 1
## 38: NA 1 1
## 39: NA 1 1
## 40: numeric,integer 1 1
## 41: NA NA 1
## 42: numeric,integer 1 1
## 43: numeric,integer 1 1
## 44: NA 1 1
## 45: NA NA 1
## 46: logical,integer,numeric,character,factor,ordered,... 1 1
## 47: logical,integer,numeric,character,factor,ordered,... 1 1
## 48: NA 1 1
## 49: numeric,integer 1 1
## 50: numeric,integer 1 1
## 51: numeric,integer 1 1
## 52: logical,integer,numeric,character,factor,ordered,... 1 1
## 53: logical,integer,numeric,character,factor,ordered,... 1 1
## 54: numeric,integer 1 1
## 55: logical,integer,numeric,character,factor,ordered,... 1 1
## 56: NA 2 1
## 57: NA 1 2
## 58: NA 1 2
## 59: character 1 1
## 60: NA 1 1
## 61: NA 1 1
## 62: NA NA 1
## 63: logical,integer,numeric,character,factor,ordered,... 1 1
## 64: numeric,integer 1 1
## feature_types input.num output.num
## input.type.train input.type.predict output.type.train output.type.predict
## 1: Task Task Task Task
## 2: * * * *
## 3: Task Task Task Task
## 4: TaskClassif TaskClassif TaskClassif TaskClassif
## 5: NULL PredictionClassif NULL PredictionClassif
## 6: TaskClassif TaskClassif TaskClassif TaskClassif
## 7: Task Task Task Task
## 8: Task Task Task Task
## 9: Task Task Task Task
## 10: * * * *
## 11: Task Task Task Task
## 12: Task Task Task Task
## 13: Task Task Task Task
## 14: Task Task Task Task
## 15: Task Task Task Task
## 16: Task Task Task Task
## 17: Task Task Task Task
## 18: Task Task Task Task
## 19: Task Task Task Task
## 20: Task Task Task Task
## 21: Task Task Task Task
## 22: Task Task Task Task
## 23: Task Task Task Task
## 24: Task Task Task Task
## 25: Task Task Task Task
## 26: Task Task Task Task
## 27: Task Task Task Task
## 28: Task Task Task Task
## 29: TaskClassif TaskClassif NULL PredictionClassif
## 30: TaskClassif TaskClassif TaskClassif TaskClassif
## 31: Task Task Task Task
## 32: Task Task Task Task
## 33: [*] [*] * *
## 34: * * [*] [*]
## 35: Task Task Task Task
## 36: Task Task Task Task
## 37: * * * *
## 38: TaskClassif TaskClassif [TaskClassif] [TaskClassif]
## 39: [NULL] [PredictionClassif] NULL PredictionClassif
## 40: Task Task Task Task
## 41: * * * *
## 42: Task Task Task Task
## 43: Task Task Task Task
## 44: NULL Prediction NULL Prediction
## 45: NULL PredictionRegr NULL PredictionRegr
## 46: Task Task Task Task
## 47: Task Task Task Task
## 48: * * [*] [*]
## 49: Task Task Task Task
## 50: Task Task Task Task
## 51: Task Task Task Task
## 52: Task Task Task Task
## 53: Task Task Task Task
## 54: Task Task Task Task
## 55: Task Task Task Task
## 56: NULL,NULL function,Prediction NULL Prediction
## 57: Task Task NULL,Task function,Task
## 58: TaskRegr TaskRegr NULL,TaskRegr function,TaskRegr
## 59: Task Task Task Task
## 60: NULL PredictionClassif NULL PredictionClassif
## 61: Task Task NULL Prediction
## 62: * * * *
## 63: Task Task Task Task
## 64: Task Task Task Task
## input.type.train input.type.predict output.type.train output.type.predict
That is a long list of preprocessing methods, but only about ten of them are in common use.
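Because the listing is a regular data.table, it can be filtered rather than read from top to bottom. The following is a minimal sketch, assuming that `tags` is stored as a list column of character vectors (which the comma-separated printout above suggests); the variable names are only illustrative.

```r
library(mlr3pipelines)
library(data.table)

pipeop_table <- as.data.table(mlr_pipeops)

# keep only the pipeops whose tags include "missings"
is_missing_op <- vapply(pipeop_table$tags, function(x) "missings" %in% x, logical(1))
missing_keys <- pipeop_table$key[is_missing_op]
print(missing_keys)
```

The same pattern works for any other tag shown in the table, such as "ensemble" or "encode".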
A preprocessing step can be created as follows:
pca <- mlr_pipeops$get("pca")
# or use the shorthand
pca <- po("pca")
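A pipeop created this way can also be tried out on its own before it is wired into a graph. The sketch below assumes the built-in iris task; note that a single pipeop takes a list of inputs and returns a list of outputs.

```r
library(mlr3)
library(mlr3pipelines)

pca <- po("pca")
print(pca$is_trained) # FALSE before training

# train the pipeop directly: list in, list out
out <- pca$train(list(tsk("iris")))[[1]]
print(out$feature_names) # principal components replace the original features
print(pca$is_trained)
```

Inspecting the output task like this is a quick way to see what a step does to the features.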
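Importantly, the same mechanism creates not only preprocessing steps but also learners (algorithms), feature selection methods, and so on:
# choose a learner/algorithm
library(mlr3)
learner <- po("learner", lrn("classif.rpart"))
# choose a feature selection method and set its parameters
filter <- po("filter",
  filter = mlr3filters::flt("variance"),
  filter.frac = 0.5
)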
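The pipe operator in mlr3pipelines: %>>%
This dedicated pipe operator, created by the mlr3 team, connects preprocessing steps to one another, preprocessing to models, and so on: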
gr <- po("scale") %>>% po("pca")
gr$plot(html = FALSE)
Many powerful operations are built on this operator.
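To see what the operator actually builds, the resulting Graph can be inspected without plotting. A small sketch, using the same two steps as above:

```r
library(mlr3pipelines)

gr <- po("scale") %>>% po("pca")

print(gr$ids())   # ids of the pipeops in the graph
print(gr$edges)   # one row per connection between pipeops
```

Each use of `%>>%` adds an edge from the output channel of the left-hand side to the input channel of the right-hand side.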
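Building a model
A simple example: preprocess the data first, then train.
# connect preprocessing and a model, somewhat like a tidymodels workflow
mutate <- po("mutate")
filter <- po("filter",
  filter = mlr3filters::flt("variance"),
  param_vals = list(filter.frac = 0.5))
graph <- mutate %>>%
  filter %>>%
  po("learner", learner = lrn("classif.rpart"))
This graph now behaves like a learner with built-in preprocessing steps, so it can be trained and used for prediction directly, as described earlier: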
task <- tsk("iris")
graph$train(task)
## $classif.rpart.output
## NULL
Prediction:
graph$predict(task)
## $classif.rpart.output
## <PredictionClassif> for 150 observations:
## row_ids truth response
## 1 setosa setosa
## 2 setosa setosa
## 3 setosa setosa
## ---
## 148 virginica virginica
## 149 virginica virginica
## 150 virginica virginica
In addition, the graph can be converted into a GraphLearner object for use with resample, benchmark, and so on:
glrn <- as_learner(graph) # convert to a GraphLearner
cv5 <- rsmp("cv", folds = 5)
resample(task, glrn, cv5)
## INFO [21:01:53.145] [mlr3] Applying learner 'mutate.variance.classif.rpart' on task 'iris' (iter 2/5)
## INFO [21:01:53.230] [mlr3] Applying learner 'mutate.variance.classif.rpart' on task 'iris' (iter 5/5)
## INFO [21:01:53.291] [mlr3] Applying learner 'mutate.variance.classif.rpart' on task 'iris' (iter 1/5)
## INFO [21:01:53.350] [mlr3] Applying learner 'mutate.variance.classif.rpart' on task 'iris' (iter 4/5)
## INFO [21:01:53.407] [mlr3] Applying learner 'mutate.variance.classif.rpart' on task 'iris' (iter 3/5)
## <ResampleResult> of 5 iterations
## * Task: iris
## * Learner: mutate.variance.classif.rpart
## * Warnings: 0 in 0 iterations
## * Errors: 0 in 0 iterations
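The printed ResampleResult does not show any performance value yet. The sketch below rebuilds the same graph and aggregates the accuracy over the folds; the seed and the choice of `classif.acc` are assumptions made for illustration.

```r
library(mlr3)
library(mlr3pipelines)
library(mlr3filters)

lgr::get_logger("mlr3")$set_threshold("warn") # silence the per-fold log lines

graph <- po("mutate") %>>%
  po("filter", filter = flt("variance"), filter.frac = 0.5) %>>%
  po("learner", lrn("classif.rpart"))
glrn <- as_learner(graph)

set.seed(42)
rr <- resample(tsk("iris"), glrn, rsmp("cv", folds = 5))

acc <- rr$aggregate(msr("classif.acc")) # mean accuracy over the 5 folds
per_fold <- rr$score(msr("classif.acc"))$classif.acc
print(acc)
print(per_fold)
```

Because the preprocessing lives inside the GraphLearner, it is refit on the training part of every fold, so no information leaks from the test part.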
Many preprocessing steps also have parameters that need tuning. With mlr3pipelines you can tune the preprocessing parameters together with the learner's hyperparameters.
library(paradox)
search_space <- ps(
  classif.rpart.cp = p_dbl(0, 0.05), # learner hyperparameter
  variance.filter.frac = p_dbl(0.25, 1) # feature selection parameter
)
library(mlr3tuning)
instance <- TuningInstanceSingleCrit$new(
  task = task,
  learner = glrn,
  resampling = rsmp("holdout", ratio = 0.7),
  measure = msr("classif.acc"),
  search_space = search_space,
  terminator = trm("evals", n_evals = 20)
)
tuner <- tnr("random_search")
lgr::get_logger("mlr3")$set_threshold("warn")
lgr::get_logger("bbotk")$set_threshold("warn")
tuner$optimize(instance)
## classif.rpart.cp variance.filter.frac learner_param_vals x_domain
## 1: 0.02162802 0.3852356 <list[5]> <list[2]>
## classif.acc
## 1: 0.9777778
instance$result_y
## classif.acc
## 0.9777778
instance$result_learner_param_vals
## $mutate.mutation
## list()
##
## $mutate.delete_originals
## [1] FALSE
##
## $variance.filter.frac
## [1] 0.3852356
##
## $classif.rpart.xval
## [1] 0
##
## $classif.rpart.cp
## [1] 0.02162802
The result reports both the learner's hyperparameter and the feature selection parameter.
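The names used in the search space must match the parameter ids of the GraphLearner, which are prefixed with the id of the pipeop they belong to. A short sketch for looking them up, rebuilding the same graph as above:

```r
library(mlr3)
library(mlr3pipelines)
library(mlr3filters)

graph <- po("mutate") %>>%
  po("filter", filter = flt("variance"), filter.frac = 0.5) %>>%
  po("learner", lrn("classif.rpart"))
glrn <- as_learner(graph)

param_ids <- glrn$param_set$ids()
# parameters of the filter step and of the decision tree
print(grep("^variance\\.", param_ids, value = TRUE))
print(grep("^classif\\.rpart\\.", param_ids, value = TRUE))
```

This is where names such as `variance.filter.frac` and `classif.rpart.cp` come from.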
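Nonlinear graphs
Branching: one node leads to several branches, which is useful, for example, when comparing several feature selection methods. Only one path is executed.
Copying: one node leads to several branches and all of them are executed, but only one branch at a time; parallel execution is not supported yet.
Stacking: individual graphs are stacked on top of each other, so the output of one graph is the input of the next.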
branching & copying
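Branching is implemented with PipeOpBranch and PipeOpUnbranch. The concept is shown in the figure below:
The following example demonstrates branching. Every branch must be closed again with an unbranch step:
graph <- po("branch", c("nop", "pca", "scale")) %>>% # open the branches
  gunion(list(
    po("nop", id = "null1"), # branch 1, given the id null1
    po("pca"), # branch 2
    po("scale") # branch 3
  )) %>>%
  po("unbranch", c("nop", "pca", "scale")) # close the branches
graph$plot(html = FALSE)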
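Which path is executed is controlled by the `selection` parameter of the branch pipeop, exposed on the graph as `branch.selection`. The sketch below rebuilds the same graph, activates the "pca" branch, and checks the features of the resulting task; the iris task is used only for illustration.

```r
library(mlr3)
library(mlr3pipelines)

graph <- po("branch", c("nop", "pca", "scale")) %>>%
  gunion(list(
    po("nop", id = "null1"),
    po("pca"),
    po("scale")
  )) %>>%
  po("unbranch", c("nop", "pca", "scale"))

# activate the "pca" branch; the other two branches are skipped
graph$param_set$values$branch.selection <- "pca"

out <- graph$train(tsk("iris"))[[1]]
print(out$feature_names)
```

Since `branch.selection` is an ordinary parameter, it can also be placed in a search space so that the tuner chooses the preprocessing route.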
bagging
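Bagging is a form of ensemble learning. The concept is not covered here (interested readers can look it up), but it is illustrated in the figure below:
Basic usage is shown below.
single_pred <- po("subsample", frac = 0.7) %>>%
  po("learner", lrn("classif.rpart")) # build a single model
pred_set <- ppl("greplicate", single_pred, 10L) # replicate it 10 times
bagging <- pred_set %>>%
  po("classifavg", innum = 10)
bagging$plot(html = FALSE)
Convert the object above into a GraphLearner, and it is ready for training and prediction:
task <- tsk("iris")
split <- partition(task, ratio = 0.7, stratify = TRUE)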
baglrn <- as_learner(bagging)
baglrn$train(task, row_ids = split$train)
baglrn$predict(task, row_ids = split$test)
## <PredictionClassif> for 45 observations:
## row_ids truth response prob.setosa prob.versicolor prob.virginica
## 4 setosa setosa 1 0 0
## 6 setosa setosa 1 0 0
## 8 setosa setosa 1 0 0
## ---
## 141 virginica virginica 0 0 1
## 147 virginica virginica 0 0 1
## 150 virginica virginica 0 0 1
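To turn the predictions into a single number, the prediction object can be scored. The following self-contained sketch repeats the bagging setup and reports test accuracy; the seed, the plain (unstratified) `partition()` call, and the measure are assumptions for illustration.

```r
library(mlr3)
library(mlr3pipelines)

set.seed(1)
task <- tsk("iris")
split <- partition(task, ratio = 0.7)

single_pred <- po("subsample", frac = 0.7) %>>%
  po("learner", lrn("classif.rpart"))
bagging <- ppl("greplicate", single_pred, 10L) %>>%
  po("classifavg", innum = 10)
baglrn <- as_learner(bagging)

baglrn$train(task, row_ids = split$train)
pred <- baglrn$predict(task, row_ids = split$test)

acc <- pred$score(msr("classif.acc")) # accuracy on the held-out rows
print(acc)
```

The exact value depends on the random split and the subsamples, so it will vary with the seed.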
stacking
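Stacking is another way to improve model performance. The concept is shown in the figure below:
To guard against overfitting, PipeOpLearnerCV is used here to predict on out-of-bag (held-out) data; it automatically performs the nested resampling inside the data.
First create the level 0 learner from a clone of the base learner and give it an id:
lrn <- lrn("classif.rpart")
lrn_0 <- po("learner_cv", lrn$clone())
lrn_0$id <- "rpart_cv"
Then combine gunion and PipeOpNOP to pass the untouched task on to the next level, so that the task processed by the decision tree and the unprocessed task reach the next level together.
level_0 <- gunion(list(lrn_0, po("nop")))
Merge what was passed down:
combined <- level_0 %>>% po("featureunion", 2)
stack <- combined %>>% po("learner", lrn$clone())
stack$plot(html = FALSE)
Now the model can be trained and used for prediction: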
stacklrn <- as_learner(stack)
stacklrn$train(task, split$train)
stacklrn$predict(task, split$test)
## <PredictionClassif> for 45 observations:
## row_ids truth response
## 4 setosa setosa
## 6 setosa setosa
## 8 setosa setosa
## ---
## 141 virginica virginica
## 147 virginica virginica
## 150 virginica virginica
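It can help to look at the task that the level 1 learner actually receives. The sketch below trains only the first part of the stack (level 0 plus the feature union) and prints the resulting feature names; it passes the id directly to `po()` instead of setting it afterwards, which is just a shorthand.

```r
library(mlr3)
library(mlr3pipelines)

lgr::get_logger("mlr3")$set_threshold("warn")

level_0 <- gunion(list(
  po("learner_cv", lrn("classif.rpart"), id = "rpart_cv"),
  po("nop")
))
combined <- level_0 %>>% po("featureunion", 2)

# training returns the task that the level 1 learner would be fitted on
stacked_task <- combined$train(tsk("iris"))[[1]]
print(stacked_task$feature_names)
```

The cross-validated prediction of the tree appears as an extra feature next to the four original ones.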
A much more complex example
This example combines several different preprocessing steps with several different algorithms.
library("magrittr")
library("mlr3learners")
rprt = lrn("classif.rpart", predict_type = "prob")
glmn = lrn("classif.glmnet", predict_type = "prob")
# create the learners
lrn_0 = po("learner_cv", rprt, id = "rpart_cv_1")
lrn_0$param_set$values$maxdepth = 5L
lrn_1 = po("pca", id = "pca1") %>>% po("learner_cv", rprt, id = "rpart_cv_2")
lrn_1$param_set$values$rpart_cv_2.maxdepth = 1L
lrn_2 = po("pca", id = "pca2") %>>% po("learner_cv", glmn)
# level 0
level_0 = gunion(list(lrn_0, lrn_1, lrn_2, po("nop", id = "NOP1")))
# level 1
level_1 = level_0 %>>%
  po("featureunion", 4) %>>%
  po("copy", 3) %>>%
  gunion(list(
    po("learner_cv", rprt, id = "rpart_cv_l1"),
    po("learner_cv", glmn, id = "glmnt_cv_l1"),
    po("nop", id = "NOP_l1")
  ))
# level 2
level_2 = level_1 %>>%
  po("featureunion", 3, id = "u2") %>>%
  po("learner", rprt, id = "rpart_l2")
level_2$plot(html = FALSE)
Now it can be trained and used for prediction:
task = tsk("iris")
lrn = as_learner(level_2)
lrn$
train(task, split$train)$
predict(task, split$test)$
score()
## classif.ce
## 0.08888889
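Some special preprocessing steps
In practice these are very commonly used steps.
Missing-value handling: PipeOpImpute
Missing values are extremely common, and mlr3pipelines can handle them in both numeric and factor features.
pom <- po("missind")
pon <- po("imputehist", # histogram imputation for numeric features
  id = "impute_num", # give it an id
  affect_columns = selector_type(c("numeric", "integer")) # which columns to process
)
pof = po("imputeoor", id = "imputer_fct", affect_columns = selector_type("factor")) # handle factors
imputer = pom %>>% pon %>>% pof
Connect a learner: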
polrn <- po("learner", lrn("classif.rpart"))
lrn <- as_learner(imputer %>>% polrn)
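To check that an imputation step really removes the missing values, it can be applied to a task that has some. The sketch below uses only the histogram imputer (not the full three-step imputer above) on the pima task, which ships with mlr3 and contains missing values in several numeric features.

```r
library(mlr3)
library(mlr3pipelines)

task <- tsk("pima")
print(task$missings()) # missing values per column before imputation

imputed <- po("imputehist")$train(list(task))[[1]]
print(imputed$missings()) # missing values per column after imputation
```

`missings()` returns a count per column, so a vector of zeros after training means that every gap was filled.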
Creating new variables: PipeOpMutate
pom <- po("mutate",
  mutation = list(
    Sepal.Sum = ~ Sepal.Length + Sepal.Width,
    Petal.Sum = ~ Petal.Length + Petal.Width,
    Sepal.Petal.Ratio = ~ (Sepal.Length / Petal.Length)
  )
)
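The effect of the mutation can be verified by training the pipeop on a task and looking at the new columns. A minimal sketch on the iris task, using two of the formulas from above:

```r
library(mlr3)
library(mlr3pipelines)

pom <- po("mutate",
  mutation = list(
    Sepal.Sum = ~ Sepal.Length + Sepal.Width,
    Petal.Sum = ~ Petal.Length + Petal.Width
  )
)

out <- pom$train(list(tsk("iris")))[[1]]
print(out$feature_names)
print(head(out$data(cols = c("Sepal.Length", "Sepal.Width", "Sepal.Sum")), 3))
```

The original features are kept alongside the new ones, because `delete_originals` defaults to FALSE.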
Training on subsets: PipeOpChunk
When a dataset is too large, splitting it into smaller chunks and training on each chunk is a good option.
chks = po("chunk", 4)
lrns = ppl("greplicate", po("learner", lrn("classif.rpart")), 4)
mjv = po("classifavg", 4)
pipeline = chks %>>% lrns %>>% mjv
pipeline$plot(html = FALSE)
task = tsk("iris")
train.idx = sample(seq_len(task$nrow), 120)
test.idx = setdiff(seq_len(task$nrow), train.idx)
pipelrn = as_learner(pipeline)
# note: this scores on the training rows; pass test.idx to predict() for a held-out estimate
pipelrn$train(task, train.idx)$
  predict(task, train.idx)$
  score()
## classif.ce
## 0.3333333
Feature selection: PipeOpFilter and PipeOpSelect
A PipeOpFilter object brings the feature selection methods from mlr3filters into mlr3pipelines.
po("filter", mlr3filters::flt("information_gain"))
## PipeOp: <information_gain> (not trained)
## values: <list()>
## Input channels <name [train type, predict type]>:
## input [Task,Task]
## Output channels <name [train type, predict type]>:
## output [Task,Task]
Use filter.nfeat, filter.frac, or filter.cutoff to decide which variables/features are kept.
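As a concrete case, the sketch below keeps the two features with the highest variance. It uses the variance filter rather than information gain, so that no additional package is needed; that swap is made here only for illustration.

```r
library(mlr3)
library(mlr3pipelines)
library(mlr3filters)

# keep the two features with the highest variance
top2 <- po("filter", filter = flt("variance"), filter.nfeat = 2)

out <- top2$train(list(tsk("iris")))[[1]]
print(out$feature_names)
```

Replacing `filter.nfeat = 2` with `filter.frac = 0.5` keeps the same number of features on this task, since iris has four.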