Machine Learning | Decision Tree vs. Random Forest: Which Algorithm Should You Use? (with code & links)
The following article is from 数据派THU. Author: Abhishek Sharma.
This article is reposted from the WeChat account 数据派THU. Author: Abhishek Sharma; translation: 陈超; proofreading: 丁楠雅. Using bank-loan data as a case study, the article builds a model for deciding whether to approve a customer's loan application, and compares decision trees and random forests: how the two machine-learning algorithms differ and where each excels. Tags: algorithms, beginner, classification, machine learning, Python, structured data, supervised learning
A simple analogy to explain decision trees vs. random forests
Introduction to decision trees; an overview of random forests; decision tree vs. random forest in a head-to-head comparison (with code); why a random forest outperforms a decision tree; decision tree vs. random forest: when should you choose which algorithm?
Tree-based algorithms: a complete tutorial from scratch (in R & Python)
Getting started with decision trees (free course)
Building a random forest from scratch & understanding real-world data products
Random forest hyperparameter tuning: a beginner's guide
A comprehensive guide to ensemble learning (with Python code)
How to build ensemble models in machine learning? (with R code)
We will work with the Loan Prediction dataset from Analytics Vidhya's DataHack platform. This is a binary-classification problem in which we must decide, based on a given set of features, whether a person should be approved for a loan:
https://datahack.analyticsvidhya.com/contest/practice-problem-loan-prediction-iii/?utm_source=blog&utm_medium=decision-tree-vs-random-forest-algorithm
Note: you can go to the DataHack platform, compete with others in various online machine-learning contests, and get a chance to win exciting prizes:
https://datahack.analyticsvidhya.com/contest/all/?utm_source=blog&utm_medium=decision-tree-vs-random-forest-algorithm
Ready to code?
Step 2: Data preprocessing
I will impute missing values in the categorical variables with the mode, and missing values in the continuous variables with the mean (computed separately for each column). We will also label-encode the categorical variables. You can read the following article to learn more about label encoding:
https://www.analyticsvidhya.com/blog/2016/07/practical-guide-data-preprocessing-python-scikit-learn/?utm_source=blog&utm_medium=decision-tree-vs-random-forest-algorithm
# Data preprocessing and null-value imputation
import numpy as np
import pandas as pd

# Label encoding of the categorical variables
df['Gender'] = df['Gender'].map({'Male': 1, 'Female': 0})
df['Married'] = df['Married'].map({'Yes': 1, 'No': 0})
df['Education'] = df['Education'].map({'Graduate': 1, 'Not Graduate': 0})
df['Dependents'].replace('3+', 3, inplace=True)
df['Self_Employed'] = df['Self_Employed'].map({'Yes': 1, 'No': 0})
df['Property_Area'] = df['Property_Area'].map({'Semiurban': 1, 'Urban': 2, 'Rural': 3})
df['Loan_Status'] = df['Loan_Status'].map({'Y': 1, 'N': 0})

# Null-value imputation: mode for categorical columns, mean for continuous ones.
# (A single dict of {np.nan: value} pairs would silently keep only the last
# entry, since duplicate keys overwrite each other, so each column is filled
# explicitly instead.)
for col in ['Gender', 'Married', 'Dependents', 'Self_Employed', 'Credit_History']:
    df[col].fillna(df[col].mode()[0], inplace=True)
for col in ['LoanAmount', 'Loan_Amount_Term']:
    df[col].fillna(df[col].mean(), inplace=True)
Step 3: Creating the training and test sets
from sklearn.model_selection import train_test_split

X = df.drop(columns=['Loan_ID', 'Loan_Status']).values
Y = df['Loan_Status'].values
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
print('Shape of X_train=>',X_train.shape)
print('Shape of X_test=>',X_test.shape)
print('Shape of Y_train=>',Y_train.shape)
print('Shape of Y_test=>',Y_test.shape)
Step 4: Building and evaluating the models
# Building Decision Tree
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(criterion = 'entropy', random_state = 42)
dt.fit(X_train, Y_train)
dt_pred_train = dt.predict(X_train)
You can learn more about the F1-score and other evaluation metrics in the following article:
https://www.analyticsvidhya.com/blog/2019/08/11-important-model-evaluation-error-metrics/?utm_source=blog&utm_medium=decision-tree-vs-random-forest-algorithm
# Evaluation on the training set
from sklearn.metrics import f1_score

dt_pred_train = dt.predict(X_train)
print('Training Set Evaluation F1-Score=>', f1_score(Y_train, dt_pred_train))
# Evaluating on Test set
dt_pred_test = dt.predict(X_test)
print('Testing Set Evaluation F1-Score=>',f1_score(Y_test,dt_pred_test))
Building the random forest model
# Building Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(criterion = 'entropy', random_state = 42)
rfc.fit(X_train, Y_train)
# Evaluating on Training set
rfc_pred_train = rfc.predict(X_train)
print('Training Set Evaluation F1-Score=>',f1_score(Y_train,rfc_pred_train))
# Evaluating on Test set
rfc_pred_test = rfc.predict(X_test)
print('Testing Set Evaluation F1-Score=>',f1_score(Y_test,rfc_pred_test))
# Comparing the feature importances of the two models
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

feature_importance = pd.DataFrame({
    'rfc': rfc.feature_importances_,
    'dt': dt.feature_importances_
}, index=df.drop(columns=['Loan_ID', 'Loan_Status']).columns)
feature_importance.sort_values(by='rfc', ascending=True, inplace=True)

index = np.arange(len(feature_importance))
fig, ax = plt.subplots(figsize=(18, 8))
rfc_feature = ax.barh(index, feature_importance['rfc'], 0.4,
                      color='purple', label='Random Forest')
dt_feature = ax.barh(index + 0.4, feature_importance['dt'], 0.4,
                     color='lightgreen', label='Decision Tree')
ax.set(yticks=index + 0.4, yticklabels=feature_importance.index)
ax.legend()
plt.show()
As you can see in the chart above, the decision tree assigns very high importance to one particular set of features. The random forest, by contrast, selects features randomly during training, so it does not depend heavily on any specific feature set. This is a distinctive property of random forests over plain bagging. You can read the following to learn more about bagging:
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html#sklearn.ensemble.BaggingClassifier
A decision tree is easy to interpret and understand. A random forest, being an ensemble of many decision trees, is harder to interpret. The good news, though, is that interpreting a random forest is not impossible. Here is an article on how to interpret the results of a random forest model:
Original title:
Decision Tree vs. Random Forest – Which Algorithm Should you Use?
https://www.analyticsvidhya.com/blog/2020/05/decision-tree-vs-random-forest-algorithm/
Source: 数据派THU
Recommended by: 青酱
Layout & editing: 青酱