Data Mining from Beginner to Giving Up (3): Naive Bayes
Naive Bayes is a commonly used classification algorithm. It handles very high-dimensional datasets well, is fast, and has few tunable parameters, which makes it a good choice for a quick, rough baseline on classification problems; it is frequently used in scenarios such as spam filtering. The same content is also published at: https://blog.csdn.net/yezonggang.
The Naive Bayes Algorithm
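As a brief recap of the standard formulation: for a class c and a feature vector x = (x_1, ..., x_d), Naive Bayes applies Bayes' theorem together with the "naive" assumption that the features are conditionally independent given the class,

P(c \mid \mathbf{x}) = \frac{P(c)\,P(\mathbf{x} \mid c)}{P(\mathbf{x})} = \frac{P(c)}{P(\mathbf{x})} \prod_{i=1}^{d} P(x_i \mid c).

Since P(x) is the same for every class, prediction only needs the numerator:

h(\mathbf{x}) = \arg\max_{c} \; P(c) \prod_{i=1}^{d} P(x_i \mid c).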
Laplace Smoothing (Laplace Correction)
Because the feature space is fairly sparse, conditional probabilities estimated as zero come up frequently, and a zero for any single feature wipes out the whole product, so some correction is needed. The usual fix is Laplace smoothing: when estimating a conditional probability, add 1 to the count in the numerator (and, correspondingly, the number of possible feature values to the denominator), so that unseen feature-class combinations no longer get probability zero.
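In the usual count-based notation, with |D| the training-set size, |D_c| the number of samples of class c, |D_{c,x_i}| the number of class-c samples taking value x_i on feature i, N the number of classes, and N_i the number of possible values of feature i, the smoothed estimates are

\hat{P}(c) = \frac{|D_c| + 1}{|D| + N}, \qquad \hat{P}(x_i \mid c) = \frac{|D_{c,x_i}| + 1}{|D_c| + N_i}.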
The Mushroom Dataset
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

# Change baseUrl to point at the data directory to load the dataset
baseUrl="C:\\Users\\71781\\Desktop\\2020\\ML-20200422\\bayes\\"
mushrooms=pd.read_csv(baseUrl+"mushrooms.csv")
mushrooms.columns=['class','cap-shape','cap-surface','cap-color','ruises','odor','gill-attachment','gill-spacing','gill-size','gill-color','stalk-shape','stalk-root','stalk-surface-above-ring','stalk-surface-below-ring','stalk-color-above-ring','stalk-color-below-ring','veil-type','veil-color','ring-number','ring-type','spore-print-color','population','habitat']
mushrooms.shape

pd.set_option("display.max_columns",100)  # show all columns
mushrooms.head()

# All features are discrete (categorical); 'class' marks poisonous vs. edible
np.unique(mushrooms['cap-shape'])

# Explore how cap shape/surface/color relate to toxicity;
# cap shape 'b' shows a large proportion of edible mushrooms
fig,(ax1,ax2)=plt.subplots(1,2,figsize=(15,5))
sns.countplot(x='cap-shape',data=mushrooms,hue='class',ax=ax1)
sns.countplot(x='cap-surface',data=mushrooms,hue='class',ax=ax2)
sns.countplot(x='cap-color',hue='class',data=mushrooms)

# Recode poisonous/edible as 0/1, where 1 means edible
mushrooms['class'].replace('e',1,inplace=True)
mushrooms['class'].replace('p',0,inplace=True)

# Proportion of edible mushrooms for each cap color
perc = mushrooms[["cap-color", "class"]].groupby(['cap-color'],as_index=False).mean()
perc
sns.barplot(x='cap-color',y='class',data=perc)

# Preprocess with sklearn's LabelEncoder
from sklearn.preprocessing import LabelEncoder
labelencoder=LabelEncoder()
for col in mushrooms.columns:
    mushrooms[col] = labelencoder.fit_transform(mushrooms[col])
mushrooms.head()

sns.countplot(x='cap-shape',data=mushrooms,hue='class')

X=mushrooms.drop('class',axis=1)  # predictors
y=mushrooms['class']              # response
#X.head()

# Dummy (one-hot) encoding makes the per-attribute feature importances easier to read later,
# and keeps the classifier from favoring attributes with larger numeric codes
X=pd.get_dummies(X,columns=X.columns,drop_first=True)
X.head()

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)

# Naive Bayes
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics

model2 = GaussianNB()
model2.fit(X_train, y_train)
prediction2 = model2.predict(X_test)
print('The accuracy of Naive Bayes is: {0}'.format(metrics.accuracy_score(prediction2,y_test)))
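Since every predictor here is a 0/1 dummy variable after one-hot encoding, GaussianNB's assumption of continuous, normally distributed features is a somewhat awkward fit; a Bernoulli Naive Bayes model is arguably more natural for binary features, and its alpha parameter is exactly the Laplace (add-one) smoothing discussed above. A minimal sketch (not from the original post), assuming the X_train/X_test/y_train/y_test split created above is still in scope:

# Hypothetical alternative model, reusing the train/test split from above
from sklearn.naive_bayes import BernoulliNB
from sklearn import metrics

# alpha=1.0 corresponds to Laplace (add-one) smoothing of the per-feature counts
model3 = BernoulliNB(alpha=1.0)
model3.fit(X_train, y_train)
prediction3 = model3.predict(X_test)
print('The accuracy of Bernoulli Naive Bayes is: {0}'.format(metrics.accuracy_score(y_test, prediction3)))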