【课后作业】电信公司预测客户流失可能性

查看原文

其他

【课后作业】电信公司预测客户流失可能性

Ben Python爱好者社区 2019-04-07

作者：Ben 《Python数据科学技术详解与商业实践》作者

先后在亚信、德勤、百度等企业从事电信、金融行业数据挖掘工作。

这是《Python数据科学-技术详解与商业实践（第二期）》课程的课后作业

【课后作业】研究二手房价的影响因素，建立房价预测模型

电信公司希望针对客户的信息预测其流失可能性，数据存放在“telecom_churn.csv”中。

关注 Python爱好者社区回复电信即可获取数据。

1、分析思路：

在对客户流失与否的影响因素进行模型研究之前，首先对各解释变量与被解释变量进行两变量独立性分析，以初步判断影响流失的因素，进而建立客户

流失预测模型

主要变量说明如下：

#subscriberID="个人客户的ID"

#churn="是否流失：1=流失";

#Age="年龄"

#incomeCode="用户居住区域平均收入的代码"

#duration="在网时长"

#peakMinAv="统计期间内最高单月通话时长"

#peakMinDiff="统计期间结束月份与开始月份相比通话时长增加数量"

#posTrend="该用户通话时长是否呈现出上升态势：是=1"

#negTrend="该用户通话时长是否呈现出下降态势：是=1"

#nrProm="电话公司营销的数量"

#prom="最近一个月是否被营销过：是=1"

#curPlan="统计时间开始时套餐类型：1=最高通过200分钟；2=300分钟；3=350分钟；4=500分钟"

#avPlan="统计期间内平均套餐类型"

#planChange="统计结束时和开始时套餐的变化：正值代表套餐档次提升，负值代表下降，0代表不变"

#posPlanChange="统计期间是否提高套餐：1=是"

#negPlanChange="统计期间是否降低套餐：1=是"

#call_10086="拨打10086的次数"

2、作业安排：

2.1 基础知识：

1）作了一次营销活动，营销了1000人。事后统计结果，120人购买，其余人没有购买。请分别用矩估计法、极大似然估计法计算这个随机事件分布的参数（提示：该随机事件服从伯努利分布）

2）推导线性回归参数估计的最小二乘、矩估计、极大似然估计，推导逻辑回归的极大似然估计公式。线性回归和逻辑回归的极大似然法哪个可以得到

显性的公式解，哪个需要使用迭代法求解？解释极大似然法求解过程中用到的牛顿迭代法、随机梯度法的做法。

3）解释统计学习算法中超参的概念，请问目前统计方法中学习的线性回归、逻辑回归中涉及超参了吗？岭回归和Laso算法中的超参分别是什么？超参的作用是什么？统计学习算法中如何确定最优超参的取值？

4）比较统计分析法和统计学习（即机器学习）得到最优模型的思路。

5)二分类模型中（比如逻辑回归）的评估模型优劣的决策类和排序类评估指标分别包括哪些指标?

2.2 案例解答步骤如下：

1）两变量分析：检验该用户通话时长是否呈现出上升态势(posTrend)对流失(churn) 是否有预测价值

2）首先将原始数据拆分为训练和测试数据集，使用训练数据集建立在网时长对流失的逻辑回归，使用测试数据集制作混淆矩阵（阈值为0.5），提供准

确性、召回率指标，提供ROC曲线和AUC。

3）使用向前逐步法从其它备选变量中选择变量，构建基于AIC的最优模型，绘制ROC曲线，同时检验模型的膨胀系数。

4）使用岭回归和Laso算法重建第三步中的模型，使用交叉验证法确定惩罚参数(C值)。并比较步骤四中Laso算法得到的模型和第三步得到的模型的差异。

2.2.1参考代码如下:

下面代码（包含详细注释）都送你啦~

赶紧实现一波吧！体会一波数据分析挖掘乐趣！

# In[1]:
import os
import numpy as np
from scipy import stats
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
#pd.set_option('display.max_columns', None)
os.chdir(r"D:\Python_Training\script_Python\8logistic\HW")
# 导入数据和数据清洗
# In[5]:
churn = pd.read_csv(r'telecom_churn.csv', skipinitialspace=True)
churn.head()
#1两变量分析：检验该用户通话时长是否呈现出上升态势(posTrend)对流失(churn) 是否有预测价值
# ##  分类变量的相关关系
#
# 交叉表
# In[6]:
cross_table = pd.crosstab(churn.posTrend, 
                         churn.churn, margins=True)
cross_table
# 列联表
# In[7]:
def percConvert(ser):
    return ser/float(ser[-1])
cross_table.apply(percConvert, axis=1)
# In[8]:
print('''chisq = %6.4f 
p-value = %6.4f
dof = %i 
expected_freq = %s'''  %stats.chi2_contingency(cross_table.iloc[:2, :2]))
#2首先将原始数据拆分为训练和测试数据集，使用训练数据集建立在网时长对流失的逻辑回归，使用测试数据集制作混淆矩阵
#（阈值为0.5），提供准确性、召回率指标，提供ROC曲线和AUC。
# ## 逻辑回归
# In[9]:
churn.plot(x='duration', y='churn', kind='scatter')
# •随机抽样，建立训练集与测试集
# In[10]:
train = churn.sample(frac=0.7, random_state=1234).copy()
test = churn[~ churn.index.isin(train.index)].copy()
print(' 训练集样本量: %i \n 测试集样本量: %i' %(len(train), len(test)))
# In[11]:
lg = smf.glm('churn ~ duration', data=train, 
             family=sm.families.Binomial(sm.families.links.logit)).fit()
lg.summary()
# 预测
# In[19]:
train['proba'] = lg.predict(train)
test['proba'] = lg.predict(test)
test['proba'].head()
# In[12]:
# ## 模型评估
# 
# 设定阈值
# In[20]:
test['prediction'] = (test['proba'] > 0.5).astype('int')
# 混淆矩阵
# In[22]:
pd.crosstab(test.churn, test.prediction, margins=True)
# - 计算准确率
# In[23]:
acc = sum(test['prediction'] == test['churn']) /np.float(len(test))
print('The accurancy is %.2f' %acc)
# In[25]:
for i in np.arange(0.1, 0.9, 0.1):
    prediction = (test['proba'] > i).astype('int')
    confusion_matrix = pd.crosstab(prediction,test.churn,
                                   margins = True)
    precision = confusion_matrix.ix[0, 0] /confusion_matrix.ix['All', 0]
    recall = confusion_matrix.ix[0, 0] / confusion_matrix.ix[0, 'All']
    Specificity = confusion_matrix.ix[1, 1] /confusion_matrix.ix[1,'All']
    f1_score = 2 * (precision * recall) / (precision + recall)
    print('threshold: %s, precision: %.2f, recall:%.2f ,Specificity:%.2f , f1_score:%.2f'%(i, precision, recall, Specificity,f1_score))
# - 绘制ROC曲线
# In[27]:
import sklearn.metrics as metrics
fpr_test, tpr_test, th_test = metrics.roc_curve(test.churn, test.proba)
fpr_train, tpr_train, th_train = metrics.roc_curve(train.churn, train.proba)
plt.figure(figsize=[3, 3])
plt.plot(fpr_test, tpr_test, 'b--')
plt.plot(fpr_train, tpr_train, '')
plt.title('ROC curve')
plt.show()
# In[28]:
print('AUC = %.4f' %metrics.auc(fpr_test, tpr_test))
#3使用向前逐步法从其它备选变量中选择变量，构建基于AIC的最优模型，绘制ROC曲线，同时检验模型的膨胀系数。
# In[14]:
#- 多元逻辑回归
# 向前法
def forward_select(data, response):
    remaining = set(data.columns)
    remaining.remove(response)
    selected = []
    current_score, best_new_score = float('inf'), float('inf')
    while remaining:
        aic_with_candidates=[]
        for candidate in remaining:
            formula = "{} ~ {}".format(
                response,' + '.join(selected + [candidate]))
            aic = smf.glm(
                formula=formula, data=data, 
                family=sm.families.Binomial(sm.families.links.logit)
            ).fit().aic
            aic_with_candidates.append((aic, candidate))
        aic_with_candidates.sort(reverse=True)
        best_new_score, best_candidate=aic_with_candidates.pop()
        if current_score > best_new_score: 
            remaining.remove(best_candidate)
            selected.append(best_candidate)
            current_score = best_new_score
            print ('aic is {},continuing!'.format(current_score))
        else:        
            print ('forward selection over!')
            break

    formula = "{} ~ {} ".format(response,' + '.join(selected))
    print('final formula is {}'.format(formula))
    model = smf.glm(
        formula=formula, data=data, 
        family=sm.families.Binomial(sm.families.links.logit)
    ).fit()
    return(model)
# In[16]:
candidates = ['churn','duration','AGE','edu_class','posTrend','negTrend','nrProm','prom','curPlan','avgplan','planChange','incomeCode','feton','peakMinAv','peakMinDiff','call_10086']
data_for_select = train[candidates]
lg_m1 = forward_select(data=data_for_select, response='churn')
lg_m1.summary()
# Seemingly wrong when using 'statsmmodels.stats.outliers_influence.variance_inflation_factor'
# In[17]:
def vif(df, col_i):
    from statsmodels.formula.api import ols

    cols = list(df.columns)
    cols.remove(col_i)
    cols_noti = cols
    formula = col_i + '~' + '+'.join(cols_noti)
    r2 = ols(formula, df).fit().rsquared
    return 1. / (1. - r2)
# In[18]:
exog = train[candidates].drop(['churn'], axis=1)
for i in exog.columns:
    print(i, '\t', vif(df=exog, col_i=i))
#posTrend,negTrend;curPlan,avgplan有明显的共线性问题,剔除其中两个后重新建模.
#%%
#4）使用岭回归和Laso算法重建第三步中的模型，使用交叉验证法确定惩罚参数(C值)。并比较步骤四中Laso算法得到的模型和第三步得到的模型的差异
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
candidates = ['duration','AGE','edu_class','posTrend','negTrend','nrProm','prom','curPlan','avgplan','planChange','incomeCode','feton','peakMinAv','peakMinDiff','call_10086']
#data_for_select = churn[candidates]
scaler = StandardScaler()  # 标准化
X = scaler.fit_transform(churn[candidates])
y = churn['churn']
#%%
from sklearn import linear_model
from sklearn.svm import l1_min_c
cs = l1_min_c(X, y, loss='log') * np.logspace(0, 4)
print("Computing regularization path ...")
#start = datetime.now()
clf = linear_model.LogisticRegression(C=1.0, penalty='l1', tol=1e-6)
coefs_ = []
for c in cs:
    clf.set_params(C=c)
    clf.fit(X, y)
    coefs_.append(clf.coef_.ravel().copy())
#print("This took ", datetime.now() - start)
coefs_ = np.array(coefs_)
plt.plot(np.log10(cs), coefs_)
ymin, ymax = plt.ylim()
plt.xlabel('log(C)')
plt.ylabel('Coefficients')
plt.title('Logistic Regression Path')
plt.axis('tight')
plt.show()
#%%
cs = l1_min_c(X, y, loss='log') * np.logspace(0, 4)
import matplotlib.pyplot as plt #可视化模块  
from sklearn.cross_validation import cross_val_score # K折交叉验证模块  

k_scores = []  
clf = linear_model.LogisticRegression(penalty='l1')
#藉由迭代的方式来计算不同参数对模型的影响，并返回交叉验证后的平均准确率  
for c in cs:  
    clf.set_params(C=c)
    scores = cross_val_score(clf, X, y, cv=10, scoring='roc_auc')  #http://scikit-learn.org/stable/modules/model_evaluation.html
    k_scores.append([c,scores.mean(),scores.std()])  
#%%  
#可视化数据  
#%%
data=pd.DataFrame(k_scores)#将字典转换成为数据框
fig = plt.figure()
ax1 = fig.add_subplot(111)
ax1.plot(np.log10(data[0]), data[1],'b')
ax1.set_ylabel('Mean ROC(Blue)')
ax1.set_xlabel('log10(cs)')
ax2 = ax1.twinx()
ax2.plot(np.log10(data[0]), data[2],'r')
ax2.set_ylabel('Std ROC Index(Red)')
#得到合理的C为 np.exp(-1.9)
#%%
#重新实现Laso算法
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
candidates = ['duration','AGE','edu_class','posTrend','negTrend','nrProm','prom','curPlan','avgplan','planChange','incomeCode','feton','peakMinAv','peakMinDiff','call_10086']
#data_for_select = churn[candidates]
scaler = StandardScaler()  # 标准化
X = scaler.fit_transform(churn[candidates])
y = churn['churn']
#%%
from sklearn import linear_model
clf = linear_model.LogisticRegression(C=np.exp(-1.9), penalty='l1')
clf.fit(X, y)
clf.coef_
#%%

关注 Python爱好者社区回复 电信 即可获取数据。

作者新书推荐：

如果想学习数据科学，数据分析挖掘，机器学习

可以看看以下这篇小编走心推荐的文章：

【不看后悔】这 skr 帮 homie 提升数据科学能力的好书，不服来battle！

点击文末阅读原文即可购书（7折优惠！）

如果觉得看书学习效率低下

可以考虑下图扫码学习配套课程

Python爱好者社区历史文章大合集：

Python爱好者社区历史文章列表（每周append更新一次）

福利：文末扫码立刻关注公众号，“Python爱好者社区”，开始学习Python课程：

关注后在公众号内回复“课程”即可获取：

小编的Python入门免费视频课程！！！

【最新免费微课】小编的Python快速上手matplotlib可视化库！！！

崔老师爬虫实战案例免费学习视频。

陈老师数据分析报告制作免费学习视频。

玩转大数据分析！Spark2.X+Python 精华实战课程免费学习视频。

点击阅读原文立即购买本书，现在购买只要7折！

”FAN某”的离婚财产分割判决书（全文）

”FAN某”的离婚财产分割判决书（全文）

哈里斯女粉搞4B运动、毒杀丈夫，回旋镖能否让美国“血流成河”

比国产光刻机更重要的IPO要来了！

这把绝对高端局，只有中国人才懂