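Tianchi Beginner Competition: Predicting O2O Coupon Usage in 100 Lines of Code
Background
Using coupons to re-engage existing users or to draw new customers into stores is an important O2O marketing tool. Coupons handed out at random, however, are just noise to most users. For merchants, indiscriminate coupons can damage brand reputation and make marketing costs hard to estimate. This competition provides rich data from O2O scenarios and asks participants to analyze and model it in order to predict accurately whether a user will redeem a given coupon within the specified time window.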
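Data
The competition provides real online and offline consumption records of users from January 1, 2016 to June 30, 2016. The task is to predict whether coupons received in July 2016 are used within 15 days of receipt. The code in this post uses only two tables: the offline training set and the test set to be predicted. The details of the tables are as follows: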
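Evaluation
The goal is to predict whether a distributed coupon is redeemed. Given the task and its background, the evaluation metric is the average AUC (area under the ROC curve) of the redemption predictions: the AUC is computed separately for each coupon_id, and the mean over all coupons is the final score.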
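The per-coupon averaging described above can be sketched in a few lines of plain Python. This is only an illustration, not the official scoring script: the function names `auc` and `mean_coupon_auc` are made up for this example, the toy data is invented, and skipping coupons that have only one class (where AUC is undefined) is an assumption about how such cases are handled.

```python
from collections import defaultdict

def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a win. Returns None when one class is missing,
    because AUC is undefined in that case.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None
    wins = 0.0
    for p in pos:              # O(n^2) pair loop: fine for a demo, slow on real data
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def mean_coupon_auc(coupon_ids, labels, scores):
    """Average the AUC over coupons, skipping coupons with a single class (assumption)."""
    groups = defaultdict(lambda: ([], []))
    for c, y, s in zip(coupon_ids, labels, scores):
        groups[c][0].append(y)
        groups[c][1].append(s)
    aucs = [a for a in (auc(ys, ss) for ys, ss in groups.values()) if a is not None]
    return sum(aucs) / len(aucs) if aucs else None

# Invented toy data: three coupons, the last one has no positive sample.
coupon_ids = ['c1', 'c1', 'c1', 'c2', 'c2', 'c2', 'c3']
labels     = [1,    0,    0,    1,    1,    0,    0]
scores     = [0.9,  0.2,  0.4,  0.3,  0.8,  0.5,  0.1]
print(mean_coupon_auc(coupon_ids, labels, scores))
```

Because the score is averaged per coupon, a model that only separates "popular" coupons from "unpopular" ones gains nothing; what matters is how well users are ranked within each coupon.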
Feature Engineering
Load the data:
import numpy as np
import pandas as pd

# Load the data; keep_default_na=False keeps missing values as the literal string 'null'
dfoff = pd.read_csv('ccf_offline_stage1_train.csv', keep_default_na=False)
dftest = pd.read_csv('ccf_offline_stage1_test_revised.csv', keep_default_na=False)
print('Offline data:', '\n', dfoff.head())
print('Test set data:', '\n', dftest.head())

Next, convert the "spend xx, get yy off" coupons into discount rates: a coupon of the form xx:yy becomes the rate 1 - yy/xx. Build the coupon-related features discount_rate, discount_man, discount_jian, and discount_type, and convert Discount_rate and Distance into discount_rate and distance.
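As a quick worked example of the conversion rule just described, the sketch below applies it to the three kinds of raw values that the Discount_rate column can hold. The helper name `parse_discount` and the sample values are hypothetical and exist only for this illustration; it follows the rule stated in the text rather than any official specification.

```python
def parse_discount(raw):
    """Hypothetical helper: map a raw Discount_rate string to
    (discount_rate, discount_man, discount_jian, discount_type)."""
    if raw == 'null':
        # No coupon: treat as "no discount" (rate 1.0), type left as 'null'
        return 1.0, 0, 0, 'null'
    if ':' in raw:
        # "xx:yy" means: spend xx, get yy off -> rate = 1 - yy/xx
        man, jian = raw.split(':')
        return 1.0 - float(jian) / float(man), int(man), int(jian), 1
    # Already a plain discount rate such as "0.95"
    return float(raw), 0, 0, 0

for raw in ['100:10', '0.95', 'null']:
    print(raw, '->', parse_discount(raw))
```

A 100:10 coupon therefore corresponds to a rate of 0.9, which puts threshold coupons and plain-rate coupons on one comparable scale while discount_man and discount_jian keep the original threshold information.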
def getDiscountType(row):
    """1 for 'spend xx, get yy off' coupons, 0 for plain rates, 'null' if no coupon."""
    if row == 'null':
        return 'null'
    elif ':' in row:
        return 1
    else:
        return 0

def convertRate(row):
    """Convert discount to rate"""
    if row == 'null':
        return 1.0
    elif ':' in row:
        rows = row.split(':')
        return 1.0 - float(rows[1]) / float(rows[0])
    else:
        return float(row)

def getDiscountMan(row):
    """Spending threshold xx of an xx:yy coupon, 0 otherwise."""
    if ':' in row:
        rows = row.split(':')
        return int(rows[0])
    else:
        return 0

def getDiscountJian(row):
    """Amount yy taken off by an xx:yy coupon, 0 otherwise."""
    if ':' in row:
        rows = row.split(':')
        return int(rows[1])
    else:
        return 0

def processData(df):
    # convert discount_rate
    df['discount_rate'] = df['Discount_rate'].apply(convertRate)
    df['discount_man'] = df['Discount_rate'].apply(getDiscountMan)
    df['discount_jian'] = df['Discount_rate'].apply(getDiscountJian)
    df['discount_type'] = df['Discount_rate'].apply(getDiscountType)
    print(df['discount_rate'].unique())
    # convert distance; missing distances become -1
    df['distance'] = df['Distance'].replace('null', -1).astype(int)
    print(df['distance'].unique())
    return df

dfoff = processData(dfoff)
dftest = processData(dftest)
print(dfoff.head())

Next, look at the number of coupons customers received each day, and how many of those coupons were later used for a purchase.
import matplotlib.pyplot as plt
import seaborn as sns

date_received = dfoff['Date_received'].unique()
date_received = sorted(date_received[date_received != 'null'])

date_buy = dfoff['Date'].unique()
date_buy = sorted(date_buy[date_buy != 'null'])

print('Coupons were received from', date_received[0], 'to', date_received[-1])
print('Purchases were made from', date_buy[0], 'to', date_buy[-1])

# Coupons received per day, and coupons received that day which were later used
couponbydate = dfoff[dfoff['Date_received'] != 'null'][['Date_received', 'Date']].groupby(['Date_received'], as_index=False).count()
couponbydate.columns = ['Date_received', 'count']
buybydate = dfoff[(dfoff['Date'] != 'null') & (dfoff['Date_received'] != 'null')][['Date_received', 'Date']].groupby(['Date_received'], as_index=False).count()
buybydate.columns = ['Date_received', 'count']

sns.set_style('ticks')
sns.set_context("notebook", font_scale=1.4)
plt.figure(figsize=(12, 8))
date_received_dt = pd.to_datetime(date_received, format='%Y%m%d')

# Top panel: daily counts on a log scale
plt.subplot(211)
plt.bar(date_received_dt, couponbydate['count'], label='number of coupon received')
plt.bar(date_received_dt, buybydate['count'], label='number of coupon used')
plt.yscale('log')
plt.ylabel('Count')
plt.legend()

# Bottom panel: daily redemption ratio
plt.subplot(212)
plt.bar(date_received_dt, buybydate['count'] / couponbydate['count'])
plt.ylabel('Ratio(coupon used/coupon received)')
plt.tight_layout()
plt.show()

Since coupon usage varies with the date and depends on the day of the week, add new weekday features.
from datetime import date

def getWeekday(row):
    """Return 1 (Monday) to 7 (Sunday) for a YYYYMMDD string, or 'null' unchanged."""
    if row == 'null':
        return row
    else:
        return date(int(row[0:4]), int(row[4:6]), int(row[6:8])).weekday() + 1

dfoff['weekday'] = dfoff['Date_received'].astype(str).apply(getWeekday)
dftest['weekday'] = dftest['Date_received'].astype(str).apply(getWeekday)

# weekday_type: 1 for Saturday and Sunday, 0 otherwise
dfoff['weekday_type'] = dfoff['weekday'].apply(lambda x: 1 if x in [6, 7] else 0)
dftest['weekday_type'] = dftest['weekday'].apply(lambda x: 1 if x in [6, 7] else 0)

# One-hot encode weekday into dummy variables
weekdaycols = ['weekday_' + str(i) for i in range(1, 8)]
print(weekdaycols)

tmpdf = pd.get_dummies(dfoff['weekday'].replace('null', np.nan))
tmpdf.columns = weekdaycols
dfoff[weekdaycols] = tmpdf

tmpdf = pd.get_dummies(dftest['weekday'].replace('null', np.nan))
tmpdf.columns = weekdaycols
dftest[weekdaycols] = tmpdf

Next, add labels in a new column, label: a purchase made with the coupon within 15 days of receiving it is a positive sample (1), a row with no coupon received is ordinary consumption (-1), and everything else is a negative sample (0).
def label(row):
    # No coupon received: ordinary consumption
    if row['Date_received'] == 'null':
        return -1
    # Coupon received and a purchase made: positive if within 15 days
    if row['Date'] != 'null':
        td = pd.to_datetime(row['Date'], format='%Y%m%d') - pd.to_datetime(row['Date_received'], format='%Y%m%d')
        if td <= pd.Timedelta(15, 'D'):
            return 1
    # Coupon received but not used within 15 days
    return 0

dfoff['label'] = dfoff.apply(label, axis=1)

Prediction
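Train/validation split: coupons received from 20160101 to 20160515 form the training set, and those received from 20160516 to 20160615 form the validation set. Then fit a simple XGBoost model to make the prediction.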
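The split relies on comparing dates as strings. A minimal sketch of why that works, assuming the dates are zero-padded 8-character YYYYMMDD strings (the sample dates below are invented): lexicographic order on such strings matches chronological order, so no parsing is needed.

```python
from datetime import datetime

# Invented sample of Date_received values in YYYYMMDD form
dates = ['20160102', '20160515', '20160516', '20160615', '20160616']

# String comparison, as used for the train/validation split
train = [d for d in dates if d < '20160516']
valid = [d for d in dates if '20160516' <= d <= '20160615']

# The same split using real date objects, to check that the two agree
def to_date(s):
    return datetime.strptime(s, '%Y%m%d').date()

lo, hi = to_date('20160516'), to_date('20160615')
train_dt = [d for d in dates if to_date(d) < lo]
valid_dt = [d for d in dates if lo <= to_date(d) <= hi]

print(train, valid)
print(train == train_dt and valid == valid_dt)
```

Note that a date such as 20160616 falls into neither set. Presumably this is deliberate: the data ends on June 30, so for coupons received after June 15 the full 15-day redemption window is not observed and their labels would be unreliable.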
import xgboost as xgb
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import MinMaxScaler

# Keep only rows where a coupon was received
df = dfoff[dfoff['label'] != -1].copy()
df['discount_type'] = pd.to_numeric(df['discount_type'])
df['weekday'] = pd.to_numeric(df['weekday'])
train = df[(df['Date_received'] < '20160516')].copy()
valid = df[(df['Date_received'] >= '20160516') & (df['Date_received'] <= '20160615')].copy()

# feature
original_feature = ['discount_rate', 'discount_type', 'discount_man', 'discount_jian', 'distance', 'weekday', 'weekday_type'] + weekdaycols
predictors = original_feature

# XGBoost model
dtrain = xgb.DMatrix(train[predictors], label=train['label'])
dtest = xgb.DMatrix(valid[predictors])

params = {'booster': 'gbtree',
          'objective': 'rank:pairwise',
          'eval_metric': 'auc',
          'gamma': 0.1,
          'min_child_weight': 1.1,
          'max_depth': 5,
          'lambda': 10,
          'subsample': 0.7,
          'colsample_bytree': 0.7,
          'colsample_bylevel': 0.7,
          'eta': 0.01,
          'tree_method': 'exact',
          'seed': 0,
          'nthread': 12
          }
watchlist = [(dtrain, 'train')]
model = xgb.train(params, dtrain, num_boost_round=100, evals=watchlist)

# Overall AUC on the validation set (not the per-coupon average used by the competition)
ypred = model.predict(dtest)
print('AUC: %.4f' % roc_auc_score(valid['label'], ypred))

# test prediction for submission: rescale the raw scores to [0, 1]
ftest = xgb.DMatrix(dftest[predictors])
y_xg_test_pred = model.predict(ftest)
y_pred = MinMaxScaler().fit_transform(y_xg_test_pred.reshape(-1, 1))
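The last step rescales the model output with MinMaxScaler. With the rank:pairwise objective the predictions are ranking scores rather than probabilities, so they are not confined to [0, 1]. The sketch below, with a hand-written `min_max_scale` function and invented scores (both are illustrations, not the scikit-learn implementation), shows that min-max scaling only moves the scores into [0, 1] and leaves their ordering, and therefore the AUC, unchanged.

```python
def min_max_scale(values):
    """Linearly map values onto [0, 1]; a constant input maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def ranking(values):
    """Indices of the values in ascending order (stable for ties)."""
    return sorted(range(len(values)), key=lambda i: values[i])

# Invented raw ranking scores, including a tie and negative values
raw_scores = [-1.2, 0.3, 2.5, 0.3, -0.4]
scaled = min_max_scale(raw_scores)

print(scaled)
# AUC depends only on how samples are ordered, so it is identical before and after scaling
print(ranking(raw_scores) == ranking(scaled))
```

The rescaling is therefore purely cosmetic with respect to the evaluation metric: it makes the submitted values look like probabilities without affecting the score.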
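Closing Remarks
After finishing, I realized the code runs past 100 lines, which is a little embarrassing. The plotting code is not actually used for the prediction, though, so without it the total is roughly 100 lines.
This post took a fair amount of effort; tips, likes, and shares are all appreciated as recognition and support for the author.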