Which second-hand phone should you buy? A programmer scrapes JD.com to find out!
The following article is from CSDN Author AJ Gordon
Author | AJ Gordon
Editor | 李雪敬
Data collection
from selenium import webdriver
from bs4 import BeautifulSoup

# Assumption: redis_db is a Redis client created elsewhere, e.g. redis.Redis()

# Collect the phone-model IDs
def get_mobile_model_id():
    # Browser options
    option = webdriver.ChromeOptions()
    # Hide the automation switch to reduce the chance of being blocked
    option.add_experimental_option('excludeSwitches', ['enable-automation'])
    # Do not load images
    option.add_experimental_option("prefs", {"profile.managed_default_content_settings.images": 2})
    # Headless mode (no browser window)
    option.add_argument('--headless')
    option.add_argument('--disable-gpu')
    browser = webdriver.Chrome(options=option)
    browser.get('https://list.jd.com/list.html?cat=13765%2C13767')
    # Grab the source of the page currently open in the browser
    page_text = browser.page_source
    browser.quit()
    # Parse out the phone-model IDs
    soup = BeautifulSoup(page_text, 'lxml')
    model_type = soup.find_all('ul', {'class': 'J_valueList clearfix'})[1].find_all('li')
    for i in model_type:
        # Model name
        # type = i.find('a').get_text()
        # Model ID
        type_id = i.find('a')['href'].split('ev=')[-1].split('&cid2=')[0]
        redis_db.sadd('jd_mobile_model_id', type_id)
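The ID extraction above is plain string splitting, so it can be tried without launching a browser. The following is a minimal sketch; the helper name extract_model_id and the sample href are hypothetical, made up for illustration rather than taken from JD.com.

```python
def extract_model_id(href: str) -> str:
    """Return the text between 'ev=' and '&cid2=' in a filter link."""
    # Same two splits as in get_mobile_model_id()
    return href.split('ev=')[-1].split('&cid2=')[0]

# Hypothetical href, invented for this example
sample_href = '/list.html?cat=13765,13767&ev=exbrand_12345&cid2=13767'
model_id = extract_model_id(sample_href)
print(model_id)
```

Note that if neither marker is present, the splits return the whole href unchanged, so a real scraper may want to check for 'ev=' before storing the value.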
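Data description
Figure 1: Sales volume by date
Figure 2: Sales volume by time
Figure 3: Sales volume by color
Figure 4: Sales volume by price
Figure 5: Sales volume by brand
Figure 6: Top 3 best-selling models for each brand
Figure 7: Word cloud of negative reviews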
Data modeling
1) Import libraries and load the data
import pandas as pd
import numpy as np
from scipy.special import boxcox1p,inv_boxcox1p
from sklearn.preprocessing import MinMaxScaler,StandardScaler,RobustScaler
from sklearn.model_selection import GridSearchCV,RandomizedSearchCV
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error,r2_score
from sklearn.metrics import make_scorer
import seaborn as sns
import matplotlib.pyplot as plt
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
plt.rcParams['font.sans-serif']=['SimHei']
plt.rcParams['axes.unicode_minus'] = False  # render minus signs correctly
def load_data():
    data = pd.read_csv('result.csv',dtype={'skuId':str})
    data = data[['skuId','old_new_degree', 'brand', 'model', 'color','version', 'Double_card_machine_type', 'Front_card_machine_type','Rear_camera_pixel','Battery_capacity', 'Running_memory','screen_size', 'price']].drop_duplicates('skuId')
    return data
2) Clean the data
Fill in missing values for some fields, and merge similar categories into one.
def clean_data(data):
    # Fill missing values (assigning back avoids pandas' chained-assignment pitfalls with inplace=True)
    data['model'] = data['model'].fillna('Missing')
    data['color'] = data['color'].fillna('Missing')
    # Normalize fields: 'Missing' becomes '0', and unit suffixes are stripped
    data['old_new_degree'] = data.apply(lambda x: str(x['old_new_degree']),axis=1)
    data['version'] = data.apply(lambda x:'0' if x['version']=='Missing' else str(x['version']),axis=1)
    data['Front_card_machine_type'] = data.apply(lambda x:'0' if x['Front_card_machine_type']=='Missing' else str(x['Front_card_machine_type'][:4].replace('万','')),axis=1)
    data['Rear_camera_pixel'] = data.apply(lambda x:'0' if x['Rear_camera_pixel']=='Missing' else str(x['Rear_camera_pixel'][:4].replace('万','')),axis=1)
    data['Battery_capacity'] = data.apply(lambda x:'0' if x['Battery_capacity']=='Missing' else str(x['Battery_capacity']),axis=1)
    data['Running_memory'] = data.apply(lambda x:'0' if x['Running_memory']=='Missing' else str(x['Running_memory'].replace('GB','')),axis=1)
    data['screen_size'] = data.apply(lambda x:'0' if x['screen_size']=='Missing' else str(x['screen_size'].replace('英寸','')),axis=1)
    return data
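The unit stripping in clean_data can also be written with pandas string methods instead of row-wise apply. This is only a sketch on a made-up three-row frame: the sample values and the helper name strip_unit are assumptions, not rows or code from result.csv or the original project.

```python
import pandas as pd

# Made-up sample rows using two of the column names from the article
sample = pd.DataFrame({
    'Running_memory': ['8GB', 'Missing', '12GB'],
    'screen_size': ['6.1英寸', '5.5英寸', 'Missing'],
})

def strip_unit(col: pd.Series, unit: str) -> pd.Series:
    # Replace the 'Missing' placeholder with '0', then drop the unit suffix
    return col.replace('Missing', '0').str.replace(unit, '', regex=False)

sample['Running_memory'] = strip_unit(sample['Running_memory'], 'GB')
sample['screen_size'] = strip_unit(sample['screen_size'], '英寸')
print(sample)
```

The values stay strings here, as they do in clean_data, because the next step one-hot encodes them as categories.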
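3) Discrete variables: one-hot encoding
Discrete variables come in two kinds, ordered and unordered. Phone memory, where more is better, is an ordered discrete variable; color is an unordered one. Here I one-hot encode all of them with pandas' built-in get_dummies(). sklearn also provides a one-hot encoder, OneHotEncoder(). The difference is that get_dummies() cannot handle categories it has not seen before and has to be re-run every time, so it only suits models built on small datasets.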
def get_dummy(df):
    cols = ['version', 'Front_card_machine_type', 'Rear_camera_pixel', 'Battery_capacity', 'Running_memory','screen_size','old_new_degree', 'brand', 'model', 'color','Double_card_machine_type']
    dummy_cols = df[cols].copy()
    df = df.drop(cols,axis=1)
    dummy_cols = pd.get_dummies(dummy_cols,prefix=cols)
    df = pd.concat([df,dummy_cols],axis=1)
    return df
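The difference between the two encoders shows up as soon as a new category appears. Below is a small sketch on made-up color values; the frames train and new are assumptions for illustration, and handle_unknown='ignore' is one possible setting rather than something the article uses.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({'color': ['black', 'white', 'black']})
new = pd.DataFrame({'color': ['gold']})  # category not present in train

# get_dummies builds its columns from whatever values are in the frame,
# so the two frames end up with different, incompatible columns
train_cols = list(pd.get_dummies(train, columns=['color']).columns)
new_cols = list(pd.get_dummies(new, columns=['color']).columns)

# OneHotEncoder remembers the categories seen during fit;
# with handle_unknown='ignore' an unseen value becomes an all-zero row
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(train)
encoded_new = enc.transform(new).toarray()

print(train_cols, new_cols, encoded_new)
```

With get_dummies, the new frame would have to be encoded together with the training data to get matching columns, which is the "re-run every time" cost mentioned above.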
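4) Split the data
Split the original dataset into two parts: a training set and a test set (the last 100 rows). Because regression tends to behave better when the label is close to normally distributed, the training labels are transformed with boxcox1p (with lambda = 0, this is log(1 + y)) to bring them closer to a normal distribution.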
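# Split the data (the original omits the def line; the name split_data is an assumption)
def split_data(df):
    all_rows = df.shape[0]
    ## Training set: everything except the last 100 rows
    X_train = df[:all_rows-100]
    y_train = X_train['price'].copy()
    y_train = boxcox1p(y_train, 0)
    X_train = X_train.drop(['skuId','price'],axis=1)
    ## Test set: the last 100 rows
    X_test = df[all_rows-100:]
    y_test = X_test[['skuId','price']].copy()
    X_test = X_test.drop(['skuId','price'],axis=1)
    return X_train,y_train,X_test,y_test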
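To see what the label transform does, note that boxcox1p with a lambda of 0 is the same as log(1 + y), and inv_boxcox1p with the same lambda undoes it. A quick check on made-up prices (the numbers are illustrative, not from the dataset):

```python
import numpy as np
from scipy.special import boxcox1p, inv_boxcox1p

prices = np.array([199.0, 1500.0, 4299.0])  # made-up prices

transformed = boxcox1p(prices, 0)         # same as np.log1p(prices)
restored = inv_boxcox1p(transformed, 0)   # back to the original scale

print(transformed)
print(restored)
```

This round trip is why predictions made on the transformed scale have to go through inv_boxcox1p before they are compared with real prices.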
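5) Dimensionality reduction
One-hot encoding adds a large number of features, so the dimensionality is reduced with PCA, keeping enough components to explain 90% of the variance.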
def value_pca(X_train,X_test):
    pca = PCA(n_components=0.9)
    X_train = pca.fit_transform(X_train)
    X_test = pca.transform(X_test)
    #variance = pd.DataFrame(pca.explained_variance_ratio_)
    #np.cumsum(pca.explained_variance_ratio_)
    return X_train,X_test
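Passing a float between 0 and 1 as n_components tells PCA to keep as many components as are needed to reach that share of explained variance. The sketch below illustrates this on synthetic data; the random matrix is an assumption standing in for the one-hot encoded features.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in: 200 samples, 10 features with increasing spread
X = rng.normal(size=(200, 10)) * np.arange(1, 11)

pca = PCA(n_components=0.9)
X_reduced = pca.fit_transform(X)

# Cumulative share of variance explained by the kept components
explained = np.cumsum(pca.explained_variance_ratio_)
print(X_reduced.shape, explained[-1])
```

As in value_pca, the PCA should be fitted on the training set only and then applied to the test set with transform.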
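6) Build the model
Once the features are prepared, a model can be fitted. I use a random forest regressor, tuned with a GridSearchCV() grid search that uses a custom RMSE scorer as its selection criterion. The best parameters are then used to predict, and R2 compares the predicted values against the actual ones; the closer R2 is to 1, the better. This model's R2 is 0.912.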
def model(X_train,y_train,X_test,y_test):
    # Custom scoring function
    def my_custom_loss_func(y_true, y_pred):
        return np.sqrt(mean_squared_error(y_true, y_pred))
    # Error metrics are better when smaller, so make_scorer gets greater_is_better=False
    rmse = make_scorer(my_custom_loss_func, greater_is_better=False)
    # Parameter grid
    rfr_param_test = {
        'n_estimators': [10,20,30,40,50,60],
        'max_depth': [5,6,7,8,9,10]}
    # Grid search
    grid_search = GridSearchCV(estimator=RandomForestRegressor(), param_grid=rfr_param_test, cv=5, scoring=rmse)
    grid_search.fit(X_train,y_train)
    print(grid_search.best_params_)
    # Predict with the best estimator
    rft_model = grid_search.best_estimator_
    rft_model.fit(X_train, y_train)
    y_pred = rft_model.predict(X_test)
    # Undo the boxcox1p transform to get prices back
    y_pred = inv_boxcox1p(y_pred, 0)
    # Print R2
    R2 = r2_score(y_test['price'], y_pred)
    print('R2:{}'.format(R2))
    # Save the results
    result = pd.DataFrame({
        'skuID':y_test['skuId'],  # the column in y_test is named 'skuId'
        'price_old':y_test['price'],
        'price_pred':y_pred})
    result.to_csv('Regress_result.csv',index=False,encoding='utf_8_sig')
    return result
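One detail of the custom scorer is easy to miss: with greater_is_better=False, make_scorer negates the metric, so the best_score_ reported by the grid search is the negative RMSE. The following is a scaled-down sketch on synthetic data; the dataset from make_regression, the smaller grid, and the fixed random_state values are assumptions chosen to keep the example fast, not the article's settings.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.model_selection import GridSearchCV

def rmse_func(y_true, y_pred):
    # Root mean squared error: smaller is better
    return np.sqrt(mean_squared_error(y_true, y_pred))

# greater_is_better=False makes the scorer return -RMSE,
# so that "higher score" still means "better model"
rmse = make_scorer(rmse_func, greater_is_better=False)

# Synthetic regression data standing in for the phone features
X, y = make_regression(n_samples=120, n_features=5, noise=10.0, random_state=0)

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={'n_estimators': [10, 20], 'max_depth': [3, 5]},
    cv=3,
    scoring=rmse,
)
search.fit(X, y)

best_rmse = -search.best_score_  # flip the sign to read it as an RMSE
print(search.best_params_, best_rmse)
```

scikit-learn also ships a built-in scoring string, 'neg_root_mean_squared_error', which could replace the custom scorer.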
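Summary
From the analysis above, I found that the iPhone is currently the best seller on the second-hand phone market. Although domestic brands have become better known in recent years, both sellers and buyers still lean toward the iPhone. In addition, the lower the price of a second-hand phone, the higher its sales. The downside of a low price, however, is poor after-sales service from the seller, a disappointing screen, fast battery wear, and so on. And when you want to sell a used phone yourself, data modeling can also give you a reference point for pricing it.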