
Data in Practice | Predicting a Used Car's Average Annual Value Loss

凹凸数据 2021-08-09

The following article is from Python数据分析实战与AI干货 Author 木木



Hi everyone, I'm 朱小五.


Since you all like hands-on projects, here's another one to share!


Data download link:

https://www.kaggle.com/orgesleka/used-cars-database


Let's get to work.

01. Preparing the Data


Dataset:

The eBay Kleinanzeigen used-car dataset

[data on more than 370,000 used cars]


Field descriptions:

  • dateCrawled : date when the ad was first crawled
  • name : name of the car
  • seller : private seller or dealer
  • offerType : type of listing
  • price : asking price in the ad
  • abtest : A/B test group
  • vehicleType : vehicle type
  • yearOfRegistration : year in which the car was first registered
  • gearbox : gearbox type
  • powerPS : power of the car in PS
  • model : car model
  • kilometer : number of kilometers the car has driven
  • monthOfRegistration : month in which the car was first registered
  • fuelType : fuel type
  • brand : brand
  • notRepairedDamage : whether the car has damage that has not yet been repaired
  • dateCreated : date the ad was first created on eBay
  • nrOfPictures : number of pictures in the ad
  • postalCode : postal code
  • lastSeen : when the crawler last saw this ad online


Code:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model, preprocessing, svm
from sklearn.preprocessing import StandardScaler, Normalizer
import math
import matplotlib
import seaborn as sns
df = pd.read_csv('../autos.csv')  # if this raises a UnicodeDecodeError, try encoding='latin-1'


02. Cleaning the Data


Code:

# Take a look at the summary statistics of the numeric fields
df.describe()

# Drop columns we won't use
df.drop(['seller', 'offerType', 'abtest', 'dateCrawled', 'nrOfPictures', 'lastSeen', 'postalCode', 'dateCreated'], axis='columns', inplace=True)

# Remove duplicates and NaNs, and define sensible value ranges for the columns
print("Too new: %d" % df.loc[df.yearOfRegistration >= 2017].count()['name'])
print("Too old: %d" % df.loc[df.yearOfRegistration < 1950].count()['name'])
print("Too cheap: %d" % df.loc[df.price < 100].count()['name'])
print("Too expensive: " , df.loc[df.price > 150000].count()['name'])
print("Too few km: " , df.loc[df.kilometer < 5000].count()['name'])
print("Too many km: " , df.loc[df.kilometer > 200000].count()['name'])
print("Too few PS: " , df.loc[df.powerPS < 10].count()['name'])
print("Too many PS: " , df.loc[df.powerPS > 500].count()['name'])
print("Fuel types: " , df['fuelType'].unique())
#print("Offer types: " , df['offerType'].unique())
#print("Sellers: " , df['seller'].unique())
print("Damages: " , df['notRepairedDamage'].unique())
#print("Pics: " , df['nrOfPictures'].unique()) # nrOfPictures : number of pictures in the ad (unfortunately this field contains everywhere a 0 and is thus useless (bug in crawler!) )
#print("Postale codes: " , df['postalCode'].unique())
print("Vehicle types: " , df['vehicleType'].unique())
print("Brands: " , df['brand'].unique())

# Cleaning data
#valid_models = df.dropna()

#### Removing the duplicates
dedups = df.drop_duplicates(['name', 'price', 'vehicleType', 'yearOfRegistration',
                             'gearbox', 'powerPS', 'model', 'kilometer', 'monthOfRegistration',
                             'fuelType', 'notRepairedDamage'])

#### Removing the outliers
dedups = dedups[
    (dedups.yearOfRegistration <= 2016)
    & (dedups.yearOfRegistration >= 1950)
    & (dedups.price >= 100)
    & (dedups.price <= 150000)
    & (dedups.powerPS >= 10)
    & (dedups.powerPS <= 500)]

print("-----------------\nData kept for analisys: %d percent of the entire set\n-----------------" % (100 * dedups['name'].count() / df['name'].count()))


Output:


Handle the null values:

dedups.isnull().sum()
dedups['notRepairedDamage'].fillna(value='not-declared', inplace=True)
dedups['fuelType'].fillna(value='not-declared', inplace=True)
dedups['gearbox'].fillna(value='not-declared', inplace=True)
dedups['vehicleType'].fillna(value='not-declared', inplace=True)
dedups['model'].fillna(value='not-declared', inplace=True)


03. Visualization


Let's look at a few charts to understand how the data is distributed across categories.


Code:

categories = ['gearbox', 'model', 'brand', 'vehicleType', 'fuelType', 'notRepairedDamage']

for i, c in enumerate(categories):
    v = dedups[c].unique()

    g = dedups.groupby(by=c)[c].count().sort_values(ascending=False)
    r = range(min(len(v), 5))

    print(g.head())
    plt.figure(figsize=(5, 3))
    plt.bar(r, g.head())
    #plt.xticks(r, v)
    plt.xticks(r, g.head().index)  # label only the bars that are actually plotted
    plt.show()


Output (one of the charts as an example):


04. Feature Engineering


Add the length of the name as a feature to see how much a long description affects the price.


Code:

dedups['namelen'] = [min(70, len(n)) for n in dedups['name']]

ax = sns.jointplot(x='namelen',
                   y='price',
                   data=dedups[['namelen', 'price']],
                   # data=dedups[['namelen','price']][dedups['model']=='golf'],
                   alpha=0.1,
                   height=8)  # older seaborn versions call this parameter size=8


Output:


Names between roughly 15 and 30 characters seem to go with better selling prices.


One possible explanation is that a longer name lists more options and accessories, so the price is naturally higher. Very short and very long names don't work as well.
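A quick numeric check of this claim (not in the original post; a minimal sketch that only assumes the dedups DataFrame from the cleaning step above) is to bin namelen and compare the median price per bin:

# Illustrative sketch: median price per name-length bin (assumes dedups from above).
bins = pd.cut(dedups['namelen'], bins=[0, 15, 30, 45, 70])
print(dedups.groupby(bins)['price'].median())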


Code:

labels = ['name', 'gearbox', 'notRepairedDamage', 'model', 'brand', 'fuelType', 'vehicleType']
les = {}

for l in labels:
    les[l] = preprocessing.LabelEncoder()
    les[l].fit(dedups[l])
    tr = les[l].transform(dedups[l])
    dedups.loc[:, l + '_feat'] = pd.Series(tr, index=dedups.index)

labeled = dedups[['price'
                  , 'yearOfRegistration'
                  , 'powerPS'
                  , 'kilometer'
                  , 'monthOfRegistration'
                  , 'namelen']
                 + [x + "_feat" for x in labels]].copy()  # copy so the drop below doesn't warn

len(labeled['name_feat'].unique()) / len(labeled['name_feat'])
# Output: 0.6224184813880769
# The name labels cover 62% of all rows. That is too many, so I drop this feature.
labeled.drop(['name_feat'], axis='columns', inplace=True)


05. Correlation Analysis


Let's see how the features correlate with one another and, more importantly, with the price.


Code:

# Correlations between all the attributes
# (plot_correlation_map is not defined in this post; a seaborn heatmap does the same job)
plt.figure(figsize=(12, 10))
sns.heatmap(labeled.corr(), cmap='coolwarm', annot=False)
plt.show()
labeled.corr()
labeled.corr().loc[:, 'price'].abs().sort_values(ascending=False)[1:]


Output:


Correlation refers to the association between the observed values of two variables.


Two variables can be positively correlated, meaning that as one variable's value increases, the other's increases as well. They can also be negatively correlated, meaning that as one variable increases, the other decreases. Variables can also be neutral, i.e. uncorrelated. Correlation is usually quantified as a value between -1 and 1, representing perfect negative and perfect positive correlation respectively.


The computed result is called the correlation coefficient, and that coefficient can then be interpreted to describe the strength of the relationship.
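For reference (this is not in the original post): pandas' corr() uses the Pearson coefficient by default, i.e. the covariance of two columns divided by the product of their standard deviations. A quick manual check against pandas:

# Illustrative sketch: Pearson correlation computed by hand vs. pandas' built-in corr().
x = labeled['powerPS']
y = labeled['price']
r_manual = ((x - x.mean()) * (y - y.mean())).mean() / (x.std(ddof=0) * y.std(ddof=0))
print(r_manual, x.corr(y))  # the two numbers should agree up to floating-point error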


06. Preparing the Model


Code:

Y = labeled['price']
X = labeled.drop(['price'], axis='columns', inplace=False)

# Log-transform the target: prices are heavily right-skewed, and the histograms
# below show the distribution before and after applying log1p.
matplotlib.rcParams['figure.figsize'] = (12.0, 6.0)
prices = pd.DataFrame({"1. Before": Y, "2. After": np.log1p(Y)})
prices.hist()
Y = np.log1p(Y)


Output:


Code:

from sklearn.linear_model import Ridge, RidgeCV, ElasticNet, Lasso, LassoCV, LassoLarsCV
from sklearn.model_selection import cross_val_score, train_test_split

def cv_rmse(model, x, y):
    r = np.sqrt(-cross_val_score(model, x, y, scoring="neg_mean_squared_error", cv=5))
    return r

# Fraction of the data to hold out as the validation set; the rest becomes the training set
test_size = .33

# Split into train and validation
X_train, X_val, y_train, y_val = train_test_split(X, Y, test_size=test_size, random_state=3)
print(X_train.shape, X_val.shape, y_train.shape, y_val.shape)

r = range(2003, 2017)   # registration years to evaluate
km_year = 10000         # assumed kilometers driven per year (used for the value-loss estimate below)
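The cv_rmse helper above is defined but never called in the post. A minimal usage sketch (not in the original) that scores a plain Ridge baseline on the training split:

# Illustrative sketch: cross-validated RMSE of a simple Ridge baseline on the log-price target.
ridge = Ridge(alpha=1.0)
scores = cv_rmse(ridge, X_train, y_train)
print("Ridge CV RMSE (log-price): %.4f +/- %.4f" % (scores.mean(), scores.std()))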


07. Random Forest


I use GridSearchCV to find good hyperparameters for the regressor and then train the final model. The other candidate parameter values have been removed so this runs quickly here; the broader search over many parameters was done offline.


Code:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rf = RandomForestRegressor()

param_grid = {"criterion": ["mse"]   # note: scikit-learn >= 1.0 calls this criterion "squared_error"
              , "min_samples_leaf": [3]
              , "min_samples_split": [3]
              , "max_depth": [10]
              , "n_estimators": [500]}

gs = GridSearchCV(estimator=rf, param_grid=param_grid, cv=2, n_jobs=-1, verbose=1)
gs = gs.fit(X_train, y_train)
print(gs.best_score_)
print(gs.best_params_)
bp = gs.best_params_
forest = RandomForestRegressor(criterion=bp['criterion'],
                               min_samples_leaf=bp['min_samples_leaf'],
                               min_samples_split=bp['min_samples_split'],
                               max_depth=bp['max_depth'],
                               n_estimators=bp['n_estimators'])
forest.fit(X_train, y_train)


The final score comes out at 0.83. Why not give it a try?
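The post stops at the trained model, but the title promises an estimate of the average annual value loss. Below is a minimal sketch (not in the original) of how that last step could look, reusing the forest model, the label encoders in les, and the r / km_year variables defined earlier. The example car (a manual petrol VW Golf with 100 PS) and all of its feature values are illustrative assumptions.

# Illustrative sketch: predict prices for the same example car registered in different
# years, then derive the average value loss per year of age.
# Assumes forest, les, X, r (registration years) and km_year from the code above.
example = pd.DataFrame({
    'yearOfRegistration': list(r),
    'powerPS': 100,                                     # assumed engine power
    'kilometer': [(2016 - y) * km_year for y in r],     # assumed mileage: km_year per year of age
    'monthOfRegistration': 6,
    'namelen': 25,
    # encoded categorical features; the raw labels below are assumptions for the example car
    'gearbox_feat': les['gearbox'].transform(['manuell'])[0],
    'notRepairedDamage_feat': les['notRepairedDamage'].transform(['nein'])[0],
    'model_feat': les['model'].transform(['golf'])[0],
    'brand_feat': les['brand'].transform(['volkswagen'])[0],
    'fuelType_feat': les['fuelType'].transform(['benzin'])[0],
    'vehicleType_feat': les['vehicleType'].transform(['kleinwagen'])[0],
})[X.columns]  # keep the exact column order the model was trained on

pred_price = np.expm1(forest.predict(example))          # undo the log1p transform on the target
avg_yearly_loss = (pred_price[-1] - pred_price[0]) / (max(r) - min(r))
print("Estimated average value loss per year: %.0f" % avg_yearly_loss)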









朱小五
