
Data in Practice | Predicting a Used Car's Average Annual Value Loss

凹凸数据 2021-08-09

The following article is from Python数据分析实战与AI干货 Author 木木



Hi everyone, I'm 朱小五.


Since you all like hands-on projects, here's another one to share!


Data download link:

https://www.kaggle.com/orgesleka/used-cars-database


Let's get to work.

01. Preparing the Data


Dataset:

The eBay Kleinanzeigen used-car dataset

[data on more than 370,000 used cars]


Field descriptions:

  • dateCrawled : date when the ad was first crawled
  • name : name of the car
  • seller : private seller or dealer
  • offerType : type of listing
  • price : asking price in the ad
  • abtest : A/B test group
  • vehicleType : vehicle type
  • yearOfRegistration : year in which the car was first registered
  • gearbox : gearbox type
  • powerPS : power of the car in PS
  • model : car model
  • kilometer : number of kilometers the car has driven
  • monthOfRegistration : month in which the car was first registered
  • fuelType : fuel type
  • brand : brand
  • notRepairedDamage : whether the car has damage that has not yet been repaired
  • dateCreated : date the ad was first created on eBay
  • nrOfPictures : number of pictures in the ad
  • postalCode : postal code
  • lastSeen : when the crawler last saw this ad online


Code:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model, preprocessing, svm
from sklearn.preprocessing import StandardScaler, Normalizer
import math
import matplotlib
import seaborn as sns
df = pd.read_csv('../autos.csv')  # if this raises a UnicodeDecodeError, try encoding='latin-1'


02. Cleaning the Data


Code:

# Take a look at the summary statistics of the numeric fields
df.describe()

# Drop columns we won't use
df.drop(['seller', 'offerType', 'abtest', 'dateCrawled', 'nrOfPictures', 'lastSeen', 'postalCode', 'dateCreated'], axis='columns', inplace=True)

# Remove duplicates and NaNs, and define sensible value ranges for the columns
print("Too new: %d" % df.loc[df.yearOfRegistration >= 2017].count()['name'])
print("Too old: %d" % df.loc[df.yearOfRegistration < 1950].count()['name'])
print("Too cheap: %d" % df.loc[df.price < 100].count()['name'])
print("Too expensive: " , df.loc[df.price > 150000].count()['name'])
print("Too few km: " , df.loc[df.kilometer < 5000].count()['name'])
print("Too many km: " , df.loc[df.kilometer > 200000].count()['name'])
print("Too few PS: " , df.loc[df.powerPS < 10].count()['name'])
print("Too many PS: " , df.loc[df.powerPS > 500].count()['name'])
print("Fuel types: " , df['fuelType'].unique())
#print("Offer types: " , df['offerType'].unique())
#print("Sellers: " , df['seller'].unique())
print("Damages: " , df['notRepairedDamage'].unique())
#print("Pics: " , df['nrOfPictures'].unique()) # nrOfPictures : number of pictures in the ad (unfortunately this field contains everywhere a 0 and is thus useless (bug in crawler!) )
#print("Postale codes: " , df['postalCode'].unique())
print("Vehicle types: " , df['vehicleType'].unique())
print("Brands: " , df['brand'].unique())

# Cleaning data
#valid_models = df.dropna()

#### Removing the duplicates
dedups = df.drop_duplicates(['name', 'price', 'vehicleType', 'yearOfRegistration',
                             'gearbox', 'powerPS', 'model', 'kilometer', 'monthOfRegistration',
                             'fuelType', 'notRepairedDamage'])

#### Removing the outliers
dedups = dedups[
    (dedups.yearOfRegistration <= 2016)
    & (dedups.yearOfRegistration >= 1950)
    & (dedups.price >= 100)
    & (dedups.price <= 150000)
    & (dedups.powerPS >= 10)
    & (dedups.powerPS <= 500)]

print("-----------------\nData kept for analisys: %d percent of the entire set\n-----------------" % (100 * dedups['name'].count() / df['name'].count()))


Output:


Handle the null values:

dedups.isnull().sum()
dedups['notRepairedDamage'].fillna(value='not-declared', inplace=True)
dedups['fuelType'].fillna(value='not-declared', inplace=True)
dedups['gearbox'].fillna(value='not-declared', inplace=True)
dedups['vehicleType'].fillna(value='not-declared', inplace=True)
dedups['model'].fillna(value='not-declared', inplace=True)


03. Visualization


Let's look at a few charts to understand how the data is distributed across categories.


Code:

categories = ['gearbox', 'model', 'brand', 'vehicleType', 'fuelType', 'notRepairedDamage']

for i, c in enumerate(categories):
    v = dedups[c].unique()

    g = dedups.groupby(by=c)[c].count().sort_values(ascending=False)
    r = range(min(len(v), 5))

    print(g.head())
    plt.figure(figsize=(5, 3))
    plt.bar(r, g.head())
    #plt.xticks(r, v)
    plt.xticks(r, g.head().index)  # label only the bars that are actually plotted
    plt.show()


Output (one of the charts as an example):


04. Feature Engineering


Add the length of the name as a feature to see how much a long description affects the price.


Code:

dedups['namelen'] = [min(70, len(n)) for n in dedups['name']]

ax = sns.jointplot(x='namelen',
                   y='price',
                   data=dedups[['namelen', 'price']],
                   # data=dedups[['namelen','price']][dedups['model']=='golf'],
                   alpha=0.1,
                   height=8)  # older seaborn versions call this parameter size=8


Output:


Names between roughly 15 and 30 characters seem to go with better selling prices.


One possible explanation is that a longer name lists more options and accessories, so the price is naturally higher. Very short and very long names don't work as well.
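A quick numeric check of this claim (not in the original post; a minimal sketch that only assumes the dedups DataFrame from the cleaning step above) is to bin namelen and compare the median price per bin:

# Illustrative sketch: median price per name-length bin (assumes dedups from above).
bins = pd.cut(dedups['namelen'], bins=[0, 15, 30, 45, 70])
print(dedups.groupby(bins)['price'].median())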


Code:

labels = ['name', 'gearbox', 'notRepairedDamage', 'model', 'brand', 'fuelType', 'vehicleType']
les = {}

for l in labels:
    les[l] = preprocessing.LabelEncoder()
    les[l].fit(dedups[l])
    tr = les[l].transform(dedups[l])
    dedups.loc[:, l + '_feat'] = pd.Series(tr, index=dedups.index)

labeled = dedups[['price'
                  , 'yearOfRegistration'
                  , 'powerPS'
                  , 'kilometer'
                  , 'monthOfRegistration'
                  , 'namelen']
                 + [x + "_feat" for x in labels]].copy()  # copy so the drop below doesn't warn

len(labeled['name_feat'].unique()) / len(labeled['name_feat'])
# Output: 0.6224184813880769
# The name labels cover 62% of all rows. That is too many, so I drop this feature.
labeled.drop(['name_feat'], axis='columns', inplace=True)


05. Correlation Analysis


Let's see how the features correlate with one another and, more importantly, with the price.


Code:

# Correlations between all the attributes
# (plot_correlation_map is not defined in this post; a seaborn heatmap does the same job)
plt.figure(figsize=(12, 10))
sns.heatmap(labeled.corr(), cmap='coolwarm', annot=False)
plt.show()
labeled.corr()
labeled.corr().loc[:, 'price'].abs().sort_values(ascending=False)[1:]


Output:


Correlation refers to the association between the observed values of two variables.


Two variables can be positively correlated, meaning that as one variable's value increases, the other's increases as well. They can also be negatively correlated, meaning that as one variable increases, the other decreases. Variables can also be neutral, i.e. uncorrelated. Correlation is usually quantified as a value between -1 and 1, representing perfect negative and perfect positive correlation respectively.


The computed result is called the correlation coefficient, and that coefficient can then be interpreted to describe the strength of the relationship.
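For reference (this is not in the original post): pandas' corr() uses the Pearson coefficient by default, i.e. the covariance of two columns divided by the product of their standard deviations. A quick manual check against pandas:

# Illustrative sketch: Pearson correlation computed by hand vs. pandas' built-in corr().
x = labeled['powerPS']
y = labeled['price']
r_manual = ((x - x.mean()) * (y - y.mean())).mean() / (x.std(ddof=0) * y.std(ddof=0))
print(r_manual, x.corr(y))  # the two numbers should agree up to floating-point error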


06. Preparing the Model


Code:

Y = labeled['price']
X = labeled.drop(['price'], axis='columns', inplace=False)

# Log-transform the target: prices are heavily right-skewed, and the histograms
# below show the distribution before and after applying log1p.
matplotlib.rcParams['figure.figsize'] = (12.0, 6.0)
prices = pd.DataFrame({"1. Before": Y, "2. After": np.log1p(Y)})
prices.hist()
Y = np.log1p(Y)


Output:


Code:

from sklearn.linear_model import Ridge, RidgeCV, ElasticNet, Lasso, LassoCV, LassoLarsCV
from sklearn.model_selection import cross_val_score, train_test_split

def cv_rmse(model, x, y):
    r = np.sqrt(-cross_val_score(model, x, y, scoring="neg_mean_squared_error", cv=5))
    return r

# Fraction of the data to hold out as the validation set; the rest becomes the training set
test_size = .33

# Split into train and validation
X_train, X_val, y_train, y_val = train_test_split(X, Y, test_size=test_size, random_state=3)
print(X_train.shape, X_val.shape, y_train.shape, y_val.shape)

r = range(2003, 2017)   # registration years to evaluate
km_year = 10000         # assumed kilometers driven per year (used for the value-loss estimate below)
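The cv_rmse helper above is defined but never called in the post. A minimal usage sketch (not in the original) that scores a plain Ridge baseline on the training split:

# Illustrative sketch: cross-validated RMSE of a simple Ridge baseline on the log-price target.
ridge = Ridge(alpha=1.0)
scores = cv_rmse(ridge, X_train, y_train)
print("Ridge CV RMSE (log-price): %.4f +/- %.4f" % (scores.mean(), scores.std()))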


07. Random Forest


I use GridSearchCV to find good hyperparameters for the regressor and then train the final model. The other candidate parameter values have been removed so this runs quickly here; the broader search over many parameters was done offline.


Code:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rf = RandomForestRegressor()

param_grid = {"criterion": ["mse"]   # note: scikit-learn >= 1.0 calls this criterion "squared_error"
              , "min_samples_leaf": [3]
              , "min_samples_split": [3]
              , "max_depth": [10]
              , "n_estimators": [500]}

gs = GridSearchCV(estimator=rf, param_grid=param_grid, cv=2, n_jobs=-1, verbose=1)
gs = gs.fit(X_train, y_train)
print(gs.best_score_)
print(gs.best_params_)
bp = gs.best_params_
forest = RandomForestRegressor(criterion=bp['criterion'],
                               min_samples_leaf=bp['min_samples_leaf'],
                               min_samples_split=bp['min_samples_split'],
                               max_depth=bp['max_depth'],
                               n_estimators=bp['n_estimators'])
forest.fit(X_train, y_train)


The final score comes out at 0.83. Why not give it a try?
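The post stops at the trained model, but the title promises an estimate of the average annual value loss. Below is a minimal sketch (not in the original) of how that last step could look, reusing the forest model, the label encoders in les, and the r / km_year variables defined earlier. The example car (a manual petrol VW Golf with 100 PS) and all of its feature values are illustrative assumptions.

# Illustrative sketch: predict prices for the same example car registered in different
# years, then derive the average value loss per year of age.
# Assumes forest, les, X, r (registration years) and km_year from the code above.
example = pd.DataFrame({
    'yearOfRegistration': list(r),
    'powerPS': 100,                                     # assumed engine power
    'kilometer': [(2016 - y) * km_year for y in r],     # assumed mileage: km_year per year of age
    'monthOfRegistration': 6,
    'namelen': 25,
    # encoded categorical features; the raw labels below are assumptions for the example car
    'gearbox_feat': les['gearbox'].transform(['manuell'])[0],
    'notRepairedDamage_feat': les['notRepairedDamage'].transform(['nein'])[0],
    'model_feat': les['model'].transform(['golf'])[0],
    'brand_feat': les['brand'].transform(['volkswagen'])[0],
    'fuelType_feat': les['fuelType'].transform(['benzin'])[0],
    'vehicleType_feat': les['vehicleType'].transform(['kleinwagen'])[0],
})[X.columns]  # keep the exact column order the model was trained on

pred_price = np.expm1(forest.predict(example))          # undo the log1p transform on the target
avg_yearly_loss = (pred_price[-1] - pred_price[0]) / (max(r) - min(r))
print("Estimated average value loss per year: %.0f" % avg_yearly_loss)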









朱小五
