Hands-on: Predicting Flight Fares with Machine Learning
The following article is from 机器学习社区 Author 机器猫
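Hi everyone. Hands-on practice is the best way to learn machine learning. In this article we will use machine learning methods to predict flight fares. To make it easy to follow along, the complete code and data are provided at the end of the article.
About the dataset
The data comes from a Kaggle competition. I will first describe the dataset's fields to make the later analysis easier to follow.
Airline: the airline operating the flight, e.g. IndiGo, Jet Airways, Air India
Date_of_Journey: the date on which the passenger's journey starts
Source: the place where the passenger's journey starts
Destination: the place the passenger wants to travel to
Route: the route the passenger takes from the source to the destination
Arrival_Time: the time the passenger arrives at the destination
Duration: the total time the flight takes to travel from the source to the destination
Total_Stops: the number of stops made during the whole journey
Additional_Info: information about meals, meal types and other amenities
Price: the fare for the complete journey, including all charges before boarding
Importing libraries
To prepare for the fare prediction, we first import the libraries we need.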
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as mse
from sklearn.metrics import r2_score
from math import sqrt
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from prettytable import PrettyTable
Reading the data
train_df = pd.read_excel("Data_Train.xlsx")
train_df.head(10)
Exploratory data analysis (EDA)
Let's look at the columns in the dataset.
train_df.columns
#Output
Index(['Airline', 'Date_of_Journey', 'Source', 'Destination', 'Route',
'Dep_Time', 'Arrival_Time', 'Duration', 'Total_Stops',
'Additional_Info', 'Price'],
dtype='object')
Here we can get more information about the dataset.
train_df.info()
train_df.describe()
train_df.isnull().head()
train_df.isnull().sum()
#output
Airline 0
Date_of_Journey 0
Source 0
Destination 0
Route 1
Dep_Time 0
Arrival_Time 0
Duration 0
Total_Stops 1
Additional_Info 0
Price 0
dtype: int64
Dropping NaN values
train_df.dropna(inplace = True)
Duplicate rows
train_df[train_df.duplicated()].head()
train_df.drop_duplicates(keep='first',inplace=True)
train_df.head()
train_df.shape
#output
(10462, 11)
Check the Additional_Info column and count how often each unique value occurs.
train_df["Additional_Info"].value_counts()
#output
No info 8182
In-flight meal not included 1926
No check-in baggage included 318
1 Long layover 19
Change airports 7
Business class 4
No Info 3
1 Short layover 1
2 Long layover 1
Red-eye flight 1
Name: Additional_Info, dtype: int64
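The counts above list "No info" (8182) and "No Info" (3) as separate categories even though they differ only in capitalisation. The article leaves them as they are. If you would rather treat them as one category, a minimal sketch on a toy column (the sample values are made up for illustration) could look like this:

```python
import pandas as pd

# Toy column reproducing the two spellings seen in the value counts.
info = pd.Series(["No info", "No Info", "No info", "Business class"])

# Map the rarer spelling onto the common one, then recount.
cleaned = info.replace({"No Info": "No info"})
counts = cleaned.value_counts()
print(counts)
```

On the real data the same `replace` call would be applied to `train_df["Additional_Info"]` (and to the test set) before encoding.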
Check the different airlines
train_df["Airline"].unique()
#output
array(['IndiGo', 'Air India', 'Jet Airways', 'SpiceJet',
'Multiple carriers', 'GoAir', 'Vistara', 'Air Asia',
'Vistara Premium economy', 'Jet Airways Business',
'Multiple carriers Premium economy', 'Trujet'], dtype=object)
Check the different routes
train_df["Route"].unique()
# output
Now let's look at our test dataset
test_df = pd.read_excel("Test_set.xlsx")
test_df.head(10)
test_df.columns
#output
Index(['Airline', 'Date_of_Journey', 'Source', 'Destination', 'Route',
'Dep_Time', 'Arrival_Time', 'Duration', 'Total_Stops',
'Additional_Info'],
dtype='object')
Information about the dataset
test_df.info()
test_df.describe()
test_df.isnull().sum()
# output
Airline 0
Date_of_Journey 0
Source 0
Destination 0
Route 0
Dep_Time 0
Arrival_Time 0
Duration 0
Total_Stops 0
Additional_Info 0
dtype: int64
Plot Price against Airline
Violin plot of Price against Source
sns.catplot(y = "Price", x = "Source", data = train_df.sort_values("Price", ascending = False), kind="violin", height = 4, aspect = 3)
plt.show()
Box plot of Price against Destination
sns.catplot(y = "Price", x = "Destination", data = train_df.sort_values("Price", ascending = False), kind="box", height = 4, aspect = 3)
plt.show()
Feature engineering
First, look at the data
train_df.head()
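# Duration: convert strings such as "2h 50m" into total minutes.
# "2h 50m" becomes the expression "2*60+50*1", which eval turns into 170.
# eval executes whatever text it is given, so only use this on trusted data.
train_df['Duration'] = train_df['Duration'].str.replace("h", '*60').str.replace(' ','+').str.replace('m','*1').apply(eval)
test_df['Duration'] = test_df['Duration'].str.replace("h", '*60').str.replace(' ','+').str.replace('m','*1').apply(eval)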
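The conversion above builds an arithmetic expression from each string and evaluates it. As an alternative that avoids `eval`, here is a small sketch that parses the hours and minutes with a regular expression. The helper name `duration_to_minutes` is hypothetical, and it assumes durations always look like "2h 50m", "19h" or "5m".

```python
import re

def duration_to_minutes(text):
    """Convert strings like '2h 50m', '19h' or '5m' to total minutes."""
    match = re.fullmatch(r"\s*(?:(\d+)h)?\s*(?:(\d+)m)?\s*", text)
    # Reject strings that do not fit the pattern or contain neither part.
    if match is None or not any(match.groups()):
        raise ValueError(f"Unrecognised duration: {text!r}")
    hours, minutes = (int(g) if g else 0 for g in match.groups())
    return hours * 60 + minutes

print(duration_to_minutes("2h 50m"))
print(duration_to_minutes("19h"))
```

It can be used in place of the chained replacements, for example `train_df['Duration'].apply(duration_to_minutes)`, and it raises an error on unexpected values instead of evaluating them.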
Date_of_Journey: split the journey date into day and month so that it is easier to use in the modelling stage.
train_df["Journey_day"] = train_df['Date_of_Journey'].str.split('/').str[0].astype(int)
train_df["Journey_month"] = train_df['Date_of_Journey'].str.split('/').str[1].astype(int)
train_df.drop(["Date_of_Journey"], axis = 1, inplace = True)
Dep_Time: convert the departure time into hour and minute
train_df["Dep_hour"] = pd.to_datetime(train_df["Dep_Time"]).dt.hour
train_df["Dep_min"] = pd.to_datetime(train_df["Dep_Time"]).dt.minute
train_df.drop(["Dep_Time"], axis = 1, inplace = True)
Arrival_Time: convert the arrival time into hour and minute.
train_df["Arrival_hour"] = pd.to_datetime(train_df.Arrival_Time).dt.hour
train_df["Arrival_min"] = pd.to_datetime(train_df.Arrival_Time).dt.minute
train_df.drop(["Arrival_Time"], axis = 1, inplace = True)
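The date and time steps above are applied to train_df only, while the test set will presumably need the same columns before it can be fed to a model. One way to keep the two in sync is to put the steps in a function and call it on both frames. The sketch below is an illustration, not the article's code: the helper name `add_datetime_features` is hypothetical, dates are assumed to be in day/month/year form, and times are assumed to start with "HH:MM". It reads the hour and minute with a regular expression rather than `pd.to_datetime`, so it does not depend on how date formats are inferred.

```python
import pandas as pd

def add_datetime_features(df):
    """Return a copy of df with day/month and hour/minute columns (hypothetical helper)."""
    out = df.copy()
    # Date_of_Journey is assumed to look like "24/03/2019" (day/month/year).
    date_parts = out["Date_of_Journey"].str.split("/", expand=True)
    out["Journey_day"] = date_parts[0].astype(int)
    out["Journey_month"] = date_parts[1].astype(int)
    # Times are assumed to start with "HH:MM"; any trailing text such as " 22 Mar" is ignored.
    for col, prefix in [("Dep_Time", "Dep"), ("Arrival_Time", "Arrival")]:
        hm = out[col].str.extract(r"^(\d{1,2}):(\d{2})")
        out[prefix + "_hour"] = hm[0].astype(int)
        out[prefix + "_min"] = hm[1].astype(int)
    return out.drop(columns=["Date_of_Journey", "Dep_Time", "Arrival_Time"])

# Made-up rows in the same shape as the dataset's columns.
demo = pd.DataFrame({
    "Date_of_Journey": ["24/03/2019", "1/05/2019"],
    "Dep_Time": ["22:20", "05:50"],
    "Arrival_Time": ["01:10 22 Mar", "13:15"],
})
features = add_datetime_features(demo)
print(features)
```

Calling the same function on the raw training and test frames would give both the same engineered columns.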
After this final preprocessing step, let's take a look at the dataset
plt.figure(figsize = (10, 5))
plt.title('Count of flights month wise')
ax=sns.countplot(x = 'Journey_month', data = train_df)
plt.xlabel('Month')
plt.ylabel('Count of flights')
# Write the count above each bar
for p in ax.patches:
    ax.annotate(int(p.get_height()), (p.get_x()+0.25, p.get_height()+1), va='bottom', color= 'black')
Bar chart of the number of flights per airline
plt.figure(figsize = (20,5))
plt.title('Count of flights with different Airlines')
ax=sns.countplot(x = 'Airline', data =train_df)
plt.xlabel('Airline')
plt.ylabel('Count of flights')
plt.xticks(rotation = 45)
# Write the count above each bar
for p in ax.patches:
    ax.annotate(int(p.get_height()), (p.get_x()+0.25, p.get_height()+1), va='bottom', color= 'black')
Plot ticket price against airline
plt.figure(figsize = (15,4))
plt.title('Price VS Airlines')
plt.scatter(train_df['Airline'], train_df['Price'])
plt.xlabel('Airline')
plt.ylabel('Price of ticket')
plt.xticks(rotation = 90)
Correlation between all features
Plot the correlations
plt.figure(figsize = (15,15))
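# Correlate the numeric columns only; string columns such as Airline cannot be correlated directly
sns.heatmap(train_df.select_dtypes(include='number').corr(), annot = True, cmap = "RdYlGn")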
plt.show()
data = train_df.drop(["Price"], axis=1)
Handling categorical and numerical data
train_categorical_data = data.select_dtypes(exclude=['int64', 'float','int32'])
train_numerical_data = data.select_dtypes(include=['int64', 'float','int32'])
test_categorical_data = test_df.select_dtypes(exclude=['int64', 'float','int32'])
test_numerical_data = test_df.select_dtypes(include=['int64', 'float','int32'])
train_categorical_data.head()
le = LabelEncoder()
train_categorical_data = train_categorical_data.apply(LabelEncoder().fit_transform)
test_categorical_data = test_categorical_data.apply(LabelEncoder().fit_transform)
train_categorical_data.head()
X = pd.concat([train_categorical_data, train_numerical_data], axis=1)
y = train_df['Price']
test_set = pd.concat([test_categorical_data, test_numerical_data], axis=1)
X.head()
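One thing to keep in mind with the encoding above: a separate LabelEncoder is fitted on the training and test columns, so the same category can receive different integers in the two sets whenever their sets of values differ. A possible way around this, shown here only as a sketch on made-up values, is to learn the category list from the training column and reuse it for the test column. It uses `pd.Categorical` rather than `LabelEncoder`, which is a swapped-in technique and not the article's method.

```python
import pandas as pd

train_col = pd.Series(["IndiGo", "Air India", "Jet Airways", "IndiGo"])
test_col = pd.Series(["Jet Airways", "IndiGo", "Trujet"])

# Learn the category -> code mapping from the training column only.
categories = sorted(train_col.unique())
train_codes = pd.Categorical(train_col, categories=categories).codes

# Reuse the same categories for the test column; unseen values become -1.
test_codes = pd.Categorical(test_col, categories=categories).codes

print(dict(zip(categories, range(len(categories)))))
print(train_codes.tolist(), test_codes.tolist())
```

Here "Jet Airways" and "IndiGo" keep the same codes in both columns, and the unseen "Trujet" is flagged with -1 so it can be handled explicitly.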
Building the model
1. Define the mean absolute percentage error (MAPE) used to evaluate the models
# Calculating Mean Absolute Percentage Error
def mean_absolute_percentage_error(y_true, y_pred):
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100
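As a quick sanity check of this metric, it helps to run it on values that can be verified by hand. The sketch below repeats the function so that it runs on its own; the sample prices are made up.

```python
import numpy as np

def mean_absolute_percentage_error(y_true, y_pred):
    y_true, y_pred = np.array(y_true, dtype=float), np.array(y_pred, dtype=float)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

# Each prediction is off by 10% of its true value (10/100 and 20/200),
# so the mean absolute percentage error should come out at about 10.
mape = mean_absolute_percentage_error([100, 200], [110, 180])
print(round(mape, 6))
```

Because the error is divided by `y_true`, the metric is undefined when a true value is zero. That should not be an issue for ticket prices, which are positive.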
2. Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)
The data statistics are as follows
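The figures for the split are not reproduced in the text. The sizes can be estimated from the numbers that are given: 10462 rows remain after dropping missing values and duplicates, and test_size is 0.3. The sketch below assumes that train_test_split rounds the test-set size up and gives the remaining rows to the training set, which is my understanding of scikit-learn's behaviour when only test_size is passed.

```python
import math

n_samples = 10462  # rows left after dropna() and drop_duplicates(), per train_df.shape
test_size = 0.3

# Assumption: the test size is rounded up and the rest goes to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)
```

The actual values can be confirmed by printing `X_train.shape` and `X_test.shape` after the split.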
3. The full model training workflow
Ridge Regression
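The training code for this step is not included in the text, yet the comparison table further down uses variables such as ridge_score and ridge_score_test. As a rough sketch of how such values might be produced, the block below fits the three models named in that table and records their R² on the training and test splits. Everything here is an assumption made for illustration: the helper `fit_and_score` is hypothetical, the hyperparameters are arbitrary, the data is synthetic, and the scores are taken to be the R² returned by each estimator's `score` method. The original may well have used a different metric or a grid search.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the encoded flight features.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5, 0.0, 1.0]) + 10 + rng.normal(scale=0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

def fit_and_score(model, X_tr, y_tr, X_te, y_te):
    """Fit the model and return (train R^2, test R^2). Hypothetical helper."""
    model.fit(X_tr, y_tr)
    return model.score(X_tr, y_tr), model.score(X_te, y_te)

ridge_score, ridge_score_test = fit_and_score(Ridge(alpha=1.0), X_tr, y_tr, X_te, y_te)
lasso_score, lasso_score_test = fit_and_score(Lasso(alpha=0.1), X_tr, y_tr, X_te, y_te)
decision_score, decision_score_test = fit_and_score(
    DecisionTreeRegressor(random_state=42), X_tr, y_tr, X_te, y_te)

print(ridge_score_test, lasso_score_test, decision_score_test)
```

On the real data, `X_train`, `X_test`, `y_train` and `y_test` from the split above would take the place of the synthetic arrays. An unpruned decision tree usually scores close to 1.0 on its own training data, so the test score is the more informative column when comparing models.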
Comparing all the models
# Comparing all the models
models = pd.DataFrame({
'Model': [ 'Ridge Regression', 'Lasso Regression','Decision Tree Regressor'],
'Score': [ ridge_score, lasso_score, decision_score],
'Test Score': [ ridge_score_test, lasso_score_test, decision_score_test]})
models.sort_values(by='Test Score', ascending=False)
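Conclusion
As we have seen, we have walked through a complete model development workflow: data exploration, feature engineering, data visualization, and making predictions with machine learning models. You can of course try more complex models as well. I will also share the data used in this article.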
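1. Follow the official account below and tap the icon in the upper-right corner;
2. Reply with the keyword 「航班」 in the account's chat to download the files: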