Cleaning, Processing, and Visualizing Shanghai Metro Smart-Card Data

小猿猴GISer 2021-09-19

Editor's Note

The right way to use competition data is not to let it sit on a cloud drive taking up space.

The following article is from Yuan的数据分析 Author YuanLiang

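It has been a month and a half since my last update. After talking with readers, I found that quite a few people from geography and GIS follow this account, probably because my earlier posts were mostly GIS-related. This time I am returning to my own field with a post on transportation. Feel free to message me to discuss, especially about techniques and ideas. I also hope you will do some thinking of your own before simply asking for the data; compared with the final dataset, the analysis process and the reasoning behind it matter far more. Now for the main text.

The data is the transit smart-card dataset of Shanghai's public transport operator, released for the 2015 SODA competition. Descriptions of it and ways to obtain it are easy to find online (so I do not provide the raw data; all source code is in this post and can be copied directly). A quick look shows the fields: card id, line and station, cost, discount, and tap time (hour is a column I added later). By common sense, entering and leaving a metro station each require one tap: entry is free (cost == 0) and the fare is charged at exit. With this rule we can match a person's entry and exit records and recover the origin-destination (OD) stations of each trip. The short snippet below sorts the records by card and time to get a first look: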

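import pandas as pd

# df is the raw tap table, loaded beforehand (loading is not shown in this post)
df['timestamp'] = pd.to_datetime(df['timestamp'])
df.sort_values(['id','timestamp'])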

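As shown, the user with id 1 made one trip that day (from 大连路 (Dalian Road) to 书院 (Shuyuan), costing 9 yuan), while user 4200000172 made two trips (张江高科 (Zhangjiang High-Tech Park) to 人民广场 (People's Square), and back). Ideally, each person's tap count is even, pairing one tap with cost 0 (entry) with one tap with nonzero cost (exit). Reality is less tidy, however, as the following line shows: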

df['id'].value_counts()[df['id'].value_counts()%2 ==1]

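We find that many cards have an odd number of taps. These may include trips that began the previous day and ended today, trips that began today and ended the next day, and records with a missing entry or exit. For example: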

df[df['id']==2102265408]

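This rider's first tap on April 2 is already an exit (first row), and a little after 4 p.m. there is a free ride (no discount, yet cost is 0). Records like the ones marked in red all need to be cleaned (to keep the cleaning rule simple, the free ride is removed as well).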
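The two symptoms described so far (an odd number of taps, and a card whose first record is already an exit) can be checked on a toy table. This is only a minimal sketch: the sample rows are made up, and only the column names id, timestamp, and cost follow the dataset described in this post.

```python
import pandas as pd

# Made-up sample rows; only the column names follow the article's dataset.
toy = pd.DataFrame({
    'id': [1, 1, 2, 2, 2, 3],
    'timestamp': pd.to_datetime([
        '2015-04-02 08:00', '2015-04-02 08:40',  # card 1: entry, exit
        '2015-04-02 07:10', '2015-04-02 16:05',  # card 2: exit first, then entry
        '2015-04-02 16:50',                      # card 2: exit
        '2015-04-02 09:00',                      # card 3: entry only
    ]),
    'cost': [0, 4, 3, 0, 5, 0],
})
toy = toy.sort_values(['id', 'timestamp'])

# Cards with an odd number of taps (value_counts is computed once and reused).
counts = toy['id'].value_counts()
odd_cards = sorted(counts[counts % 2 == 1].index.tolist())

# Cards whose first tap already carries a fare, i.e. the record starts with an exit.
first_tap = toy.groupby('id').first()
exit_first_cards = sorted(first_tap[first_tap['cost'] > 0].index.tolist())

print(odd_cards)
print(exit_first_cards)
```

Note that an odd count alone does not say which tap is the orphan, which is why the cleaning below works on adjacent rows instead of on counts.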

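Another example:

My guess is that this is a metro employee passing through the gates, charged neither on entry nor on exit.

So what we want are records where the entry tap has cost == 0 and the exit tap has cost != 0, both taps share the same id, and the two tap times are adjacent, like this:

Merging the entry row and the exit row into a single row then gives the information for one metro trip by one person.

As for how to do it, I first wrote a naive loop version: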

# def get_trip(df):
#     df['index'] = list(df.index)
#     df['trip'] = -1
#     cardholder = df['id'].unique()
#     print('{} cardholders in total'.format(len(cardholder)))
#     trip = 0
#     for x, i in enumerate(cardholder):
#         if (x % 10000) == 0:
#             print('processing cardholder number {}'.format(x))
#
#         df_sub = df[df['id'] == i]
#         df_sub = df_sub.sort_values('timestamp')
#         df_sub.reset_index(inplace=True, drop=True)
#         # j avoids shadowing the outer loop variable i
#         for j in range(len(df_sub) - 1):
#             if (df_sub.loc[j, 'cost'] == 0) & (df_sub.loc[j+1, 'cost'] > 0):
#                 df.loc[df_sub.loc[j, 'index'], 'trip'] = trip
#                 df.loc[df_sub.loc[j+1, 'index'], 'trip'] = trip
#                 trip = trip + 1
#             else:
#                 continue
#     df.drop(columns='index', inplace=True)
#     return df

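In short, the loop extracts each person's records for the day, filters them, assigns a trip number to valid records (the entry and exit of one trip share the same number), and leaves the trip field of dirty records at -1. However, because of the nested loops (you know how slow Python for loops are), the program is extremely slow: the 9 million records take more than 5 hours to process. That clearly will not do, so I rewrote the code, adding a few columns for the key checks (the id and cost of the previous and next rows) and using pandas' apply function: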

### Add the helper columns used for the checks
def get_shift(df):
    df['timestamp'] = pd.to_datetime(df['timestamp'])
    df = df.sort_values(['id','timestamp'])
    df['id_shift_after'] = df['id'].shift(-1)       # id of the next row
    df['cost_shift_before'] = df['cost'].shift(1)   # cost of the previous row
    df['cost_shift_after'] = df['cost'].shift(-1)   # cost of the next row
    df['id_shift_before'] = df['id'].shift(1)       # id of the previous row
    return df

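### Decide whether a row is dirty (returns 1 for a valid row, -1 otherwise)
### Note: df here is a single row, because the function is called through apply(axis=1)
def get_trip_apply(df):
    # Entry: a free tap followed by a paid tap on the same card
    if (df['cost'] == 0) & (df['cost_shift_after'] > 0) & (df['id'] == df['id_shift_after']):
        trip = 1
    # Exit in the middle of a card's records: a paid tap preceded by a free tap on the same card
    elif (df['cost'] > 0) & (df['cost_shift_before'] == 0) & (df['id'] == df['id_shift_after']) & (df['id'] == df['id_shift_before']):
        trip = 1
    # Exit that is the card's last record
    elif (df['cost'] > 0) & (df['cost_shift_before'] == 0) & (df['id'] != df['id_shift_after']) & (df['id'] == df['id_shift_before']):
        trip = 1
    else:
        trip = -1
    return trip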

import numpy as np

### Main function: clean and reshape the data, and compute the travel time
def get_trip(df):
    df = get_shift(df)
    df['trip'] = df.apply(get_trip_apply,axis=1)
    df = df[df['trip']==1]
    df = df.drop(columns=['id_shift_after','id_shift_before','cost_shift_after','cost_shift_before'])
    # Valid rows alternate entry/exit, so every two consecutive rows form one trip
    df['trip'] = np.arange(len(df))
    df['trip'] = df['trip']//2
    # Put the entry (1) and exit (2) of each trip side by side in one row
    df = df.set_index(['trip',df.groupby('trip').cumcount()+1]).unstack().sort_index(level=1,axis=1)
    # Relies on the columns sorting alphabetically: cost, discount, hour, id, route, timestamp
    df.columns = ['ori_cost','ori_discount','ori_hour','ori_id','ori_route','ori_timestamp','des_cost','des_discount','des_hour','des_id','des_route','des_timestamp']
    # The route field holds line + station; split it at the character '线' (line)
    df['ori_station'] = df.apply(lambda x: x['ori_route'].split('线')[1],axis=1)
    df['ori_route'] = df.apply(lambda x: x['ori_route'].split('线')[0]+'线',axis=1)
    df['des_station'] = df.apply(lambda x: x['des_route'].split('线')[1],axis=1)
    df['des_route'] = df.apply(lambda x: x['des_route'].split('线')[0]+'线',axis=1)
    # total_seconds() also counts whole days; .seconds would silently drop them
    df['travel_time(minute)'] = df.apply(lambda x: round((x['des_timestamp']-x['ori_timestamp']).total_seconds()/60,3),axis=1)
    order = ['ori_cost','ori_discount','ori_hour','ori_id','ori_route','ori_station','ori_timestamp','des_cost','des_discount','des_hour','des_id','des_route','des_station','des_timestamp','travel_time(minute)']
    df = df[order]
    return df

%%time
df_clean = get_trip(df)
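The reshaping step inside get_trip is the least obvious part, so here is a minimal sketch of it on made-up rows. Once the dirty rows are gone, every two consecutive rows form one trip: integer division by 2 numbers the trips, and cumcount plus unstack turns each pair into one wide row. The sketch names the columns through explicit ori_/des_ prefixes instead of a positional list, which is a variation for illustration and not the exact code of this post.

```python
import numpy as np
import pandas as pd

# Made-up, already-clean rows: entry and exit alternate (sample values only).
clean = pd.DataFrame({
    'id': [1, 1, 7, 7],
    'route': ['A', 'B', 'C', 'D'],
    'cost': [0, 9, 0, 4],
})

# Rows 0-1 become trip 0, rows 2-3 become trip 1.
clean['trip'] = np.arange(len(clean)) // 2
# Position inside the trip: 1 = entry, 2 = exit.
leg = clean.groupby('trip').cumcount() + 1

# One row per trip, with a (column, leg) MultiIndex on the columns.
wide = clean.set_index(['trip', leg]).unstack()
# Flatten the MultiIndex into ori_* / des_* names.
wide.columns = [('ori_' if k == 1 else 'des_') + col for col, k in wide.columns]
print(wide)
```

Deriving the names from the MultiIndex avoids depending on the alphabetical order of the original columns.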

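Running it yields the cleaned data (more than 4.4 million trip records, including the OD line and station, the entry and exit times, the fare, and the time spent from entry to exit), and it is much faster than the nested loops. (If you have a better approach, feel free to message me; mine is honestly still rather slow.)

With this data there is a lot one can analyze. Here I recommend a paper that Zhao Pengjun (赵鹏军) wrote using this dataset:

The story is engaging and stays with you long after reading.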
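As one possible answer to the request for faster approaches: the row-wise apply can be replaced by boolean masks computed on whole columns. The sketch below is a vectorized formulation of the same pairing rule (a tap with cost == 0 immediately followed by a tap with cost > 0 on the same card). The function name flag_trip_rows and the sample rows are made up for illustration, and the code has not been benchmarked on the real dataset.

```python
import pandas as pd

def flag_trip_rows(df):
    """Sort by card and time, then keep only rows that form valid entry/exit pairs."""
    df = df.sort_values(['id', 'timestamp']).reset_index(drop=True)
    same_as_next = df['id'].eq(df['id'].shift(-1))
    # Entry row: a free tap followed by a paid tap on the same card.
    is_entry = df['cost'].eq(0) & df['cost'].shift(-1).gt(0) & same_as_next
    # Exit row: simply the row right after an entry row.
    is_exit = is_entry.shift(1, fill_value=False)
    return df[is_entry | is_exit]

# Made-up sample rows; only the column names follow the article's dataset.
toy = pd.DataFrame({
    'id': [1, 1, 2, 2, 2, 3, 4],
    'timestamp': pd.to_datetime([
        '2015-04-02 08:00', '2015-04-02 08:40',  # card 1: valid pair
        '2015-04-02 07:10',                      # card 2: orphan exit
        '2015-04-02 16:05', '2015-04-02 16:50',  # card 2: valid pair
        '2015-04-02 09:00',                      # card 3: orphan entry
        '2015-04-02 09:30',                      # card 4: orphan exit
    ]),
    'cost': [0, 4, 3, 0, 5, 0, 6],
})
kept = flag_trip_rows(toy)
print(kept)
```

The orphan entry of card 3 is directly followed by the orphan exit of card 4, and the id comparison is what prevents the two from being paired.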

https://doi.org/10.1016/j.tranpol.2020.03.006

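Recommendation over, back to the topic. Today's topic is visualizing passenger-flow patterns (mainly the flows between origins and destinations). One type of figure has become very popular in recent years: a circle with flows running across it, formally called a chord diagram, something like this:

That one was drawn in R. Python can manage it too (though the result is not as pretty as in R), using the holoviews library.

First import holoviews: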

import holoviews as hv
from holoviews import opts, dim
import matplotlib.pyplot as plt
plt.rcParams['font.sans-serif']=['Arial Unicode MS']
plt.rcParams['axes.unicode_minus']=False
hv.extension('bokeh')

Then draw a chord diagram of the passenger flows between lines:

# Node table: one integer index per line
station = df_clean.iloc[:,4]  # column 4 is ori_route
station = station.drop_duplicates()
station = station.reset_index(drop=True).reset_index()
station.columns = ['index','route']
# Edge table: number of trips for each (origin line, destination line) pair
od = df_clean.groupby(['ori_route','des_route'])['ori_cost'].count().to_frame().reset_index()
od['ori_route'] = od['ori_route'].map(station.set_index('route').to_dict()['index'])
od['des_route'] = od['des_route'].map(station.set_index('route').to_dict()['index'])
nodes = hv.Dataset(station, 'index', 'route')
chord = hv.Chord((od, nodes), ['ori_route', 'des_route'], ['ori_cost'])
chord.opts(
    opts.Chord(cmap='glasbey', edge_color=dim('ori_route').str(),
               labels='route', node_color=dim('index').str(), width=1000, height=1000,
               node_size=8, edge_alpha=0.4, label_text_font_size='12pt'))

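Not bad. Lines 1 and 2 are clearly still the heavyweights, with very large flows both entering and exiting. We can also analyze OD flows at the station level; here the flows among the 30 stations with the most entries are visualized: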

# Node table: one integer index per station
station = df_clean.iloc[:,5]  # column 5 is ori_station
station = station.drop_duplicates()
station = station.reset_index(drop=True).reset_index()
station.columns = ['index','station']
# Edge table: number of trips for each (origin station, destination station) pair
od = df_clean.groupby(['ori_station','des_station'])['ori_cost'].count().to_frame().reset_index()
od['ori_station'] = od['ori_station'].map(station.set_index('station').to_dict()['index'])
od['des_station'] = od['des_station'].map(station.set_index('station').to_dict()['index'])
nodes = hv.Dataset(station, 'index', 'station')
chord = hv.Chord((od, nodes), ['ori_station', 'des_station'], ['ori_cost'])
# Keep the 30 stations with the largest entry counts
top30 = list(od.groupby('ori_station')['ori_cost'].sum().to_frame().sort_values('ori_cost').iloc[-30:].index.values)
top30station = chord.select(ori_station=top30, selection_mode='nodes')
top30station.opts(
    opts.Chord(cmap='glasbey_light', edge_color=dim('ori_station').str(),
               labels='station', node_color=dim('index').str(), width=1000, height=1000,
               node_size=8, edge_alpha=0.4, label_text_font_size='8pt'))

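And that is the cover image. It looks cool, but it is cluttered and hard to interpret.

That is all for this post. If it helped you, please tap "在看" (Wow) at the bottom right, and tips are of course even more welcome. Thanks, everyone, and see you next time!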
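Because the chord diagram is hard to read, a plain origin-destination table for the busiest stations can serve as a complement. The following is a minimal sketch on made-up trips. It assumes a frame with the ori_station and des_station columns produced by the cleaning step, and the helper name top_od_table is hypothetical.

```python
import pandas as pd

def top_od_table(trips, k):
    """Trip counts among the k stations with the most entries, as a matrix."""
    top = trips['ori_station'].value_counts().nlargest(k).index
    # Keep only trips whose origin and destination are both among the top k.
    sub = trips[trips['ori_station'].isin(top) & trips['des_station'].isin(top)]
    return pd.crosstab(sub['ori_station'], sub['des_station'])

# Made-up trips between three placeholder stations A, B, and C.
trips = pd.DataFrame({
    'ori_station': ['A', 'A', 'A', 'B', 'B', 'C'],
    'des_station': ['B', 'B', 'C', 'A', 'C', 'A'],
})
table = top_od_table(trips, k=2)
print(table)
```

Rows are origins and columns are destinations, so a single cell can be looked up directly, which the chord diagram does not allow.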

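Previous posts

Exploratory Data Analysis of Chengdu POI Data with Python

Geographically Weighted Regression (GWR) in Python and Mining Ride-Hailing Order Data

Analyzing the Accessibility Ranges of Multiple Travel Modes with Python

Retrieving and Visualizing Nationwide Population Migration OD Data with Python

Building a Spatial Lag Model in Python to Analyze the Factors Influencing Ride-Hailing Trip Volume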
