Using Pandas to Do Data Science Better

Kevin Markham 大邓和他的Python 2019-04-26

Author: Kevin Markham

GitHub: https://github.com/justmarkham
YouTube: https://www.youtube.com/dataschool

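Whether or not your English is strong, I recommend watching this tutorial; the whole thing takes about an hour. I was stunned after watching it and discovered that pandas can be used in ways I had not imagined. Below are a few of the features that impressed me most.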

import pandas as pd
import matplotlib.pyplot as plt

# Required so that plots render inline in the notebook
%matplotlib inline

# The dataset is Rhode Island police data; ri stands for Rhode Island
ri = pd.read_csv('police.csv')

# Show the first 5 rows
ri.head()

In the column dtypes, object is effectively the string type, float64 is the floating-point numeric type, and bool is Boolean (True, False).
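To make the dtype remark concrete, here is a minimal sketch on a made-up three-column frame (the values are invented for illustration and are not taken from police.csv). It only shows how to inspect dtypes; the exact label for text columns can depend on the pandas version.

```python
import pandas as pd

# Made-up stand-in for a few columns of the police data
toy = pd.DataFrame({
    "violation": ["Speeding", "Other"],
    "driver_age": [25.0, 41.0],
    "search_conducted": [False, True],
})

# One dtype per column; text columns usually appear as object
# (some newer pandas versions may report a dedicated string dtype instead)
print(toy.dtypes)
```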

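Dropping a column

The county_name column is entirely null, so we want to drop it.

# inplace defaults to False; inplace=True modifies the original DataFrame directly
# Equivalent to ri.drop('county_name', axis=1, inplace=True)
ri.drop('county_name', axis='columns', inplace=True)

# Another way to drop columns that are entirely null: ri.dropna(axis='columns', how='all')
# Check the shape after dropping the column
print(ri.shape)

Output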

(91741, 14)
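As a small illustration of dropping all-null columns without naming them, the sketch below uses dropna on a made-up frame (the data is invented; only the column name county_name is borrowed from the article).

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "county_name": [np.nan, np.nan, np.nan],   # entirely missing
    "driver_age": [25.0, np.nan, 41.0],        # partly missing
    "violation": ["Speeding", "Other", "Speeding"],
})

# how='all' drops only the columns in which every value is missing
cleaned = toy.dropna(axis="columns", how="all")
print(cleaned.shape)
```

Because dropna returns a new DataFrame here, toy itself is left unchanged, which is the same non-inplace default described for drop.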

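Filtering in pandas

Keep only the rows where a Boolean condition is True. Here we keep the rows whose violation value is Speeding and show the first 5.

# Speeding violations for both male and female drivers
ri[ri.violation == 'Speeding'].head()

The value_counts method

# Number of male and female drivers among speeding violations
print(ri[ri.violation == 'Speeding'].driver_gender.value_counts())

# Share of male and female drivers among speeding violations
print(ri[ri.violation == 'Speeding'].driver_gender.value_counts(normalize=True))

Output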

    M    32979
    F    15482
    Name: driver_gender, dtype: int64
    M    0.680527
    F    0.319473
    Name: driver_gender, dtype: float64
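The filter and the two value_counts calls can be traced step by step on a tiny made-up frame (the five rows below are invented for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "violation": ["Speeding", "Speeding", "Other", "Speeding", "Equipment"],
    "driver_gender": ["M", "F", "M", "M", "F"],
})

mask = toy.violation == "Speeding"   # Boolean Series, one value per row
speeders = toy[mask]                 # keeps only the rows where mask is True

counts = speeders.driver_gender.value_counts()                 # raw counts
shares = speeders.driver_gender.value_counts(normalize=True)   # fractions that sum to 1
print(counts)
print(shares)
```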

The groupby method

For each driver_gender, show the proportional distribution of violation values.

ri.groupby('driver_gender').violation.value_counts(normalize=True)

Output

    driver_gender  violation          
    F              Speeding               0.658500
                   Moving violation       0.136277
                   Equipment              0.105780
                   Registration/plates    0.043086
                   Other                  0.029348
                   Seat belt              0.027009
    M              Speeding               0.524350
                   Moving violation       0.207012
                   Equipment              0.135671
                   Other                  0.057668
                   Registration/plates    0.038461
                   Seat belt              0.036839
    Name: violation, dtype: float64
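One point worth checking is that the shares are normalized within each gender, not across the whole table. A minimal sketch on made-up rows:

```python
import pandas as pd

toy = pd.DataFrame({
    "driver_gender": ["F", "F", "F", "F", "M", "M"],
    "violation": ["Speeding", "Speeding", "Speeding", "Other", "Speeding", "Other"],
})

# Result is a Series indexed by (driver_gender, violation)
shares = toy.groupby("driver_gender").violation.value_counts(normalize=True)

# Summing over the first index level shows each group adds up to 1
group_totals = shares.groupby(level=0).sum()
print(shares)
print(group_totals)
```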

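The mean method

Surprisingly, mean can compute a proportion directly: on a Boolean column, True counts as 1 and False as 0, so the mean is the share of True values.

# True means a search was conducted, False means no search
print(ri.search_conducted.value_counts(normalize=True))

print('\n')

# Here mean gives the share of True values
print(ri.search_conducted.mean())

Output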

    False    0.965163
    True     0.034837
    Name: search_conducted, dtype: float64
    0.03483720473942948
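The equivalence between the mean of a Boolean column and the share of True values can be verified by hand on a made-up Series:

```python
import pandas as pd

searched = pd.Series([False, False, True, False])

manual = searched.sum() / len(searched)   # True counts as 1, False as 0
via_mean = searched.mean()                # same quantity in one call
print(manual, via_mean)
```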

The dropna parameter of value_counts defaults to True

# Why is the result an empty Series?
ri[ri.search_conducted == False].search_type.value_counts()

Output

Series([], Name: search_type, dtype: int64)

dropna defaults to True; now change it to False.

# value_counts defaults to dropna=True, which hides missing values. Set it to False.

ri[ri.search_conducted == False].search_type.value_counts(dropna=False)

Output

    NaN    88545
    Name: search_type, dtype: int64

Summary:

Many pandas methods ignore missing values by default; for value_counts this is dropna=True.
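A short sketch of the difference on a made-up Series with missing values (the values are invented; isna is used here only as a cross-check):

```python
import numpy as np
import pandas as pd

s = pd.Series(["Inventory", np.nan, np.nan, "Inventory", np.nan])

default_counts = s.value_counts()            # missing values are left out
all_counts = s.value_counts(dropna=False)    # missing values get their own row
n_missing = s.isna().sum()                   # direct count of missing values
print(default_counts)
print(all_counts)
```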

The str accessor

With the str accessor we can check whether each string contains a given substring.

# Look at the search types
ri.search_type.value_counts(dropna=False)

Output

    NaN                                                         88545
    Incident to Arrest                                           1219
    Probable Cause                                                891
    Inventory                                                     220
    Reasonable Suspicion                                          197
    Protective Frisk                                              161
    Incident to Arrest,Inventory                                  129
    Incident to Arrest,Probable Cause                             106
    Probable Cause,Reasonable Suspicion                            75
    Incident to Arrest,Inventory,Probable Cause                    34
    Probable Cause,Protective Frisk                                33
    Incident to Arrest,Protective Frisk                            33
    Inventory,Probable Cause                                       22
    Incident to Arrest,Reasonable Suspicion                        13
    Inventory,Protective Frisk                                     11
    Incident to Arrest,Inventory,Protective Frisk                  11
    Protective Frisk,Reasonable Suspicion                          11
    Incident to Arrest,Probable Cause,Protective Frisk             10
    Incident to Arrest,Probable Cause,Reasonable Suspicion          6
    Inventory,Reasonable Suspicion                                  4
    Incident to Arrest,Inventory,Reasonable Suspicion               4
    Inventory,Probable Cause,Protective Frisk                       2
    Inventory,Probable Cause,Reasonable Suspicion                   2
    Incident to Arrest,Protective Frisk,Reasonable Suspicion        1
    Probable Cause,Protective Frisk,Reasonable Suspicion            1
    Name: search_type, dtype: int64

Now for the str accessor on a Series

# Use .str on the search_type column to check whether each value contains 'Protective Frisk'
ri['frisk'] = ri.search_type.str.contains('Protective Frisk')
ri.frisk.value_counts(dropna=False)

Output

    NaN      88545
    False     2922
    True       274
    Name: frisk, dtype: int64

Summary:

A Series of strings exposes string methods through .str, which can check whether each value contains a given substring.
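If the missing values in the result are unwanted, str.contains takes an na argument that fills them. The sketch below uses made-up values and an explicit object dtype (an assumption made here so the behavior matches the article's output across pandas versions):

```python
import numpy as np
import pandas as pd

s = pd.Series(
    ["Protective Frisk", "Inventory,Protective Frisk", "Probable Cause", np.nan],
    dtype="object",
)

# Default: rows with a missing search type stay missing in the result
frisk = s.str.contains("Protective Frisk")

# na=False treats missing values as "does not contain"
frisk_filled = s.str.contains("Protective Frisk", na=False)
print(frisk_filled.sum())
```

With na=False the result is purely Boolean, so sum and mean count frisks over all rows instead of only the rows where a search type was recorded.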

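The pd.to_datetime function

Converting date strings to the datetime type lets us use datetime methods for date calculations and manipulation.

# Convert ri.stop_date to a datetime Series and store it in a new stop_datetime column
ri['stop_datetime'] = pd.to_datetime(ri.stop_date)

# Note the dt accessor here, analogous to the str accessor above
# After dt you can use attributes such as year and month
ri.stop_datetime.dt.year.value_counts()

Output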

    2012    10970
    2006    10639
    2007     9476
    2014     9228
    2008     8752
    2015     8599
    2011     8126
    2013     7924
    2009     7908
    2010     7561
    2005     2558
    Name: stop_datetime, dtype: int64

Now try month after dt.

# Count stops by month
ri.stop_datetime.dt.month.value_counts()

Output

    1     8479
    5     7935
    11    7877
    10    7745
    3     7742
    6     7630
    8     7615
    7     7568
    4     7529
    9     7427
    12    7152
    2     7042
    Name: stop_datetime, dtype: int64

Summary:

With the datetime type, we can work with dates and times.
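As a small example of the date arithmetic that becomes possible after conversion, the sketch below parses three made-up date strings (the explicit format string is an assumption about how the dates are written):

```python
import pandas as pd

dates = pd.Series(["2005-01-02", "2005-03-14", "2006-03-30"])
stamps = pd.to_datetime(dates, format="%Y-%m-%d")

years = stamps.dt.year     # integer year per row
months = stamps.dt.month   # integer month per row

# Subtracting two timestamps gives a Timedelta
span = stamps.max() - stamps.min()
print(span.days)
```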

Plotting time series data

# Plot drugs_related_stop by hour of day
ri['stop_time_datetime'] = pd.to_datetime(ri.stop_time)
ri.groupby(ri.stop_time_datetime.dt.hour).drugs_related_stop.sum().plot()

[Figure: line plot of drug-related stops by hour of day]
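To see what is being plotted, it can help to look at the grouped Series before calling plot. A minimal sketch on made-up stop times (the HH:MM format is an assumption about how stop_time is stored):

```python
import pandas as pd

toy = pd.DataFrame({
    "stop_time": ["00:15", "00:45", "13:05", "13:30", "23:59"],
    "drugs_related_stop": [True, False, True, True, False],
})

toy["stop_time_datetime"] = pd.to_datetime(toy.stop_time, format="%H:%M")

# Group by hour of day; summing Booleans counts the True values
per_hour = toy.groupby(toy.stop_time_datetime.dt.hour).drugs_related_stop.sum()
print(per_hour)
```

Calling per_hour.plot() would then draw one point per hour that appears in the data.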

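The map method

The Series map method accepts a function or a dict-like object that holds a mapping. It applies the mapping to every value in a column; in this article we use it for bulk replacement.

# Replace '0-15 Min' with 8
# Replace '16-30 Min' with 23
# Replace '30+ Min' with 45
mapping = {'0-15 Min':8, '16-30 Min':23, '30+ Min':45}

# map does not modify the original data in place, so store the result in a new column
ri['stop_minutes'] = ri.stop_duration.map(mapping)

# Look at 5 random rows
ri['stop_minutes'].sample(5)

Output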

    65959    23.0
    32851     NaN
    2645      8.0
    71446     8.0
    62275    23.0
    Name: stop_minutes, dtype: float64
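One likely reason for a NaN in the mapped column is a value that is missing or has no key in the dictionary. The sketch below shows this on made-up durations, where the last entry is deliberately absent from mapping:

```python
import pandas as pd

durations = pd.Series(["0-15 Min", "16-30 Min", "30+ Min", "2"])
mapping = {"0-15 Min": 8, "16-30 Min": 23, "30+ Min": 45}

# Values without a key in the dict become NaN, which also makes the result float
minutes = durations.map(mapping)

# map also accepts a function; this one flags which values the dict covers
covered = durations.map(lambda d: d in mapping)
print(minutes)
```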

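The agg method

Apply one or more aggregations, such as mean or count, to the data.

In the past agg could only be used on the result of a groupby; now it also works on DataFrame and Series objects.

# Mean and count for every column of ri
# (recent pandas versions may raise on non-numeric columns, so you may need to select numeric columns first)
ri.agg(['mean', 'count'])

# Mean and count for a single column
ri.stop_minutes.agg(['mean', 'count'])

Output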

    mean        11.749288
    count    86406.000000
    Name: stop_minutes, dtype: float64
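The same list of aggregations works both on a plain Series and after a groupby. A minimal sketch on made-up data (only the column names are borrowed from the article):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "violation_raw": ["Speeding", "Speeding", "Other", "Other"],
    "stop_minutes": [8.0, 23.0, np.nan, 45.0],
})

# On a Series: returns a Series indexed by aggregation name
overall = toy.stop_minutes.agg(["mean", "count"])

# After groupby: returns a DataFrame with one column per aggregation
by_group = toy.groupby("violation_raw").stop_minutes.agg(["mean", "count"])
print(overall)
print(by_group)
```

Both mean and count skip the missing value, which is why the overall count is 3 and not 4.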

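The plot method

Plotting

# A line plot is the default
ri.groupby('violation_raw').stop_minutes.mean().plot()

[Figure: line plot of mean stop_minutes by violation_raw]

# Switch to a bar chart
ri.groupby('violation_raw').stop_minutes.mean().plot(kind='bar')

[Figure: bar chart of mean stop_minutes by violation_raw]

# Hard to read, so sort the values and switch to a horizontal bar chart
ri.groupby('violation_raw').stop_minutes.mean().sort_values().plot(kind='barh')

[Figure: sorted horizontal bar chart of mean stop_minutes by violation_raw]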
