
Software Applications | Advanced Data Wrangling with Pandas (Part 1)

向前走别回头 数据Seminar 2021-06-02

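Copyright notice: this article is a compilation of original posts by CSDN blogger 「向前走别回头」, published under the CC 4.0 BY-SA license. Links to the original posts and this notice are attached.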

Original links:

https://blog.csdn.net/weixin_39778570/article/details/81105809

https://blog.csdn.net/weixin_39778570/article/details/81106289

https://blog.csdn.net/weixin_39778570/article/details/81106642

https://blog.csdn.net/weixin_39778570/article/details/81107033

https://blog.csdn.net/weixin_39778570/article/details/81107451




Simple Computation


import numpy as np
import pandas as pd
from pandas import Series, DataFrame

# Series arithmetic: addition, subtraction, and multiplication all work; addition is shown here
s1 = Series([1,2,3], index=['B','C','D'])
s2 = Series([4,5,6,7], index=['B','C','D','E'])
# A label missing from either operand produces NaN
s1 + s2
Out[10]:
B    5.0
C    7.0
D    9.0
E    NaN
dtype: float64

# DataFrame arithmetic: addition, subtraction, and multiplication work the same way
df1 = DataFrame(np.arange(4).reshape(2,2), index=['A','B'], columns=['BJ','GZ'])
df1
Out[13]:
   BJ  GZ
A   0   1
B   2   3

df2 = DataFrame(np.arange(9).reshape(3,3), index=['A','B','C'], columns=['BJ', 'GZ', 'SH'])
df2
Out[15]:
   BJ  GZ  SH
A   0   1   2
B   3   4   5
C   6   7   8

df1+df2
Out[16]:
    BJ   GZ  SH
A  0.0  2.0 NaN
B  5.0  7.0 NaN
C  NaN  NaN NaN
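The NaN cells above appear wherever a row or column label exists in only one of the two frames. As a minimal sketch of one way to avoid that, the `add` method accepts a `fill_value` that stands in for the missing side before the addition. The frames are the same as above; the variable name `filled` is only for illustration.

```python
import numpy as np
from pandas import DataFrame

df1 = DataFrame(np.arange(4).reshape(2, 2), index=['A', 'B'], columns=['BJ', 'GZ'])
df2 = DataFrame(np.arange(9).reshape(3, 3), index=['A', 'B', 'C'], columns=['BJ', 'GZ', 'SH'])

# Treat a cell that is missing from one frame as 0 instead of propagating NaN.
# A cell missing from BOTH frames would still come out as NaN.
filled = df1.add(df2, fill_value=0)
print(filled)
```

`sub` and `mul` take the same `fill_value` argument, so the same idea applies to subtraction and multiplication.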
# DataFrame functions
df3 = DataFrame([[1,2,3],[4,5,np.nan],[7,8,9]], index=['A','B','C'], columns=['c1','c2','c3'])
df3
Out[19]:
   c1  c2   c3
A   1   2  3.0
B   4   5  NaN
C   7   8  9.0

# Column sums
df3.sum()
Out[20]:
c1    12.0
c2    15.0
c3    12.0
dtype: float64

# Row sums
df3.sum(axis=1)
Out[21]:
A     6.0
B     9.0
C    24.0
dtype: float64

# Maximum and minimum
df3.max()
Out[22]:
c1    7.0
c2    8.0
c3    9.0
dtype: float64

df3.max(axis=1)
Out[23]:
A    3.0
B    5.0
C    9.0
dtype: float64

df3.min()
Out[24]:
c1    1.0
c2    2.0
c3    3.0
dtype: float64

df3.min(axis=1)
Out[25]:
A    1.0
B    4.0
C    7.0
dtype: float64

# describe: summary statistics per column
df3.describe()
Out[26]:
        c1   c2        c3
count  3.0  3.0  2.000000
mean   4.0  5.0  6.000000
std    3.0  3.0  4.242641
min    1.0  2.0  3.000000
25%    2.5  3.5  4.500000
50%    4.0  5.0  6.000000
75%    5.5  6.5  7.500000
max    7.0  8.0  9.000000
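Note that column c3 sums to 12.0 and has a count of 2: the reductions above skip NaN by default. The sketch below makes that behavior explicit with the `skipna` parameter. The variable names are illustrative only.

```python
import numpy as np
from pandas import DataFrame

df3 = DataFrame([[1, 2, 3], [4, 5, np.nan], [7, 8, 9]],
                index=['A', 'B', 'C'], columns=['c1', 'c2', 'c3'])

# Default behavior: NaN is skipped, so c3 sums to 3 + 9 = 12
default_sum = df3.sum()

# skipna=False: a single NaN makes the whole column's result NaN
strict_sum = df3.sum(skipna=False)

# count() reports how many non-NaN values each column holds
non_null = df3.count()

print(default_sum, strict_sum, non_null, sep='\n')
```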





Sorting Series and DataFrame


import numpy as np
import pandas as pd
from pandas import Series, DataFrame

# Sorting a Series
s1 = Series(np.random.randn(10))
s1
Out[5]:
0   -1.293472
1    0.017588
2   -0.654741
3    0.495720
4   -1.626396
5   -0.651238
6    0.776535
7   -0.746762
8   -1.358951
9    0.247930
dtype: float64

# Sort by value
s2 = s1.sort_values()
s2
Out[10]:
4   -1.626396
8   -1.358951
0   -1.293472
7   -0.746762
2   -0.654741
5   -0.651238
1    0.017588
9    0.247930
3    0.495720
6    0.776535
dtype: float64

# Sort in descending order
s2 = s1.sort_values(ascending=False)
s2
Out[13]:
6    0.776535
3    0.495720
9    0.247930
1    0.017588
5   -0.651238
2   -0.654741
7   -0.746762
0   -1.293472
8   -1.358951
4   -1.626396
dtype: float64

# Sort by index; for descending order, set ascending=False here as well
s2.sort_index()
Out[14]:
0   -1.293472
1    0.017588
2   -0.654741
3    0.495720
4   -1.626396
5   -0.651238
6    0.776535
7   -0.746762
8   -1.358951
9    0.247930
dtype: float64

# Sorting a DataFrame
df1 = DataFrame(np.random.randn(40).reshape(8,5), columns=['A','B','C','D','E'])
df1
Out[17]:
          A         B         C         D         E
0 -0.364749 -2.234539  0.560983 -0.205768 -0.685511
1  1.500545  0.669751 -0.810748 -1.499093 -0.369835
2  0.894716 -0.282788  0.293292  1.260618 -0.107138
3 -0.262395  1.970482  1.268629 -0.626314 -0.726878
4 -1.756154  0.471681 -0.204594 -0.978793 -2.082535
5  0.476344  0.588654 -0.303897  1.863167 -1.466623
6 -1.704993 -0.136662 -0.034966  0.159871 -0.848923
7  1.117809  0.548713 -1.713026  1.153380 -1.529988

# Sort a single column as a Series
df1['A'].sort_values()
Out[18]:
4   -1.756154
6   -1.704993
0   -0.364749
3   -0.262395
5    0.476344
2    0.894716
7    1.117809
1    1.500545
Name: A, dtype: float64

# Sort the whole DataFrame by one column
df1.sort_values('A')
Out[19]:
          A         B         C         D         E
4 -1.756154  0.471681 -0.204594 -0.978793 -2.082535
6 -1.704993 -0.136662 -0.034966  0.159871 -0.848923
0 -0.364749 -2.234539  0.560983 -0.205768 -0.685511
3 -0.262395  1.970482  1.268629 -0.626314 -0.726878
5  0.476344  0.588654 -0.303897  1.863167 -1.466623
2  0.894716 -0.282788  0.293292  1.260618 -0.107138
7  1.117809  0.548713 -1.713026  1.153380 -1.529988
1  1.500545  0.669751 -0.810748 -1.499093 -0.369835

# Sort in descending order
df2 = df1.sort_values('A', ascending=False)
df2
Out[22]:
          A         B         C         D         E
1  1.500545  0.669751 -0.810748 -1.499093 -0.369835
7  1.117809  0.548713 -1.713026  1.153380 -1.529988
2  0.894716 -0.282788  0.293292  1.260618 -0.107138
5  0.476344  0.588654 -0.303897  1.863167 -1.466623
3 -0.262395  1.970482  1.268629 -0.626314 -0.726878
0 -0.364749 -2.234539  0.560983 -0.205768 -0.685511
6 -1.704993 -0.136662 -0.034966  0.159871 -0.848923
4 -1.756154  0.471681 -0.204594 -0.978793 -2.082535

# Sort by index
df2.sort_index()
Out[23]:
          A         B         C         D         E
0 -0.364749 -2.234539  0.560983 -0.205768 -0.685511
1  1.500545  0.669751 -0.810748 -1.499093 -0.369835
2  0.894716 -0.282788  0.293292  1.260618 -0.107138
3 -0.262395  1.970482  1.268629 -0.626314 -0.726878
4 -1.756154  0.471681 -0.204594 -0.978793 -2.082535
5  0.476344  0.588654 -0.303897  1.863167 -1.466623
6 -1.704993 -0.136662 -0.034966  0.159871 -0.848923
7  1.117809  0.548713 -1.713026  1.153380 -1.529988
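The session above sorts by a single column. When that column contains ties, `sort_values` also accepts a list of columns with a matching list of directions. The sketch below uses a small made-up frame (not the random data above) so that the result is reproducible; the names `by_two` and `renumbered` are illustrative.

```python
from pandas import DataFrame

# Hypothetical data, chosen so that column 'A' contains ties
df = DataFrame({'A': [2, 1, 2, 1], 'B': [0.5, 0.3, 0.1, 0.9]})

# Sort by 'A' ascending, then break ties by 'B' descending
by_two = df.sort_values(['A', 'B'], ascending=[True, False])

# The original row labels travel with the rows; reset_index(drop=True)
# discards them and renumbers from 0
renumbered = by_two.reset_index(drop=True)

print(by_two)
print(renumbered)
```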
# A small example: sort movie_metadata.csv by IMDb score
# Passing the path straight to read_csv avoids leaving a file handle open
movie = pd.read_csv('movie_metadata.csv')
imdb = movie[["movie_title", "director_name", "imdb_score"]].sort_values("imdb_score", ascending=False)
imdb.to_csv('imdb.csv')
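The movie example needs movie_metadata.csv on disk. For readers without that file, here is a self-contained sketch of the same select, sort, and write steps. The CSV text is invented for illustration (only the three column names come from the example above; the `budget` column and all rows are assumptions), and in-memory buffers stand in for the input and output files.

```python
import io
import pandas as pd

# Hypothetical stand-in for movie_metadata.csv; every row here is made up
csv_text = """movie_title,director_name,imdb_score,budget
Film A,Director X,7.1,100
Film B,Director Y,8.4,200
Film C,Director Z,6.3,300
"""

movie = pd.read_csv(io.StringIO(csv_text))

# Keep three columns, then sort from highest to lowest score
imdb = (movie[['movie_title', 'director_name', 'imdb_score']]
        .sort_values('imdb_score', ascending=False))

# index=False keeps the old row labels out of the written CSV
out = io.StringIO()
imdb.to_csv(out, index=False)
print(out.getvalue())
```

Without `index=False`, `to_csv` writes the row labels as an extra unnamed first column, which is what the original `imdb.to_csv('imdb.csv')` call does.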





Renaming a DataFrame


import pandas as pd
import numpy as np
from pandas import Series, DataFrame

df1 = DataFrame(np.arange(9).reshape(3,3), index=['BJ','SH','GZ'], columns=['A','B','C'])
df1
Out[5]:
    A  B  C
BJ  0  1  2
SH  3  4  5
GZ  6  7  8

# Method 1: assign a new Series directly to df1.index
df1.index
Out[6]: Index(['BJ', 'SH', 'GZ'], dtype='object')

# Method 2: use map
df1.index.map(str.upper)
Out[10]: Index(['BJ', 'SH', 'GZ'], dtype='object')

df1.index = df1.index.map(str.upper)
df1
Out[12]:
    A  B  C
BJ  0  1  2
SH  3  4  5
GZ  6  7  8

# Method 3: use rename
df1.rename(index=str.lower)
Out[13]:
    A  B  C
bj  0  1  2
sh  3  4  5
gz  6  7  8

# rename returns a new DataFrame; df1 itself is unchanged
df1
Out[14]:
    A  B  C
BJ  0  1  2
SH  3  4  5
GZ  6  7  8

# Rename rows and columns at the same time
df1 = df1.rename(index=str.lower, columns=str.lower)
df1
Out[18]:
    a  b  c
bj  0  1  2
sh  3  4  5
gz  6  7  8

# A dict renames only the labels it lists
df1.rename(index={'bj':'beijing'}, columns={'a':'A'})
Out[19]:
         A  b  c
beijing  0  1  2
sh       3  4  5
gz       6  7  8

# A note on map: ways to get from list1 to list2
list1 = [1,2,3,4]
list2 = ['1','2','3','4']
# List comprehension
[str(x) for x in list1]
Out[22]: ['1', '2', '3', '4']

# Passing a custom function
def test_map(x):
    return x+'_ABC'

df1.index.map(test_map)
Out[29]: Index(['bj_ABC', 'sh_ABC', 'gz_ABC'], dtype='object')

df1.rename(index=test_map)
Out[30]:
        a  b  c
bj_ABC  0  1  2
sh_ABC  3  4  5
gz_ABC  6  7  8
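The note on map above shows only the list-comprehension route from list1 to list2. As a small sketch, the built-in `map` gives the same result, and `rename` accepts a lambda just as it accepts a named function such as `test_map`. The variable names below are illustrative only.

```python
import numpy as np
from pandas import DataFrame

list1 = [1, 2, 3, 4]

# Two equivalent ways to turn every element into a string
via_comprehension = [str(x) for x in list1]
via_map = list(map(str, list1))  # map is lazy in Python 3, so wrap it in list()

df = DataFrame(np.arange(9).reshape(3, 3),
               index=['bj', 'sh', 'gz'], columns=['a', 'b', 'c'])

# A lambda works wherever a named function does
renamed = df.rename(index=lambda label: label + '_ABC')

print(via_map)
print(renamed)
print(df.index)  # the original labels are untouched
```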





Merging DataFrames


import pandas as pd
import numpy as np
from pandas import Series, DataFrame

df1 = DataFrame({'key':['X','Y','Z'], 'data_set_1':[1,2,3]})
df1
Out[5]:
   data_set_1 key
0           1   X
1           2   Y
2           3   Z

df2 = DataFrame({'key':['A','B','C'], 'data_set_2':[4,5,6]})
df2
Out[7]:
   data_set_2 key
0           4   A
1           5   B
2           6   C

# The key columns have no values in common
pd.merge(df1,df2)
Out[8]:
Empty DataFrame
Columns: [data_set_1, key, data_set_2]
Index: []

# Default merge
df2 = DataFrame({'key':['X','B','C'], 'data_set_2':[4,5,6]})
pd.merge(df1,df2)
Out[10]:
   data_set_1 key  data_set_2
0           1   X           4

df1 = DataFrame({'key':['X','Y','Z','X'], 'data_set_1':[1,2,3,4]})
pd.merge(df1,df2)
Out[12]:
   data_set_1 key  data_set_2
0           1   X           4
1           4   X           4

# on names the column to join on. By default merge joins on every column name
# the two frames share; naming a column that is not present in both raises an error.
# When the frames share two or more column names, pass on to choose which to use.
pd.merge(df1,df2,on='key')
Out[13]:
   data_set_1 key  data_set_2
0           1   X           4
1           4   X           4
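When the join columns carry different names in the two frames, `on` cannot be used, but `left_on` and `right_on` can. The sketch below also shows the error raised when `on` names a column that one frame lacks. The column name `code` and the variable names are hypothetical, chosen only for this illustration.

```python
import pandas as pd
from pandas import DataFrame

left = DataFrame({'key': ['X', 'Y', 'Z'], 'data_set_1': [1, 2, 3]})
# 'code' is a hypothetical name for the same kind of key column
right = DataFrame({'code': ['X', 'B', 'C'], 'data_set_2': [4, 5, 6]})

# Join left.key against right.code; both key columns are kept in the result
merged = pd.merge(left, right, left_on='key', right_on='code')
print(merged)

# on='key' fails here because right has no column called 'key'
try:
    pd.merge(left, right, on='key')
    raised = False
except KeyError:
    raised = True
print('KeyError raised:', raised)
```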
# Join type: how='inner' (the default), 'left', 'right', or 'outer'
pd.merge(df1,df2,on='key',how='inner')
Out[15]:
   data_set_1 key  data_set_2
0           1   X           4
1           4   X           4

pd.merge(df1,df2,on='key',how='left')
Out[16]:
   data_set_1 key  data_set_2
0           1   X         4.0
1           2   Y         NaN
2           3   Z         NaN
3           4   X         4.0

pd.merge(df1,df2,on='key',how='right')
Out[17]:
   data_set_1 key  data_set_2
0         1.0   X           4
1         4.0   X           4
2         NaN   B           5
3         NaN   C           6

pd.merge(df1,df2,on='key',how='outer')
Out[18]:
   data_set_1 key  data_set_2
0         1.0   X         4.0
1         4.0   X         4.0
2         2.0   Y         NaN
3         3.0   Z         NaN
4         NaN   B         5.0
5         NaN   C         6.0
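In the outer join, the NaN cells show which side a row came from, but only indirectly. As a sketch, passing `indicator=True` adds a `_merge` column that states it outright. The frames match the ones above; `outer` and `counts` are illustrative names. Row order in the result can differ between pandas versions, so the example counts rows rather than relying on position.

```python
import pandas as pd
from pandas import DataFrame

df1 = DataFrame({'key': ['X', 'Y', 'Z', 'X'], 'data_set_1': [1, 2, 3, 4]})
df2 = DataFrame({'key': ['X', 'B', 'C'], 'data_set_2': [4, 5, 6]})

# _merge is 'both', 'left_only', or 'right_only' for each output row
outer = pd.merge(df1, df2, on='key', how='outer', indicator=True)
counts = outer['_merge'].value_counts()

print(outer)
print(counts)
```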





Concatenate and Combine


import pandas as pd
import numpy as np
from pandas import Series, DataFrame

# Concatenate on arrays
arr1 = np.arange(9).reshape(3,3)
arr1
Out[6]:
array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])
arr2 = np.arange(9).reshape(3,3)
arr2
Out[9]:
array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])
# Concatenate; the axis parameter sets the direction, and the default 0 stacks vertically
np.concatenate([arr1,arr2])
Out[10]:
array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8],
       [0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])
np.concatenate([arr1,arr2], axis=1)
Out[11]:
array([[0, 1, 2, 0, 1, 2],
       [3, 4, 5, 3, 4, 5],
       [6, 7, 8, 6, 7, 8]])

# Concatenate on Series
s1 = Series([1,2,3], index=['X','Y','Z'])
S2 = Series([4,5], index=['A','B'])
S2
Out[15]:
A    4
B    5
dtype: int64

pd.concat([s1,S2])
Out[16]:
X    1
Y    2
Z    3
A    4
B    5
dtype: int64

# Missing values are filled with NaN
pd.concat([s1,S2], axis=1)
Out[17]:
     0    1
A  NaN  4.0
B  NaN  5.0
X  1.0  NaN
Y  2.0  NaN
Z  3.0  NaN

# Concatenate on DataFrames
df1 = DataFrame(np.random.rand(4,3), columns=['X','Y','Z'])
df1
Out[20]:
          X         Y         Z
0  0.093816  0.087879  0.539844
1  0.087522  0.012905  0.446522
2  0.269924  0.213385  0.900469
3  0.004105  0.437186  0.817560

df2 = DataFrame(np.random.rand(3,3), columns=['X','Y','A'])
df2
Out[22]:
          X         Y         A
0  0.938714  0.122255  0.189125
1  0.592859  0.459991  0.596478
2  0.337845  0.977800  0.401993

pd.concat([df1,df2])
Out[24]:
          A         X         Y         Z
0       NaN  0.093816  0.087879  0.539844
1       NaN  0.087522  0.012905  0.446522
2       NaN  0.269924  0.213385  0.900469
3       NaN  0.004105  0.437186  0.817560
0  0.189125  0.938714  0.122255       NaN
1  0.596478  0.592859  0.459991       NaN
2  0.401993  0.337845  0.977800       NaN

pd.concat([df1,df2],axis=1)
Out[25]:
          X         Y         Z         X         Y         A
0  0.093816  0.087879  0.539844  0.938714  0.122255  0.189125
1  0.087522  0.012905  0.446522  0.592859  0.459991  0.596478
2  0.269924  0.213385  0.900469  0.337845  0.977800  0.401993
3  0.004105  0.437186  0.817560       NaN       NaN       NaN
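In the row-wise concat above, the row labels 0, 1, 2 repeat and the non-shared columns are padded with NaN. The sketch below shows two `pd.concat` options that address each point. It uses small fixed frames instead of random data so the result is reproducible; the values and variable names are made up for illustration.

```python
import pandas as pd
from pandas import DataFrame

# Hypothetical fixed data with the same column layout as above
df1 = DataFrame({'X': [1, 2], 'Y': [3, 4], 'Z': [5, 6]})
df2 = DataFrame({'X': [7], 'Y': [8], 'A': [9]})

# ignore_index=True replaces the repeated row labels with 0..n-1
stacked = pd.concat([df1, df2], ignore_index=True)

# join='inner' keeps only the columns present in every input,
# so no NaN padding is needed
common = pd.concat([df1, df2], join='inner', ignore_index=True)

print(stacked)
print(common)
```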
# Combine: the second object fills the gaps in the first
# Series
s1 = Series([2,np.nan,4,np.nan], index=['A','B','C','D'])
s1
Out[29]:
A    2.0
B    NaN
C    4.0
D    NaN
dtype: float64

s2 = Series([1,2,3,4], index=['A','B','C','D'])
s2
Out[31]:
A    1
B    2
C    3
D    4
dtype: int64

# Values missing from s1 are filled in from s2
s1.combine_first(s2)
Out[32]:
A    2.0
B    2.0
C    4.0
D    4.0
dtype: float64

# DataFrame: works the same way as Series
df1 = DataFrame({'X':[1,np.nan,3,np.nan], 'Y':[5,np.nan,7,np.nan], 'Z':[9,np.nan,11,np.nan]})
df1
Out[36]:
     X    Y     Z
0  1.0  5.0   9.0
1  NaN  NaN   NaN
2  3.0  7.0  11.0
3  NaN  NaN   NaN

df2 = DataFrame({'Z':[np.nan,10,np.nan,12], 'A':[1,2,3,4]})
df2
Out[38]:
   A     Z
0  1   NaN
1  2  10.0
2  3   NaN
3  4  12.0

df1.combine_first(df2)
Out[39]:
     A    X    Y     Z
0  1.0  1.0  5.0   9.0
1  2.0  NaN  NaN  10.0
2  3.0  3.0  7.0  11.0
3  4.0  NaN  NaN  12.0
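The DataFrame result above gained column A, which exists only in df2. That is the main difference between `combine_first` and `fillna`: `combine_first` takes the union of the labels, while `fillna` keeps only the caller's labels. A minimal Series sketch with made-up values (the names `combined` and `filled` are illustrative):

```python
import numpy as np
from pandas import Series

s1 = Series([2, np.nan], index=['A', 'B'])
s2 = Series([10, 20, 30], index=['A', 'B', 'C'])

# combine_first: fill s1's NaN from s2 and also bring in label 'C'
combined = s1.combine_first(s2)

# fillna with a Series: fill s1's NaN from s2, but keep only s1's labels
filled = s1.fillna(s2)

print(combined)
print(filled)
```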










Source: CSDN | Author: 向前走别回头 | Recommended by: 青酱 | Layout editor: 青酱




