Pandas 2.0: Key Advantages and Code Examples!
Original article: Pandas 2.0: A Game-Changer for Data Scientists?
Author: Miriam Santos
Translated by: 陈超
[Screenshot by the author]

The 2.0 release seems to have made quite a splash in the data science community, with many users praising the improvements the new version brings.
Fun fact: did you realize that this release took an impressive 3 years to put together? Now that's what I call "commitment to the community"!
So what does pandas 2.0 bring to the table? Let's dive right in!
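Loading the data: pandas 2.0 lets us take advantage of the new Apache Arrow backend when reading data, by passing engine='pyarrow' and dtype_backend='pyarrow' to pd.read_csv: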
import pandas as pd

%timeit df = pd.read_csv("data/hn.csv")
# 12 s ± 304 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df_arrow = pd.read_csv("data/hn.csv", engine='pyarrow', dtype_backend='pyarrow')
# 329 ms ± 65 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
As you can see, the new backend makes reading the data nearly 35x faster. A few other aspects worth pointing out:
Without the pyarrow backend, each column/feature is stored with its own unique dtype: numeric features are stored as int64 or float64, while string values are stored as object;
With pyarrow, every column uses an Arrow dtype: note the [pyarrow] annotation and the wider range of types: int64, float64, string, timestamp, and double:
df = pd.read_csv("data/hn.csv")
df.info()
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 3885799 entries, 0 to 3885798
# Data columns (total 8 columns):
#  #   Column              Dtype
# ---  ------              -----
#  0   Object ID           int64
#  1   Title               object
#  2   Post Type           object
#  3   Author              object
#  4   Created At          object
#  5   URL                 object
#  6   Points              int64
#  7   Number of Comments  float64
# dtypes: float64(1), int64(2), object(5)
# memory usage: 237.2+ MB
df_arrow = pd.read_csv("data/hn.csv", dtype_backend='pyarrow', engine='pyarrow')
df_arrow.info()
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 3885799 entries, 0 to 3885798
# Data columns (total 8 columns):
#  #   Column              Dtype
# ---  ------              -----
#  0   Object ID           int64[pyarrow]
#  1   Title               string[pyarrow]
#  2   Post Type           string[pyarrow]
#  3   Author              string[pyarrow]
#  4   Created At          timestamp[s][pyarrow]
#  5   URL                 string[pyarrow]
#  6   Points              int64[pyarrow]
#  7   Number of Comments  double[pyarrow]
# dtypes: double[pyarrow](1), int64[pyarrow](2), string[pyarrow](4), timestamp[s][pyarrow](1)
# memory usage: 660.2 MB
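Operations on these Arrow-backed columns are considerably faster too. String operations are a good example, since Arrow has a dedicated string type instead of storing generic Python objects: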
%timeit df["Author"].str.startswith('phy')
# 851 ms ± 7.89 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df_arrow["Author"].str.startswith('phy')
# 27.9 ms ± 538 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
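Arrow aside, pandas 2.0 also lifts an old restriction on indexes: an Index is no longer limited to 64-bit types (int64/uint64/float64) and can now hold any numpy numeric dtype: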
import numpy as np

pd.Index([1, 2, 3])
# Index([1, 2, 3], dtype='int64')
pd.Index([1, 2, 3], dtype=np.int32)
# Index([1, 2, 3], dtype='int32')
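The same works for smaller integer types as well; a quick sketch (the dtype choice here is just illustrative):

pd.Index([1, 2, 3], dtype=np.uint8)
# Index([1, 2, 3], dtype='uint8')

Missing values get easier to handle too. With the default numpy backend, an integer column cannot represent a missing value; watch what happens to the Points column when we introduce one: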
df = pd.read_csv("data/hn.csv")
points = df["Points"]
points.isna().sum()
# 0
points[0:5]
# 0 61
# 1 16
# 2 7
# 3 5
# 4 7
# Name: Points, dtype: int64
# Setting first position to None
points.iloc[0] = None
points[0:5]
# 0 NaN
# 1 16.0
# 2 7.0
# 3 5.0
# 4 7.0
# Name: Points, dtype: float64
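A single missing value was enough to silently upcast the whole column from int64 to float64. With the new 'numpy_nullable' dtype backend, the column keeps its nullable Int64 dtype and missing values show up as <NA> instead: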
df_null = pd.read_csv("data/hn.csv", dtype_backend='numpy_nullable')
points_null = df_null["Points"]
points_null.isna().sum()
# 0
points_null[0:5]
# 0 61
# 1 16
# 2 7
# 3 5
# 4 7
# Name: Points, dtype: Int64
points_null.iloc[0] = None
points_null[0:5]
# 0    <NA>
# 1      16
# 2       7
# 3       5
# 4       7
# Name: Points, dtype: Int64
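Copy-on-Write: pandas 2.0 also ships a copy-on-write mode. Without it, chained assignment only triggers the infamous SettingWithCopyWarning while still mutating the original DataFrame: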
pd.options.mode.copy_on_write = False  # copy-on-write disabled (the default behaviour in pandas 2.0)
df = pd.read_csv("data/hn.csv")
df.head()
# Throws a 'SettingWithCopy' warning
# SettingWithCopyWarning:
# A value is trying to be set on a copy of a slice from a DataFrame
df["Points"][0] = 2000
df.head() # <---- df changes
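With copy-on-write enabled, the same chained assignment raises an explicit error instead, and the original DataFrame is guaranteed to stay untouched: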
pd.options.mode.copy_on_write = True
df = pd.read_csv("data/hn.csv")
df.head()
# Throws a ChainedAssignmentError
df["Points"][0] = 2000
# ChainedAssignmentError: A value is trying to be set on a copy of a DataFrame
# or Series through chained assignment. When using the Copy-on-Write mode,
# such chained assignment never works to update the original DataFrame
# or Series, because the intermediate object on which we are setting
# values always behaves as a copy.
# Try using '.loc[row_indexer, col_indexer] = value' instead,
# to perform the assignment in a single step.
df.head() # <---- df does not change
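As the error message suggests, the single-step assignment works as intended under copy-on-write; a quick sketch:

# Update the original frame in a single indexing step instead of a chained one:
df.loc[0, "Points"] = 2000
df.head()  # <---- df changes as expected

Optional dependencies: installation also gets more flexible, since pandas 2.0 introduces optional dependency groups that let you pull in only the extras you actually need, e.g.: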
pip install "pandas[postgresql, aws, spss]>=2.0.0"
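Tools built on top of pandas benefit from the new release as well. As an example, here is ydata-profiling running the same profiling workload on pandas 1.5.3 and pandas 2.0.2: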
import pandas as pd
from ydata_profiling import ProfileReport
# Using pandas 1.5.3 and ydata-profiling 4.2.0
%timeit df = pd.read_csv("data/hn.csv")
# 10.1 s ± 215 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit profile = ProfileReport(df, title="Pandas Profiling Report")
# 4.85 ms ± 77.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit profile.to_file("report.html")
# 18.5 ms ± 2.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# Using pandas 2.0.2 and ydata-profiling 4.3.1
%timeit df_arrow = pd.read_csv("data/hn.csv", engine='pyarrow')
# 3.27 s ± 38.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit profile_arrow = ProfileReport(df_arrow, title="Pandas Profiling Report")
# 5.24 ms ± 448 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit profile_arrow.to_file("report.html")
# 19 ms ± 1.87 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
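Building the profile and writing the report take essentially the same time on both versions, but reading the data is roughly 3x faster with the pyarrow engine.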
To sum up, these are the main advantages pandas 2.0 brings to the table:

- Performance optimization: with the introduction of the Apache Arrow backend, support for more numpy dtypes in indexes, and copy-on-write mode;
- Flexibility and customization: users can control optional dependencies and take advantage of Apache Arrow data types (including nullability from the start!);
- Interoperability: perhaps a less "acclaimed" advantage of the new release, but one with a huge impact. Since Arrow is language-independent, in-memory data can be transferred not only between programs built on Python, but also between R, Spark, and other programs that use the Apache Arrow backend!
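To make the interoperability point concrete, here is a minimal sketch (the column name and values are illustrative) of round-tripping Arrow-backed pandas data through a pyarrow Table, which any Arrow-aware tool could then consume:

import pandas as pd
import pyarrow as pa

# A small Arrow-backed DataFrame (pandas 2.0+).
df = pd.DataFrame({"Points": [61, 16, 7]}).convert_dtypes(dtype_backend="pyarrow")

# pandas -> Arrow: cheap, since the data already lives in Arrow memory.
table = pa.Table.from_pandas(df)

# Arrow -> pandas, keeping Arrow dtypes instead of falling back to numpy.
roundtrip = table.to_pandas(types_mapper=pd.ArrowDtype)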