英文 | 性能大比拼:list.sort()与sorted()
花下猫语:list.sort() 与 sorted(list) 是常用的列表排序方法,但是,你是否考虑过在占用内存与排序速度上,两者有啥优劣么?今天分享的文章对此做了详尽的考察。
一句话总结:list.sort() 速度更快,更省空间,但它是原地修改,因此会破坏原始数据;list.sort() 只能用于列表排序,而 sorted() 能给任意可迭代对象排序。
那么,list.sort() 比 sorted() 快多少,又省多少空间呢?为什么它会更快更省呢?想知道答案的话,请阅读下文。
作者:Florian Dahlitz
原文:medium 网站[1]
Introduction
Recently, I came across the question, which method to sort a list is more efficient: Using Python’s built-in sorted
function or relying on the list.sort
method. To answer this question I started a little investigation described in this article. You can find the repository I’m referring to on GitHub[2].
The starting point is a Python list containing 1.000.000 random numbers (integers) built using the random
module:
import random
arr = [random.randint(0, 50) for r in range(1_000_000)]
The generated numbers are in the range from 0 (inclusive) to 50 (inclusive).
Memory Consumption
Let’s have a look at the memory consumption of both functions. Therefore, we are using the builtin resource
module to track the maximum memory usage. As the resource module enables us to track the memory usage of a single thread, we are running the sorting of our list in a separate thread. You can use the FunctionSniffingClass
included in the repository[3] to do so.
Let’s have a closer look at our Python script:
import random
import resource
import sys
import time
from sniffing import FunctionSniffingClass
def list_sort(arr):
return arr.sort()
def sorted_builtin(arr):
return sorted(arr)
if __name__ == "__main__":
if len(sys.argv) != 2:
sys.exit("Please run: python (sort|sorted)")
elif sys.argv[1] == "sorted":
func = sorted_builtin
elif sys.argv[1] == "sort":
func = list_sort
else:
sys.exit("Please run: python (sort|sorted)")
# Lib Testing Code
arr = [random.randint(0, 50) for r in range(1_000_000)]
mythread = FunctionSniffingClass(func, arr)
mythread.start()
used_mem = 0
max_memory = 0
memory_usage_refresh = .005 # Seconds
while(1):
time.sleep(memory_usage_refresh)
used_mem = (resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)
if used_mem > max_memory:
max_memory = used_mem
# Check to see if the function call is complete
if mythread.isShutdown():
# Uncomment if yu want to see the results
# print(mythread.results)
break;
print("\nMAX Memory Usage:", round(max_memory / (2 ** 20), 3), "MB")
We create two wrapper functions for the built-in ones to be able to pass them as arguments to the FunctionSniffingClass
. We could pass the built-in sorted
function directly to the FunctionSniffingClass
, but we want the same chances for both built-ins. Furthermore, some simple command-line argument parsing is implemented to be able to use it as simple as possible from the command-line.
Curious how both built-ins perform? Let’s see!
$ python memory_measurement/main.py sort
Calling the Target Function...
Function Call Complete
MAX Memory Usage: 23.371 MB
$ python memory_measurement/main.py sorted
Calling the Target Function...
Function Call Complete
MAX Memory Usage: 30.879 MB
As you can see, the sorted
function consumed around 32% more memory as the list.sort
method. This was predictable as the latter on modifies the list in-place, whereas the first ones is always creating a separate list.
Speed
To be able to time the execution time of both wrapper functions, we make use of the third-party boxx[4] module. The following gist shows you how we can make use of its timeit
function to time the execution time of both functions.
import random
from boxx import timeit
def list_sort(arr):
return arr.sort()
def sorted_builtin(arr):
return sorted(arr)
def main():
arr = [random.randint(0, 50) for r in range(1_000_000)]
with timeit(name="sorted(list)"):
sorted_builtin(arr)
with timeit(name="list.sort()"):
list_sort(arr)
if __name__ == "__main__":
main()
Note: Be sure to run the sorted_builtin
function first as the list.sort
method sorts the list just in-place, so the sorted
function wouldn’t have to sort anything!
Running the above snippet prints the following output:
$ python main.py
"sorted(list)" spend time: 0.1104379
"list.sort()" spend time: 0.0956471
As you can see, the list.sort
method is slightly faster than the sorted
function. Why is this the case? Let’s disassemble both functions and see, whether we can conclude the answer based on the bytecode:
Both functions bytecode is pretty much the same. The only difference is, that the list_sort
function first loads the list, loads the method (sort) followed by calling the method on the list without any arguments, whereas the the sorted_builtin
function first loads the built-in sorted
function, followed by loading the list and calling the loaded function with the list as argument.
Additionally, both use the same sorting algorithm: Timsort[5]. So if both are using the same sorting algorithm and the bytecode of both is pretty much the same, why are the timing results different?
My guess is, that as list.sort
can work with a known size, and swap elements within that size, whereas sorted
has to work with an unknown size. Therefore, the new list created by sorted
needs to be resized if not enough memory is left when appending a new element. And this takes time!
Having a look at the CPython source code, we find the following comment about resizing list objects:
The growth pattern is: 0, 4, 8, 16, 25, 35, 46, 58, 72, 88, …
- CPython: Objects/listobject.c
If we bring back to mind, that we are dealing with a list of size 1.000.000, we can see: that’s a lot of resizing! Unfortunately, this is the best answer we get, when asking why list.sort
is 13% faster than sorted
.
Unfortunately my guess is wrong. As Nick Coghlan,[6] one of the CPython core developer, stated on Twitter, the size of the resulting list is known. Basically, the following is happening:
new_array = arr.copy()
arr.sort()
This theory is not correct - sorted knows how big the input is in this case, so it can preallocate the output. What it can't avoid is the extra data copying required to make a whole new list - if you measure "arr2 = arr.copy(); arr2.sort()" it should be comparable to sorted(arr).
----@Nick Coghlan
However, he also states, that it’s not really obvious if you don’t know that it’s there and look explicitly for in the implementation[7].
The resizing idea was a decent guess though - even when you go read the source code, the preallocation trick is buried way down in the list.extend implementation, and hence is easy to miss if you don't already know it is there :)
----@Nick Coghlan
This implementation results in the execution time difference as creating a copy of the list takes some time.
Additional Remarks
Before wrapping up this article, let’s have a look at what the official Python documentation says about this topic.
You can also use the
list.sort()
method. It modifies the list in-place (and returnsNone
to avoid confusion). Usually it’s less convenient thansorted()
- but if you don’t need the original list, it’s slightly more efficient. — Sorting HOW TO[8]
As you can see, the official documentation states, what we have already proven: list.sort
is slightly more efficient. Furthermore, it tells us, that sorted
is usually more convenient.
Another question that my arise is, whether both sorting techniques are stable. Fortunately, the docs have an answer to that:
Sorts are guaranteed to be stable[9]. That means that when multiple records have the same key, their original order is preserved. — Sorting HOW TO[10]
This is also true, when using the reverse parameter or applying the reversed
function twice.
Conclusion
The previous investigations showed us, that list.sort
is slightly faster than sorted
and consumes around 24% less memory. However, keep in mind that list.sort
is only implemented for lists, whereas sorted
accepts any iterable. Furthermore, if you use list.sort
, you will lose your original list.
I hope this article revealed you more insights into the Python programming language. Stay curious and keep coding!
References
[1]
: https://medium.com/@DahlitzF/list-sort-vs-sorted-list-aab92c00e17[2]
GitHub: https://github.com/DahlitzFlorian/list-sort-vs-sorted-list[3]
repository: https://github.com/DahlitzFlorian/list-sort-vs-sorted-list[4]
boxx: https://github.com/DIYer22/boxx[5]
Timsort: https://en.wikipedia.org/wiki/Timsort[6]
Nick Coghlan,: https://twitter.com/ncoghlan_dev[7]
implementation: https://github.com/python/cpython/blob/2fb2bc81c3f40d73945c6102569495140e1182c7/Python/bltinmodule.c#L2238[8]
Sorting HOW TO: https://docs.python.org/3/howto/sorting.html#sorting-basics[9]
stable: https://en.wikipedia.org/wiki/Sorting_algorithm#Stability[10]
Sorting HOW TO: https://docs.python.org/3/howto/sorting.html#sort-stability-and-complex-sorts
图片来源:pexels
友情提示: 本号内有限时抽奖活动,送出 Python 书籍 15 本,书目有《Python数据分析与挖掘实战》、《Python语言程序设计》《自学Python编程基础、科学计算及数据分析》《实用机器学习》、《Python程序员面试算法宝典》,活动结束时间为本月 20 日 18 点 ,赶快来抽奖啦!详情戳--> 抽奖送书
随机推荐,偶遇精彩
1
2
3
4
一只伪喵星来客
一个有趣又有用的学习分享平台
专注Python技术、数据科学和深度学习
兼具极客思维与人文情怀
欢迎你关注
微信号:python_cat