I've just spent the morning implementing various "faster" median algorithms, and it turns out that constant factors, function-call overhead (for the recursive quickselects), and so on make all the alternative approaches quite slow. The sort-then-select approach is pretty damn fast, especially since sorted() is implemented in C in CPython...
So far I've tried median-of-medians, recursive quickselect, iterative quickselect (tracking indices and partitioning in place), and heapq.nlargest. The implementation in Python 3.4.0 is both cleaner and easier to read, and an order of magnitude faster than anything I can write, for 10,000-element lists. I'm sure someone else here can do better than I did, but they'd have a hard time beating CPython, IMO.
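For concreteness, here's a minimal sketch of the kind of comparison I mean. The names median_sort and median_quickselect and the exact partitioning details are mine for illustration, not anything from the standard library:

```python
import random
import statistics
import timeit

def median_sort(xs):
    """Median via sort-then-select; leans on CPython's C-level Timsort."""
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    if n % 2:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2

def median_quickselect(xs):
    """Median via an iterative quickselect, partitioning a copy in place."""
    xs = list(xs)
    n = len(xs)
    if n % 2:
        return _select(xs, n // 2)
    return (_select(xs, n // 2 - 1) + _select(xs, n // 2)) / 2

def _select(xs, k):
    """Return the k-th smallest element, using Hoare-style partitioning."""
    lo, hi = 0, len(xs) - 1
    while True:
        if lo == hi:
            return xs[lo]
        pivot = xs[random.randint(lo, hi)]
        i, j = lo, hi
        while i <= j:
            while xs[i] < pivot:
                i += 1
            while xs[j] > pivot:
                j -= 1
            if i <= j:
                xs[i], xs[j] = xs[j], xs[i]
                i += 1
                j -= 1
        # Now xs[lo..j] <= pivot <= xs[i..hi]; narrow to the side holding k.
        if k <= j:
            hi = j
        elif k >= i:
            lo = i
        else:
            return xs[k]  # everything between j and i equals the pivot

data = [random.random() for _ in range(10000)]
assert median_sort(data) == median_quickselect(data) == statistics.median(data)
print("sort:       ", timeit.timeit(lambda: median_sort(data), number=100))
print("quickselect:", timeit.timeit(lambda: median_quickselect(data), number=100))
```

On my machine the sort-based version wins comfortably at this size; your mileage may vary with larger inputs.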
The reason it's tuned for non-primitives is that it's optimized for sorting lists where comparison is relatively expensive, such as the dereferencing Python does on every item and Java does for non-primitive items.
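A quick way to see the comparison cost dominating; Boxed here is a made-up wrapper class, just to force comparisons through a Python-level `__lt__`:

```python
import random
import timeit

class Boxed:
    """Wrapper whose comparison runs Python-level code on every call,
    mimicking the dereference-plus-rich-comparison cost of non-primitives."""
    __slots__ = ("value",)

    def __init__(self, value):
        self.value = value

    def __lt__(self, other):
        return self.value < other.value

ints = [random.randrange(1000000) for _ in range(100000)]
boxed = [Boxed(v) for v in ints]

print("ints:  ", timeit.timeit(lambda: sorted(ints), number=10))
print("boxed: ", timeit.timeit(lambda: sorted(boxed), number=10))
```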
Thanks for putting the effort in and testing it! Very interesting results.
That's very interesting. I used a C implementation of quickselect in the past that was way faster than sorting. That said, my data was also much larger than 10,000 elements.
FWIW, here is some old timing data on a 2008-era Mac:

| n | approx. time (ms) |
|------------:|-------:|
| 10 | <1 |
| 100 | <1 |
| 1,000 | <1 |
| 10,000 | <1 |
| 100,000 | 30 |
| 1,000,000 | 230 |
| 10,000,000 | 2,530 |
| 100,000,000 | 33,000 |
You are probably right that pure Python will have trouble competing with a C sort, except at quite large N.
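If anyone wants numbers like these for their own machine, a sketch along these lines should work; I'm assuming statistics.median on a list of random floats is a fair stand-in for whatever was originally timed:

```python
import random
import statistics
import timeit

for n in (10, 100, 1000, 10000, 100000, 1000000):
    data = [random.random() for _ in range(n)]
    # Scale the repeat count down as n grows so the whole loop stays quick.
    reps = max(1, 1000000 // n)
    t = timeit.timeit(lambda: statistics.median(data), number=reps)
    print("n={0:>9}: {1:8.3f} ms".format(n, t / reps * 1000.0))
```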