This does not solve the issue of compute scalability: slow computations, which are fundamentally opaque, applied to large data frames. Given a series of data frames (or one large one that can be chunked), how do I apply a long-running function to each chunk? For that you need scalability across cores and machines, hence Dask.
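For concreteness, a minimal sketch of that pattern with Dask's map_partitions; the file name and slow_transform are made up for illustration:

    import dask.dataframe as dd
    import pandas as pd

    # Stand-in for whatever long-running, per-chunk computation you need.
    def slow_transform(chunk: pd.DataFrame) -> pd.DataFrame:
        # pretend this takes minutes per chunk
        return chunk.assign(row_sum=chunk.sum(axis=1, numeric_only=True))

    df = pd.read_csv("big_input.csv")         # hypothetical input
    ddf = dd.from_pandas(df, npartitions=16)  # split into chunks

    # map_partitions applies the function to each chunk; Dask schedules the
    # work across cores, or across machines with a distributed cluster.
    result = ddf.map_partitions(slow_transform).compute()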
Why do you consider computations to be opaque? Do you not have the source code?
There is a ton of low-hanging fruit, speed-wise, in many computations that people treat as black boxes. Often it comes from knowing something extra about the specific input data rather than relying on a generic implementation.
In some cases, all you need is to write NumPy code instead of Pandas code for a 2-3x speedup. Then suddenly your small-cluster program runs on one machine.
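As a rough illustration (not a benchmark), this is the kind of rewrite meant here: the same arithmetic expressed on the underlying arrays instead of through Series operations, which skips Pandas' per-operation overhead:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.random.rand(1_000_000, 2), columns=["a", "b"])

    # Pandas version: every operator goes through index alignment and
    # Series construction.
    def pandas_version(df):
        return (df["a"] * df["b"] + df["a"]) / (df["b"] + 1)

    # NumPy version: pull the raw arrays out once and compute on them
    # directly.
    def numpy_version(df):
        a = df["a"].to_numpy()
        b = df["b"].to_numpy()
        return (a * b + a) / (b + 1)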
Besides the speedup from using native NumPy, there's also the potential for a 50-100x speedup if your code isn't vectorized to begin with, and anywhere from 1-1000x if there's a couple of joins in there that you can optimize.
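For the unvectorized case, a toy example of what typically leaves that much on the table: a per-row apply() versus the same logic as whole-array operations (the exact speedup depends entirely on the data and the function):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"price": np.random.rand(1_000_000) * 100,
                       "qty": np.random.randint(1, 10, size=1_000_000)})

    # Unvectorized: a Python-level lambda called once per row.
    def row_by_row(df):
        return df.apply(lambda r: r["price"] * r["qty"] if r["qty"] > 5 else 0.0,
                        axis=1)

    # Vectorized: the same logic as whole-array operations in compiled code.
    def vectorized(df):
        price = df["price"].to_numpy()
        qty = df["qty"].to_numpy()
        return np.where(qty > 5, price * qty, 0.0)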
But for the latter, see the discussion elsewhere in these comments on shifting the Pandas compute to an RDBMS.
SK Learn is one of the most popular ML libraries: well written, source code available, etc. But I am not opening it up to optimize it, and neither should anyone else unless they are already an SK Learn contributor or have a ton of time on their hands.