This does not solve the issue of compute scalability: slow computations, which are fundamentally opaque, applied to large data frames. Given a series of data frames (or one large one that can be chunked), how do I apply a long-running function to each chunk? For that you need scalability across cores and machines, hence Dask.
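For concreteness, a minimal sketch of that pattern with Dask's map_partitions; the file name and slow_transform are made up for illustration:

    import dask.dataframe as dd
    import pandas as pd

    # Stand-in for whatever long-running, per-chunk computation you need.
    def slow_transform(chunk: pd.DataFrame) -> pd.DataFrame:
        # pretend this takes minutes per chunk
        return chunk.assign(row_sum=chunk.sum(axis=1, numeric_only=True))

    df = pd.read_csv("big_input.csv")         # hypothetical input
    ddf = dd.from_pandas(df, npartitions=16)  # split into chunks

    # map_partitions applies the function to each chunk; Dask schedules the
    # work across cores, or across machines with a distributed cluster.
    result = ddf.map_partitions(slow_transform).compute()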
Why do you consider computations to be opaque? Do you not have the source code?
There is a ton of low-hanging fruit, speed-wise, in many computations that people treat as black boxes. Often it comes from knowing something extra about the specific input data rather than relying on a generic implementation.
In some cases, all you need is to write NumPy code instead of Pandas code for a 2-3x speedup. Then suddenly your small-cluster program runs on one machine.
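As a rough illustration (not a benchmark), this is the kind of rewrite meant here: the same arithmetic expressed on the underlying arrays instead of through Series operations, which skips Pandas' per-operation overhead:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.random.rand(1_000_000, 2), columns=["a", "b"])

    # Pandas version: every operator goes through index alignment and
    # Series construction.
    def pandas_version(df):
        return (df["a"] * df["b"] + df["a"]) / (df["b"] + 1)

    # NumPy version: pull the raw arrays out once and compute on them
    # directly.
    def numpy_version(df):
        a = df["a"].to_numpy()
        b = df["b"].to_numpy()
        return (a * b + a) / (b + 1)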
Besides the speedup from using native NumPy, there's also the potential for a 50-100x speedup if your code isn't vectorized to begin with, and anywhere from 1-1000x if there's a couple of joins in there that you can optimize.
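For the unvectorized case, a toy example of what typically leaves that much on the table: a per-row apply() versus the same logic as whole-array operations (the exact speedup depends entirely on the data and the function):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"price": np.random.rand(1_000_000) * 100,
                       "qty": np.random.randint(1, 10, size=1_000_000)})

    # Unvectorized: a Python-level lambda called once per row.
    def row_by_row(df):
        return df.apply(lambda r: r["price"] * r["qty"] if r["qty"] > 5 else 0.0,
                        axis=1)

    # Vectorized: the same logic as whole-array operations in compiled code.
    def vectorized(df):
        price = df["price"].to_numpy()
        qty = df["qty"].to_numpy()
        return np.where(qty > 5, price * qty, 0.0)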
But for the latter, see the discussion elsewhere in these comments on shifting the Pandas compute to an RDBMS.
SK Learn is one of the most popular ML libraries: well written, source code available, etc. But I am not opening it up to optimize it, and neither should anyone else unless they are already an SK Learn contributor or have a ton of time on their hands.