I don't know that the ML community necessarily _needs_ a better language than Py...

cgarciae · on April 9, 2020

I think the ML community really needs a better language than Python but not because of the ML part, that works really good, its because of the Data Engineering part (which is 80-90% of most projects) where python really struggles for being slow and not having true parallelism (multiprocessing is suboptimal).

That said I love Python as a language, but if it doesn't fix its issues, on the (very) long run its inevitable the data science community will move to a better solution. Python 4 should focus 100% of JIT compilation.

kyllo · on April 9, 2020

I've found it generally best to push as much of that data prep work down to the database layer, as you possibly can. For small/medium datasets that usually means doing it in SQL, for larger data it may mean using Hadoop/Spark tools to scale horizontally.

I really try to take advantage of the database to avoid ever having to munge very large CSVs in pandas. So like 80-90% of my work is done in query languages in a database, the remaining 10-20% is in Python (or sometimes R) once my data is cooked down to a small enough size to easily fit in local RAM. If the data is still too big, I will just sample it.

blt · on April 9, 2020

Is this tangential advice, or an argument that the current tools are good enough?

kyllo · on April 9, 2020

It's an argument that Python being slow / single-threaded isn't the biggest problem with Python in data engineering. The biggest problem is the need to process data that doesn't fit in RAM on any single machine. So you need on-disk data structures and algorithms that can process them efficiently. If your strategy for data engineering is to load whole CSV files into RAM, replacing Python with a faster language will raise your vertical scaling limit a bit, but beyond a certain scale it won't help anymore and you'll have to switch to a distributed processing model anyway.

mountainriver · on April 9, 2020

Yep this is the key, its not really the data science end its the engineering piece.

mountainriver · on April 9, 2020

Can you get things done in python/c++ sure, but the two language problem is a well known issue, and python has a number of problems. People certainly want a better option, and google investing as much as they did validates that notion.

kyllo · on April 9, 2020

Yes, so to me, the key question is not whether Swift can replace Python's role, but whether it can replace C++'s role, and thereby also making Python's role unnecessary and solving the two-language problem in the process.

mountainriver · on April 9, 2020

I think we can all agree that C++ is a dragon that needs to be slain here. Swift could potentially get close to that for most of the needs, but I still wouldn't bet data scientists would write swift.

kyllo · on April 10, 2020

As a data scientist, most of my projects have been individual--I'm generally the only person writing and reading my code. No one tells me which language I have to use. Python and R are the most popular, and I use either one depending on which has better packages for the task at hand. I don't use Julia because I don't see enough of a benefit to switching at this point. But I really don't care, they're just tools, and I will use any language, Julia, Swift, whatever, if I see enough of a benefit to learning it. I would just take a day or two and learn enough of it to write my scripts in it.

So I think that's the good news--because of the more independent nature of the work, you generally can win data scientists over to a new language one at a time, you don't necessarily need to win over an entire organization at once.

Getting a company or a large open-source project to switch from C++ to Swift or Rust or whatever, seems much harder.

m_ke · on April 9, 2020

Ideally they'd get behind a strict subset of typed python that could be compiled the same way that cython is. Numba, PyTorch JIT and Jax are already handling a decent chunk of the language.

pjmlp · on April 10, 2020

Just like RPython.

kyllo · on April 10, 2020

RPython is not intended for humans to write programs in, it's for implementing interpreters. If you're after a faster Python, you should use PyPy not RPython.

Numba gives you JIT compilation annotations for parallel vector operations--it's a little bit like OpenMP for Python, in a way.

pjmlp · on April 10, 2020

I just look forward to have a proper JIT as part of regular Python, as PyPy still seems to be an underdog, and JIT research for dynamic languages on GraalVM and OpenJ9 seems more focused on Ruby, hence why I kind of hope that Julia puts some pressure into the eco-system.