I've worked in Python for years, and while I suppose I'm glad for any improvement, I have never understood the obsession with true multi-threading. Languages are about trade-offs, and Python, again and again, chooses flexibility over performance. It's a good choice! You can see it in how widely Python is used and the diversity of its applications.
Performance doesn't come from any one quality, but from the holistic goals at each level of the language. I think some of the most frustrating aspects from the history of Python have been when the team lost focus on why and how people used the language (i.e. the 2 -> 3 transition, though I have always loved 3). I hope that this is a sensible optimization and not an over-extension.
Losing the GIL makes the language strictly more flexible. Previous GILectomies tanked performance to an unacceptable degree. For single-threaded code, this one is a moderate performance improvement in some benchmarks and a small detriment in others -- which is about as close to perfect as one could expect from such a change. That's why people are excited about it.
At a higher level, Python is getting serious about performance. But this gives both flexibility and performance.
Call me optimistically skeptical. I share the original comment author's reservations about the GIL obsession, but if this is true:
> The overall effect of this change, and a number of others with it, actually boosts single-threaded performance slightly—by around 10%
Then it sounds like having your cake and eating it too (optimism). Although my experience keeps nagging at me with, "there is no such thing as a free lunch" (skepticism).
The no-GIL version is actually about 8% slower on single-threaded performance than the GIL version, but the author bundled in some unrelated performance improvements that make the no-GIL version overall 10% faster than today's Python.
Right, the 20% boost is unrelated to the Gilectomy.
> though, as Guido van Rossum noted, the Python developers could always just take the performance improvements without the concurrency work and be even faster yet.
Why be 10% faster single threaded when you can be 20% faster single threaded!
> The resulting interpreter is about 9% faster than the no-GIL proof-of-concept (or ~19% faster than CPython 3.9.0a3). That 9% difference between the “nogil” interpreter and the stripped-down “nogil” interpreter can be thought of as the “cost” of the major GIL-removal changes.
Why? It's not like CPython is a speed demon. I'd think there's some low-hanging fruit, simply because performance is such a low priority for the maintainers. It doesn't even do TCO, after all.
> Although my experience keeps nagging at me with, "there is no such thing as a free lunch" (skepticism).
Well, yeah, someone had to make the changes. That's the cost that was paid.
You can get a mass-produced machete that is cheaper and higher-quality than a 7th-century sword. It's easy for one thing to be better than another thing across several dimensions simultaneously. That's why certain technologies go out of use -- they have negative value compared to other technologies. But that has nothing to do with the principle that there's no such thing as a free lunch.
I feel like you aren't well informed on why removing the GIL results in a single-threaded performance hit. And while I think it's always nice to keep in mind the developer effort required, it's not the only cost as GIL removal has been done before (several times, even as far back as Python 1.5 [1]).
The crux of the issue (as I understand it) is that the GIL absolves the Python interpreter of downstream memory access control. You can replace the GIL with memory access controls of various strategies, but the overhead of that access control is just that: overhead. In a multi-threaded program the concurrency gains should outweigh that overhead, but in a single-threaded one it's just extra work that wasn't being done before.
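To make that overhead concrete, here's a toy sketch (this is not how the actual no-GIL branch does it, which uses much more sophisticated techniques; it's just to show that paying for synchronization in a single thread is pure loss):

```python
import threading
import time

counter = 0
lock = threading.Lock()

def bump_unprotected(n):
    # Today's model: the GIL already guarantees only one thread runs
    # bytecode at a time, so no extra synchronization happens here.
    global counter
    for _ in range(n):
        counter += 1

def bump_locked(n):
    # Stand-in for the fine-grained synchronization a GIL-free
    # interpreter would have to perform internally on shared state.
    global counter
    for _ in range(n):
        with lock:
            counter += 1

for fn in (bump_unprotected, bump_locked):
    counter = 0
    start = time.perf_counter()
    fn(1_000_000)
    print(f"{fn.__name__}: {time.perf_counter() - start:.3f}s")
```

Even with zero contention, the locked version pays for every acquire/release. That's roughly the shape of the single-threaded penalty being discussed.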
Which brings us back to no free lunch. It turns out that the claimed "10% faster" without the GIL is actually the result of Gross (the GIL-removal author) making a multitude of unrelated performance improvements. Those improvements raise performance enough that single-threaded no-GIL code (with its overhead) is ~10% faster than today's Python. But as Guido pointed out, the core developers could upstream the performance improvements without the GIL removal:
> To be clear, Sam’s basic approach is a bit slower for single-threaded code, and he admits that. But to sweeten the pot he has also applied a bunch of unrelated speedups that make it faster in general, so that overall it’s always a win. But presumably we could upstream the latter easily, separately from the GIL-freeing part. [2]
To be explicit, I was skeptical because I believed that GIL removal requires adding overhead for managing memory access. Having dug a bit deeper into it that seems confirmed. The proposed GIL removal strategy _is slower for single-threaded code_ like other solutions before it. It turns out the reported performance increase was the result of orthogonal performance improvements overshadowing the overhead of GIL removal.
Put another way, if the performance improvements were upstreamed without removing the GIL the resulting performance increase would be ~20% instead of just ~10%. Which is what Guido was getting at in the quote I cited. Assuming the benchmarks to be true for the moment, this means that removing the GIL on this PoC branch is a 10% performance hit to single-threaded workloads.
> "there is no such thing as a free lunch" (skepticism).
When you carry a heavy suitcase filled with lead and you drop it, things get lighter for free. You paid for it by carrying the damn thing around with you for the whole time.
Well, CPython probably won't ever get there. But Python as a language maybe could.
The GraalPython implementation of Python 3 is built on the JVM, which is a fully thread-safe, high-performance runtime, and Graal/Truffle provide support for speculation on many things. For pure Python it already provides a 5-7x speedup, and the implementation is not really mature. Although at the moment they're working on compatibility, in the future it might be possible to speculatively remove GIL locks, because you have support for things like forcing JITed code to a safepoint and discarding it if you want to change the basic semantics of the language.
How does it relate to PyPy? I read that the latter uses a tracing JIT, while GraalPython builds on Truffle's AST-based one, which basically maps the JVM's primitive structures to Python's and thus makes use of all the man-hours that went into the JVM's development.
But last time I checked, PyPy had much better performance than Graal, even though TruffleJS (the JavaScript interpreter built on the same model as GraalPython) has performance comparable to the V8 engine for long-running code. Though, let me add, the latter is the most actively developed Truffle language.
It's sort of taking Jython's implementation approach to a much greater extreme, and bypassing bytecode, so it isn't limited by the Java semantics anymore.
It resolves a few big problems Jython had:
- GraalPython is Python 3, not Python 2
- It can use native extensions that plug into the CPython interpreter like NumPy, SciPy etc. The C code is itself virtualized and compiled by the JVM!
Yah, that's definitely the future I'm hoping for. What I am worried about are the kind of transition issues I mentioned. Python 2 -> 3 strictly made the language more flexible too - but the Python ecosystem is about existing code almost more than the language and I worry that we could find similar problems here. Potential for plenty of growing pains while chasing relatively small gains.
In the company I'm working for, we had to spend more engineer time on GIL workarounds (dealing with the extra complexity caused by multiprocessing, e.g. patching C++ libraries to put all their state into shared memory) than we needed for the Python 2 -> 3 migration. And we've only managed to parallelize less than half of our workload so far.
Even if this will be a major breaking change to Python, it'll be worth it for us.
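For anyone curious, here's a minimal sketch of what the shared-memory flavour of workaround looks like with the stdlib's multiprocessing.shared_memory (3.8+). The array and the worker are made up for illustration, not our actual code:

```python
from multiprocessing import Process, shared_memory
import numpy as np

def worker(shm_name, shape, dtype):
    # Attach to the block created by the parent instead of copying the data.
    shm = shared_memory.SharedMemory(name=shm_name)
    data = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
    print("partial sum:", data[:1000].sum())
    shm.close()

if __name__ == "__main__":
    big = np.random.rand(10_000_000)
    shm = shared_memory.SharedMemory(create=True, size=big.nbytes)
    view = np.ndarray(big.shape, dtype=big.dtype, buffer=shm.buf)
    view[:] = big  # one copy into shared memory; every process reuses it

    procs = [Process(target=worker, args=(shm.name, big.shape, big.dtype))
             for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

    shm.close()
    shm.unlink()
```

It works, but every object that isn't a flat buffer needs this kind of special-casing, which is exactly where the engineer time goes.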
Python needs to be compiled into machine language to ever have a chance of competing on speed. We can already get around the GIL with multiprocessing, but Python is still too slow even when not bound by copying memory between processes.
The phrase "competing on speed" begs the question "competing...with what?" If the answer is "machine compiled languages", then yes, it's unlikely Python will ever match their speed without also being compiled to machine code, but there are plenty of other interpreted languages with better performance than Python (even ruling out stuff like Java that technically isn't "compiled into machine language" in the way that phrase usually would mean); lots of work is done on JavaScript interpreters to improve performance, and I don't think that specifically has cost the language much flexibility.
I use python. I don’t love it but it has a good selection of libraries for what I do. It’s not blazing fast but not terribly slow either.
As for multiprocessing, I currently have 150 Python processes running on the work cluster, each doing their bit of a large task. The heavy lifting is in a Python library, but it's C code. It's actually not bad performance-wise, and frankly wasn't too bad to code up. I think for my use case threads would make it harder.
Java is technically compiled into machine language; it is a matter of choosing a JDK that offers such options. Many people don't, but that is their problem, not a lack of options.
JavaScript interpreters that people actually use have a JIT built in.
I don't think the goal is to "compete on speed", but I'm sure people wouldn't complain about their Python scripts running 15x faster on their 16 core CPU.
And it is also about flexibility. What I love about Python is the simplicity, and let's be honest, multiprocessing is anything but. Especially if you fall into one of the gotchas (unpicklable data, for example).
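For example, a minimal reproduction of the pickling gotcha: everything you hand to a Pool has to survive a round trip through pickle, and lambdas, nested functions, open files, database handles, etc. don't:

```python
from multiprocessing import Pool

# Fine: a module-level function is picklable.
def square(x):
    return x * x

if __name__ == "__main__":
    with Pool(4) as pool:
        print(pool.map(square, range(8)))

        # Not fine: lambdas (and locally defined functions, open sockets,
        # DB handles, ...) can't be pickled, so they can't be shipped
        # to the worker processes.
        try:
            pool.map(lambda x: x * x, range(8))
        except Exception as exc:
            print(type(exc).__name__, exc)
```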
It's because threads in Python are only really good for parallel I/O, and are ineffective for CPU bound workloads.
This can be a problem for a lot of threading use cases. If I'm working on an ETL app that parses large amounts of data, the related CPU-bound tasks need to either run sequentially, call out to C extensions, or use multiple processes, which incurs an overhead.
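A quick way to see it on a pure-Python, CPU-bound function (a toy sketch; exact timings will vary by machine):

```python
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def cpu_bound(n):
    # Pure-Python busy work; the GIL is held the whole time.
    total = 0
    for i in range(n):
        total += i * i
    return total

def timed(executor_cls):
    start = time.perf_counter()
    with executor_cls(max_workers=4) as ex:
        list(ex.map(cpu_bound, [2_000_000] * 4))
    return time.perf_counter() - start

if __name__ == "__main__":
    print("threads:  ", timed(ThreadPoolExecutor))   # roughly serial: the GIL serializes the work
    print("processes:", timed(ProcessPoolExecutor))  # close to 4x faster, plus process overhead
```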
It's a pain when you know threads would suit your use case well, but the threading implementation in the language you're working in isn't up to the task.
It's very interesting you mentioned ETL tasks. In ETL batch jobs, a unit in the batch is defined small enough to rarely be CPU bound; rather, as you mentioned, it is I/O bound. In what situation must you define a unit of work to be so heavily CPU bound? To me that's a smell for too large of a unit.
I'm using ETL as shorthand for a "I wrote a script at home that parses my data and puts it in a database, and threads might save time" situation. I wouldn't reach for a thread pool for anything serious.
While it would be nice to have the language be able to do it, the fact that you can't today leads to some tidy separation of concerns for parallelization. For instance, many people I've talked to use things like Spark or dask to get high scale on data processing tasks. That means that all of the management of distributed jobs is handled through an easily googleable framework that your ops team can manage, as opposed to needing to build all of that yourself.
I see this as being a nice stopgap solution for those who are too big for single-threaded but not big enough to need Spark.
> Performance doesn't come from any one quality, but from the holistic goals at each level of the language.
It starts to become an issue when you have built a few well-performing subsystems and now want them to run together and interact. With the GIL, your subsystems are suddenly not performing as well anymore. Without the GIL, you can still get good performance (within limits of course).
Performance referring here to throughput and/or latency (responsiveness).
I agree, but I don't do anything that can be split up, and would benefit from sharing memory. That is really the only benefit of removing the GIL. Multiprocessing can do true concurrency, and so can Celery, which even allows you to use multiple computers. The only time that is a pain is when you need to share memory, or I guess maybe if you're low on resources and can't spare the overhead from multiple processes.
I think a JIT would be the best possible improvement for CPython as far as speed is concerned. Though I can imagine there are plenty of people doing processor heavy stuff with c extensions that would benefit from sharing memory. So from their perspective removing the GIL would be a better improvement.
So basically a JIT would help every Python program, and removing the GIL would only help a small subset of Python programs. Though I'm just happy I get to make a living using Python.
Edit: This was in the back of my head, but I didn't mention it, and it would be unfair to dismiss. A JIT does slow down startup time, so for short programs that finish quickly it may make things worse. Though I suspect it would be easy enough to have an option to turn off the JIT at the start of the program.
Python's existing threading support (via the threading module) can already do true concurrency just fine. Concurrency and parallelism are not the same thing. The GIL limits parallelism, making separate OS threads operate concurrently but not in parallel. Removing the GIL will allow threads in Python programs to operate concurrently and in parallel.
>> So basically a JIT would help every Python program, and removing the GIL would only help a small subset of Python programs.
What if the "Global Interpreter Lock" needs to be removed for JIT? I put that in quotes to highlight it because AFAICT no compiled (or JITed) language has such a thing. I think it functions differently than regular stuff like critical sections.
High performance JIT compiling VMs don't use a GIL, they use a different trick called safe points.
The compiled code polls a global or per-thread variable as it runs (but in a very optimized way). When one thread tries to change something that might break another thread, the other threads are brought to a clean halt at specific points in the program where the full state of the abstract interpreter can be reconstructed from the stack and register state. Then the thread stacks are rewritten to force the thread back into the interpreter and the compiled code is deallocated.
The result is that if you need to change something that is in practice only changed very rarely, instead of constantly locking/unlocking a global lock (very, very slow) you replace it with a polling operation (can be very fast as the CPU will execute it speculatively).
However, this requires a lot of very sophisticated low level virtual machinery. The JVM has it. V8 has it. CLR has a limited form of it. Maybe PyPy does, I'm not sure? Most other runtimes do not. For the Python community, very likely the best way to upgrade performance would be to start treating CPython as stable/legacy, then support and encourage efforts like GraalPython. That way the community can re-use all the effort put into the JVM.
PyPy can utilize something called software transactional memory to the same effect.
This gives you an unusually fast Python that is also GIL-less. It doesn't seem to be used much, so there may be some compatibility problems or similar, but for a trivial test it worked just as described many years ago.
It also tells me that the GIL isn't terribly important for most things Python is used for. It certainly isn't for me.
Yes higher core counts are more and more common, but the language has thirty years of single-threaded path-dependence. Lots of elements of it work the way they do because there was a GIL. I could be wrong, but I am skeptical that Python will ever be the best choice for high performance code. It's always worth improving the speed of code when you can, but more often than not you "get" something for going slower. I hope my worries are wrong and this is actually a free win!
No shared memory. To communicate between processes you usually use sockets, to communicate between threads you mutate variables. This is a huge performance difference.
A tangent but I find it amusing to contrast the perpetual Python GIL debate with all the new computation platforms that claim to be focused on scalability. Those are mostly single threaded or max out at a few virtual CPUs (eg "serverless" platforms) and there people applaud it. There people view the isolation as supporting scalability.
Yeah, I know about that argument but it just doesn't make sense to me. Removing the GIL means that 1) you make your language runtime more complex and 2) you make your app more complex.
Is it truly worth it just to avoid some memory overhead? Or is there some other windows specific thing that I'm missing here?
> Yeah, I know about that argument but it just doesn't make sense to me. Removing the GIL means that 1) you make your language runtime more complex and 2) you make your app more complex.
#2 need not be true; e.g., the approach proposed here is transparent to most Python code and even minimizes the impact on C extensions, still exposing the same GIL hook functions which C code would use in the same circumstances, though they have a slightly different effect.
Well actually, on the types of CPUs that OP refers to (128 threads i.e. AMD Threadripper), L3 cache is only shared within each pair of CCXs that form a CCD. If you launch a program with 32 threads, they may have 1, 2, 3 or 4 distinct L3 caches to work with.
Moreover, unless thread pinning is enforced, a given thread will bounce around between different cores during execution, so the number of distinct L3 caches in action will not be constant.
Of course you have the same story with memory, accessing another thread's memory is slower if that thread is on another CCD.
TL;DR NUMA makes life hard if you want to get consistent performance from parallelism.
I mean is there anything here preventing one from only writing their code to be single threaded tho? This is an addition to the capability and not a detraction.
Say your webapp talks to a database or a cache. It'd be really nice if you could use a single connection to that database instead of 64 connections. Or if you wanted to cache some things on the web server, it would be nice if you could have 1 copy easily accessible vs needing 64 copies and needing to fill those caches 64x as much.
Unfortunately using a single db/RPC connection for many active threads is not done in any multithreaded system I’m aware of for good reasons. Sharing this type of resource across threads is not safe without expensive and performance-destroying mutexes. In practice each thread needs exclusive access to its own database connection while it is active. This is normally achieved using connection pooling which can save a few connections when some threads are idle, but 1 connection for 64 active web worker threads is not a recipe for a performant web app. If you can point to a multithreaded web app server that works this way I’d be very interested to hear about it.
The idea of a process-local cache (or other data) shared among all worker threads is a different story. Along with reduced memory consumption, I see this as one of the bigger advantages of threaded app servers. However, preforking multiprocess servers can always use shmget(2) to share memory directly with a bit more work.
> Unfortunately using a single db/RPC connection for many active threads is not done in any multithreaded system I’m aware of for good reasons. Sharing this type of resource across threads is not safe without expensive and performance-destroying mutexes
lol, you're so deep into python stockholm-syndrome "don't share anything between threads because we don't support that at all even a little bit" that you don't even realize that connection pools exist. Instead of holding a connection open per process, you can have one connection pool with 30 connections that services 200 threads (exact ratio depends on how many are actually using connections, of course). literally everybody "shares a single DB/RPC connection across multiple threads" (or at least shares a number of connections across a number of threads), except python.
and yeah you can turn that into yet another standalone service that you gotta deliver in your docker-compose setup, but everybody else just builds that into the application itself.
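For what it's worth, the pattern is tiny to sketch in pure Python; the numbers and make_conn below are placeholders, and real apps would just use whatever their driver or framework ships (SQLAlchemy's pool, psycopg_pool, HikariCP on the JVM, etc.):

```python
import queue
import threading

class ConnectionPool:
    """Hand out a bounded set of connections to many worker threads."""
    def __init__(self, make_conn, size):
        self._idle = queue.Queue()
        for _ in range(size):
            self._idle.put(make_conn())

    def acquire(self):
        return self._idle.get()   # blocks while all connections are checked out

    def release(self, conn):
        self._idle.put(conn)

# Placeholder factory; real code would pass e.g. lambda: psycopg2.connect(dsn).
pool = ConnectionPool(make_conn=object, size=30)

def handle_request():
    conn = pool.acquire()
    try:
        pass  # run queries on conn here
    finally:
        pool.release(conn)

threads = [threading.Thread(target=handle_request) for _ in range(200)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```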
> that you don't even realize that connection pools exist
The GP mentions connection pooling literally three sentences later.
> literally everybody "shares a single DB/RPC connection across multiple threads" (or at least shares a number of connections across a number of threads), except python.
Right, but multiple ≠ many. You're discussing the former. GP is discussing the latter.
Depending on the structure, it can indeed be many. Both in the case of protocols which support multiplexing of requests, and in situations where you have multiple databases (thus a given thread might not need to be talking to a particular database all the time).
Popularity probably has as much to do with ease of access (or lack of alternatives) as with good design of the language, if not more. PHP is equally popular as Python, if not more so.
I'm not a PHP expert, but I did not know it was also used in data science, game programming, embedded programming and machine learning as Python is. Of course they are both used for web services.
PHP doesn’t ship with an API for creating threads, but PHP can be executed in threads depending on setup. And it does that without using a GIL, instead it internally uses something called Thread-Safe Resource Manager
I don't know much about it, but I've heard here and there about Swoole, a "PHP extension for Async IO, Coroutines and Fibers".
> Swoole is a complete PHP async solution that has built-in support for async programming via fibers/coroutines, a range of multi-threaded I/O modules (HTTP Server, WebSockets, TaskWorkers, Process Pools) and support for popular PHP clients like PDO for MySQL, Redis and CURL.
>again and again, chooses flexibility over performance. It's a good choice! You can see it in how widely Python is used and the diversity of its applications.
What does it mean? how is python different here than Java/C#?
This is my main issue with python. The whole GIL thing is basically necessary because of Python's heavily dynamic model, which is almost always used improperly.
Mypy and other static analysis tools are becoming more common in part because IMO they basically require you to stop and think about your "Pythonic" dynamic patterns (containers of mixed element types, duck typing of function arguments, mutable OOP etc.), and often realize that they are a bad idea.
So in some way we are hampering multithreading to support programming constructs that are mostly used to make python flavoured spaghetti, especially in the hands of beginners and non-programmers who are encouraged to learn Python...
I'm not sure Python is fixable at this point. Oh well.
Assume you have a big dictionary containing tens of millions of records, against which another, say, terabyte of data has to be filtered. In current Python land, you either:
1. Use multiprocessing, but each process needs to create its own copy of the dictionary, or
2. Create an external DB and use the DB's client to retrieve data in a certain way.
This pattern has occurred again and again in my use case, and it is always messy to solve in Python. If Python had true multi-threading, sharing a big but read-only object among real threads would be a possibility, and believe me, a lot of people would be really happy.
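A stripped-down version of the dilemma (sizes shrunk so it runs anywhere; the real dictionary would be orders of magnitude bigger):

```python
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

# Stand-in for the big read-only lookup table.
BIG_INDEX = {i: f"record-{i}" for i in range(1_000_000)}

def filter_chunk(chunk):
    # Keep only items that appear in the big index.
    return [x for x in chunk if x in BIG_INDEX]

chunks = [range(i, i + 100_000) for i in range(0, 1_000_000, 100_000)]

if __name__ == "__main__":
    # Threads: every worker sees the same BIG_INDEX, zero copies -- but under
    # the GIL only one thread actually filters at a time.
    with ThreadPoolExecutor(max_workers=8) as ex:
        thread_results = list(ex.map(filter_chunk, chunks))

    # Processes: the filtering runs in parallel, but every worker process has
    # to materialize its own copy of BIG_INDEX (via fork or re-import).
    with ProcessPoolExecutor(max_workers=8) as ex:
        process_results = list(ex.map(filter_chunk, chunks))
```

With free-threaded Python the first version would give you both: one shared dictionary and real parallelism.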
That's a workaround that is not as ergonomic as writing Python more directly. The person working on this GIL project is one of the major maintainers of a key ML library. They have used the C/C++ binding approach but want to make life easier and be able to multi-thread directly in Python.
It wasn't entirely Python's inborn merits. For example, we don't use the best language for each task today. Instead, we use whatever language our coworkers and other companies use, which we often convince ourselves is the best language for a task.
The timing was critical, and although Python may be the Facebook of languages, one can't discount that it was extraordinarily lucky for Guido to be in the precise time and place to capitalize on good design choices.
Ditto with Ruby and web dev. The language was never designed with that application in mind (and very few people used it for that when I got into Ruby around 2000). Path dependence mostly accounts for the ubiquity of Python in science and Ruby in web dev; it just as easily could have gone the other way around.
Note that Python's poor single-threaded performance compared to single-threaded performance of other languages makes the ability to multi-thread that much more crucial. You can sometimes get away with 10x slower code but not 100x slower.
I've had to rewrite my Python code in another language in 3 different projects already (multi-processing wasn't an option) and I'm not even a heavy user. Removing GIL would be very welcome.
> You can sometimes get away with 10x slower code but not 100x slower.
That Python and other languages in its speed class are still used for new projects in production demonstrates that you can, often, get away with 100× slower code. But, sure, you can get away with 10× slower more often.
Think about the datascience use case: you need to load data from disk or network as fast as possible and compute a lot of CPU-bound operations right after that.
Threads will allow you to split your I/O into multiple procedures, so you can start computations as soon as the data is ready. They will also allow you to massively speed up aggregates without having to create a new process each time (which doesn't allow you to share memory). Threads are a BIG issue when you don't want to rely on asyncio [1].
Note [1]: Asyncio pools are single-threaded because of the GIL. This is already bad enough in practice, but they also perform very badly in CPU-bound contexts. This makes them an absolute no-go when dealing with data science code: a CPU-bound core in an I/O-bound wrapper.
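Concretely, the pattern that almost works today (a self-contained toy; the file names and sizes are placeholders): the loads overlap nicely because file I/O releases the GIL, but the pure-Python crunch step still runs one at a time.

```python
import os
from concurrent.futures import ThreadPoolExecutor, as_completed

# Write some placeholder data files so the sketch is self-contained.
paths = []
for i in range(4):
    p = f"part-{i}.bin"
    with open(p, "wb") as f:
        f.write(os.urandom(1_000_000))
    paths.append(p)

def load(path):
    # I/O-bound: the GIL is released while waiting on the disk,
    # so several loads genuinely overlap.
    with open(path, "rb") as f:
        return f.read()

def crunch(blob):
    # CPU-bound pure-Python step: even if you pushed this into the
    # thread pool too, the GIL would serialize it.
    return sum(blob)

with ThreadPoolExecutor(max_workers=4) as ex:
    futures = [ex.submit(load, p) for p in paths]
    for fut in as_completed(futures):
        print(crunch(fut.result()))   # start crunching as soon as each load lands
```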
Support is partial and uneven. Moreover, you don't generally use a single data science library; it's often a mix of pandas, numpy, scikit-learn, custom algorithms, a custom optimization suite, and arrow. Letting individual libraries release the GIL is nice, but you need deep knowledge of those libraries to know which computations are thread-able and which are not.
In practice, for single-machine workloads, it's currently mostly numpy and/or whatever deep learning framework you use that does the number crunching.
This means that, provided the code is operating on sufficiently large amounts of data (such that each call into numpy is of sufficient duration), the multithreading in BLAS / LAPACK within numpy usually gives you weak scaling with respect to thread count without any tricks.
The issue, however, is that this requires converting everything by hand from arrays of structs into structs of arrays, removing as many iterations from Python as possible, potentially balancing thread usage between Python and numpy, etc. By this point, IMO, your "python" code looks more like Fortran or SQL with better string IO...
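For reference, this is the kind of call where the scaling does come for free, assuming numpy is linked against a multithreaded BLAS (OpenBLAS or MKL, which the usual binary wheels are):

```python
import time
import numpy as np

# One large matrix multiply: the heavy lifting happens inside BLAS,
# which typically spreads it across several cores on its own --
# no Python threads and no GIL involved.
n = 4000
a = np.random.rand(n, n)
b = np.random.rand(n, n)

start = time.perf_counter()
c = a @ b
print(f"{n}x{n} matmul took {time.perf_counter() - start:.2f}s")
```

Re-running with OMP_NUM_THREADS=1 (or MKL_NUM_THREADS=1) usually shows how much of the speed comes from BLAS's own threads rather than from anything Python is doing.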
The number-crunching part is already fast enough; however, the aggregation, parsing, and filtering that come beforehand are really, really slow. This part is often done in pure Python because it's often custom code tailored to the data you're manipulating.
We're not interested in scaling up the parts that are already fast, but the rather mundane, uninteresting work that come before.
Lack of multithreading can easily be a win for a language. A tiny subset of problems really needs it these days, and for everything else it's a potential way to either screw things up or make them way more complicated than they need to be.
Doesn't require, sure. Nothing "requires" multithreading. It may benefit from it, though, since threads are lower overhead (context switching and memory) than processes. If you have any shared data, then that too may be a benefit (but I guess your point is that most web requests don't share data).
A lot of people will argue that it's not worth optimising for, that programmer time is more important and expensive. This may be true at the start, and is probably even worthwhile when you need to get a product out fast to test the waters in a startup, but I've worked in multiple companies that have spent significant developer time trying to reduce their cloud infrastructure costs. At that point, having your language and framework make good use of the available hardware can make a real difference to the overall performance and therefore cost, both hardware cost and the engineering time to optimise it later.
The productivity gap between languages that optimise for development and those that try to at least somewhat optimise for runtime really isn't that large nowadays. Even modern Java is quite productive, compared to 10 or 15 years ago. Outside of a startup trying to find product-market fit and building its MVPs super fast to see what works, I think it's usually worth spending a bit more on up-front development time, a one-off cost, to reduce the recurring infrastructure cost.
Of course, it takes a lot more than just a language that supports multithreading to do this, but everything the language and the libraries/frameworks you use do to help you is helpful. I'd rather have a tool that won't get in my way later, that gives me lots of room to grow when performance starts to become an issue, than one where I need to invest in significant and painful development time (based on personal experiences at least) later. This is one area where Go seems to shine, perhaps Rust too, although I have not tried any web backend dev in Rust yet so don't know how productive it would be.
If it's done without causing complications then sure. I'm highly skeptical that it can be, as no other language ever has been able to do that.
Rust got it pretty good, but they designed a significant part of their language so that multithreading could be done well. Python did almost the opposite.
From my perspective as a huge Python fan, efficient multithreading is simply the only major thing missing from the language. I would still use C/C++/assembly for bleeding edge performance needs, but efficient multithreading in Python would have me reaching for alternatives far less often.
Basically I love peanut butter ice cream (Python) I’d just like it even more with sprinkles.
One does not preclude the other: the language can be flexible and offer higher concurrency than it does now. My workstation has 64 hyperthreads. Python can use one at a time. That's messed up, since I use it as a general-purpose language.