I've worked in Python for years, and while I suppose I'm glad for any improvement, I have never understood the obsession with true multi-threading. Languages are about trade-offs, and Python, again and again, chooses flexibility over performance. It's a good choice! You can see it in how widely Python is used and the diversity of its applications.
Performance doesn't come from any one quality, but from the holistic goals at each level of the language. I think some of the most frustrating aspects from the history of Python have been when the team lost focus on why and how people used the language (i.e. the 2 -> 3 transition, though I have always loved 3). I hope that this is a sensible optimization and not an over-extension.
Losing the GIL makes the language strictly more flexible. Previous GILectomies tanked performance to an unacceptable degree. For single-threaded code, this one is a moderate performance improvement in some benchmarks and a small detriment in others -- which is about as close to perfect as one could expect from such a change. That's why people are excited about it.
At a higher level, Python is getting serious about performance. But this gives both flexibility and performance.
Call me optimistically skeptical. I share the original comment author's reservations about the GIL obsession, but if this is true:
> The overall effect of this change, and a number of others with it, actually boosts single-threaded performance slightly—by around 10%
Then it sounds like having your cake and eating it too (optimism). Although my experience keeps nagging at me with, "there is no such thing as a free lunch" (skepticism).
The no-GIL version is actually about 8% slower on single-threaded performance than the GIL version, but the author bundled in some unrelated performance improvements that make the no-GIL version overall 10% faster than today's Python.
Right, the 20% boost is unrelated to the Gilectomy.
> though, as Guido van Rossum noted, the Python developers could always just take the performance improvements without the concurrency work and be even faster yet.
Why be 10% faster single threaded when you can be 20% faster single threaded!
> The resulting interpreter is about 9% faster than the no-GIL proof-of-concept (or ~19% faster than CPython 3.9.0a3). That 9% difference between the “nogil” interpreter and the stripped-down “nogil” interpreter can be thought of as the “cost” of the major GIL-removal changes.
Why? It's not like CPython is a speed demon. I'd think there's some low-hanging fruit, simply because performance is such a low priority for the maintainers. It doesn't even do TCO, after all.
> Although my experience keeps nagging at me with, "there is no such thing as a free lunch" (skepticism).
Well, yeah, someone had to make the changes. That's the cost that was paid.
You can get a mass-produced machete that is cheaper and higher-quality than a 7th-century sword. It's easy for one thing to be better than another thing across several dimensions simultaneously. That's why certain technologies go out of use -- they have negative value compared to other technologies. But that has nothing to do with the principle that there's no such thing as a free lunch.
I feel like you aren't well informed on why removing the GIL results in a single-threaded performance hit. And while I think it's always nice to keep in mind the developer effort required, it's not the only cost as GIL removal has been done before (several times, even as far back as Python 1.5 [1]).
The crux of the issue (as I understand it) is that the GIL absolves the Python interpreter of downstream memory access control. You can replace the GIL with memory access controls of various strategies, but the overhead of that access control is just that: overhead. In a multi-threaded program the concurrency gains should outweigh that overhead, but in a single-threaded one it's just extra work that wasn't being done before.
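To make that overhead concrete, here's a toy sketch (this is not how the actual no-GIL branch does it, which uses much more sophisticated techniques; it's just to show that paying for synchronization in a single thread is pure loss):

```python
import threading
import time

counter = 0
lock = threading.Lock()

def bump_unprotected(n):
    # Today's model: the GIL already guarantees only one thread runs
    # bytecode at a time, so no extra synchronization happens here.
    global counter
    for _ in range(n):
        counter += 1

def bump_locked(n):
    # Stand-in for the fine-grained synchronization a GIL-free
    # interpreter would have to perform internally on shared state.
    global counter
    for _ in range(n):
        with lock:
            counter += 1

for fn in (bump_unprotected, bump_locked):
    counter = 0
    start = time.perf_counter()
    fn(1_000_000)
    print(f"{fn.__name__}: {time.perf_counter() - start:.3f}s")
```

Even with zero contention, the locked version pays for every acquire/release. That's roughly the shape of the single-threaded penalty being discussed.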
Which brings us back to no free lunch. It turns out that the claimed "10% faster" without the GIL is actually the result of Gross (the GIL-removal author) making a multitude of unrelated performance improvements. Those improvements raise performance enough that single-threaded no-GIL code (with its overhead) is ~10% faster than today's Python. But as Guido pointed out, the core developers could upstream the performance improvements without the GIL removal:
> To be clear, Sam’s basic approach is a bit slower for single-threaded code, and he admits that. But to sweeten the pot he has also applied a bunch of unrelated speedups that make it faster in general, so that overall it’s always a win. But presumably we could upstream the latter easily, separately from the GIL-freeing part. [2]
To be explicit, I was skeptical because I believed that GIL removal requires adding overhead for managing memory access. Having dug a bit deeper into it that seems confirmed. The proposed GIL removal strategy _is slower for single-threaded code_ like other solutions before it. It turns out the reported performance increase was the result of orthogonal performance improvements overshadowing the overhead of GIL removal.
Put another way, if the performance improvements were upstreamed without removing the GIL the resulting performance increase would be ~20% instead of just ~10%. Which is what Guido was getting at in the quote I cited. Assuming the benchmarks to be true for the moment, this means that removing the GIL on this PoC branch is a 10% performance hit to single-threaded workloads.
> "there is no such thing as a free lunch" (skepticism).
When you carry a heavy suitcase filled with lead and you drop it, things get lighter for free. You paid for it by carrying the damn thing around with you for the whole time.
Well, CPython probably won't ever get there. But Python as a language maybe could.
The GraalPython implementation of Python 3 is built on the JVM, which is a fully thread-safe, high-performance runtime, and Graal/Truffle provide support for speculation on many things. For pure Python it already provides a 5-7x speedup, and the implementation is not really mature. Although at the moment they're working on compatibility, in the future it might be possible to speculatively remove GIL locks, because you have support for things like forcing JITed code to a safepoint and discarding it if you want to change the basic semantics of the language.
How does it relate to PyPy? I read that the latter uses a tracing JIT, while GraalPython builds on Truffle's AST-based one, which basically maps the JVM's primitive structures to Python's and thus makes use of all the man-hours that went into the JVM's development.
But last time I checked, PyPy had much better performance than Graal, even though TruffleJS (the JavaScript interpreter built on the same model as GraalPython) has performance comparable to the V8 engine for long-running code. Though, let me add, the latter is the most actively developed Truffle language.
It's sort of taking Jython's implementation approach to a much greater extreme, and bypassing bytecode, so it isn't limited by the Java semantics anymore.
It resolves a few big problems Jython had:
- GraalPython is Python 3, not Python 2
- It can use native extensions that plug into the CPython interpreter like NumPy, SciPy etc. The C code is itself virtualized and compiled by the JVM!
Yah, that's definitely the future I'm hoping for. What I am worried about are the kind of transition issues I mentioned. Python 2 -> 3 strictly made the language more flexible too - but the Python ecosystem is about existing code almost more than the language and I worry that we could find similar problems here. Potential for plenty of growing pains while chasing relatively small gains.
In the company I'm working for, we had to spend more engineer time on GIL workarounds (dealing with the extra complexity caused by multiprocessing, e.g. patching C++ libraries to put all their state into shared memory) than we needed for the Python 2 -> 3 migration. And we've only managed to parallelize less than half of our workload so far.
Even if this will be a major breaking change to Python, it'll be worth it for us.
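For anyone curious, here's a minimal sketch of what the shared-memory flavour of workaround looks like with the stdlib's multiprocessing.shared_memory (3.8+). The array and the worker are made up for illustration, not our actual code:

```python
from multiprocessing import Process, shared_memory
import numpy as np

def worker(shm_name, shape, dtype):
    # Attach to the block created by the parent instead of copying the data.
    shm = shared_memory.SharedMemory(name=shm_name)
    data = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
    print("partial sum:", data[:1000].sum())
    shm.close()

if __name__ == "__main__":
    big = np.random.rand(10_000_000)
    shm = shared_memory.SharedMemory(create=True, size=big.nbytes)
    view = np.ndarray(big.shape, dtype=big.dtype, buffer=shm.buf)
    view[:] = big  # one copy into shared memory; every process reuses it

    procs = [Process(target=worker, args=(shm.name, big.shape, big.dtype))
             for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

    shm.close()
    shm.unlink()
```

It works, but every object that isn't a flat buffer needs this kind of special-casing, which is exactly where the engineer time goes.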
Python needs to be compiled into machine language to ever have a chance of competing on speed. We can already get around the GIL with multiprocessing, but Python is still too slow even when not bound by copying memory between processes.
The phrase "competing on speed" begs the question "competing...with what?" If the answer is "machine compiled languages", then yes, it's unlikely Python will ever match their speed without also being compiled to machine code, but there are plenty of other interpreted languages with better performance than Python (even ruling out stuff like Java that technically isn't "compiled into machine language" in the way that phrase usually would mean); lots of work is done on JavaScript interpreters to improve performance, and I don't think that specifically has cost the language much flexibility.
I use python. I don’t love it but it has a good selection of libraries for what I do. It’s not blazing fast but not terribly slow either.
As for multiprocessing, I currently have 150 Python processes running on the work cluster, each doing their bit of a large task. The heavy lifting is in a Python library, but it's C code. It's actually not bad performance-wise, and frankly wasn't too bad to code up. I think for my use case threads would make it harder.
Java is technically compiled into machine language; it is a matter of choosing a JDK that offers such options. Many people don't, but that is their problem, not a lack of options.
JavaScript interpreters that people actually use have a JIT built in.
I don't think the goal is to "compete on speed", but I'm sure people wouldn't complain about their Python scripts running 15x faster on their 16 core CPU.
And it is also about flexibility. What I love about Python is the simplicity, and let's be honest, multiprocessing is anything but. Especially if you fall into one of the gotchas (unpicklable data, for example).
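For example, a minimal reproduction of the pickling gotcha: everything you hand to a Pool has to survive a round trip through pickle, and lambdas, nested functions, open files, database handles, etc. don't:

```python
from multiprocessing import Pool

# Fine: a module-level function is picklable.
def square(x):
    return x * x

if __name__ == "__main__":
    with Pool(4) as pool:
        print(pool.map(square, range(8)))

        # Not fine: lambdas (and locally defined functions, open sockets,
        # DB handles, ...) can't be pickled, so they can't be shipped
        # to the worker processes.
        try:
            pool.map(lambda x: x * x, range(8))
        except Exception as exc:
            print(type(exc).__name__, exc)
```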
It's because threads in Python are only really good for parallel I/O, and are ineffective for CPU bound workloads.
This can be a problem for a lot of threading use cases. If I'm working on an ETL app that parses large amounts of data, the related CPU-bound tasks need to either run sequentially, call out to C extensions, or use multiple processes, which incurs an overhead.
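A quick way to see it on a pure-Python, CPU-bound function (a toy sketch; exact timings will vary by machine):

```python
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def cpu_bound(n):
    # Pure-Python busy work; the GIL is held the whole time.
    total = 0
    for i in range(n):
        total += i * i
    return total

def timed(executor_cls):
    start = time.perf_counter()
    with executor_cls(max_workers=4) as ex:
        list(ex.map(cpu_bound, [2_000_000] * 4))
    return time.perf_counter() - start

if __name__ == "__main__":
    print("threads:  ", timed(ThreadPoolExecutor))   # roughly serial: the GIL serializes the work
    print("processes:", timed(ProcessPoolExecutor))  # close to 4x faster, plus process overhead
```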
It's a pain when you know threads would suit your use case well, but the threading implementation in the language you're working in isn't up to the task.
It's very interesting you mentioned ETL tasks. In ETL batch jobs, a unit in the batch is defined small enough to rarely be CPU bound; rather, as you mentioned, it is I/O bound. In what situation must you define a unit of work to be so heavily CPU bound? To me that's a smell for too large of a unit.
I'm using ETL as shorthand for a "I wrote a script at home that parses my data and puts it in a database, and threads might save time" situation. I wouldn't reach for a thread pool for anything serious.
While it would be nice to have the language be able to do it, the fact that you can't today leads to some tidy separation of concerns for parallelization. For instance, many people I've talked to use things like Spark or dask to get high scale on data processing tasks. That means that all of the management of distributed jobs is handled through an easily googleable framework that your ops team can manage, as opposed to needing to build all of that yourself.
I see this as being a nice stopgap solution for those who are too big for single-threaded but not big enough to need Spark.
> Performance doesn't come from any one quality, but from the holistic goals at each level of the language.
It starts to become an issue when you have built a few well-performing subsystems and now want them to run together and interact. With the GIL, your subsystems are suddenly not performing as well anymore. Without the GIL, you can still get good performance (within limits of course).
Performance referring here to throughput and/or latency (responsiveness).
I agree, but I don't do anything that can be split up, and would benefit from sharing memory. That is really the only benefit of removing the GIL. Multiprocessing can do true concurrency, and so can Celery, which even allows you to use multiple computers. The only time that is a pain is when you need to share memory, or I guess maybe if you're low on resources and can't spare the overhead from multiple processes.
I think a JIT would be the best possible improvement for CPython as far as speed is concerned. Though I can imagine there are plenty of people doing processor heavy stuff with c extensions that would benefit from sharing memory. So from their perspective removing the GIL would be a better improvement.
So basically a JIT would help every Python program, and removing the GIL would only help a small subset of Python programs. Though I'm just happy I get to make a living using Python.
Edit: This was in the back of my head, but I didn't mention it, and it would be unfair to dismiss. A JIT does slow down startup time, so for short programs that finish quickly it may make things worse. Though I suspect it would be easy enough to have an option to turn off the JIT at the start of the program.
Python's existing threading support (via the threading module) can already do true concurrency just fine. Concurrency and parallelism are not the same thing. The GIL limits parallelism, making separate OS threads operate concurrently but not in parallel. Removing the GIL will allow threads in Python programs to operate concurrently and in parallel.
>> So basically a JIT would help every Python program, and removing the GIL would only help a small subset of Python programs.
What if the "Global Interpreter Lock" needs to be removed for JIT? I put that in quotes to highlight it because AFAICT no compiled (or JITed) language has such a thing. I think it functions differently than regular stuff like critical sections.
High performance JIT compiling VMs don't use a GIL, they use a different trick called safe points.
The compiled code polls a global or per-thread variable as it runs (but in a very optimized way). When one thread tries to change something that might break another thread, the other threads are brought to a clean halt at specific points in the program where the full state of the abstract interpreter can be reconstructed from the stack and register state. Then the thread stacks are rewritten to force the thread back into the interpreter and the compiled code is deallocated.
The result is that if you need to change something that is in practice only changed very rarely, instead of constantly locking/unlocking a global lock (very, very slow) you replace it with a polling operation (can be very fast as the CPU will execute it speculatively).
However, this requires a lot of very sophisticated low level virtual machinery. The JVM has it. V8 has it. CLR has a limited form of it. Maybe PyPy does, I'm not sure? Most other runtimes do not. For the Python community, very likely the best way to upgrade performance would be to start treating CPython as stable/legacy, then support and encourage efforts like GraalPython. That way the community can re-use all the effort put into the JVM.
PyPy can utilize something called software transactional memory to the same effect.
This gives you an unusually fast Python that is also GIL-less. It doesn't seem to be used much, so there may be some compatibility problems or similar, but for a trivial test it worked just as described many years ago.
It also tells me that the GIL isn't terribly important for most things Python is used for. It certainly isn't for me.
Yes higher core counts are more and more common, but the language has thirty years of single-threaded path-dependence. Lots of elements of it work the way they do because there was a GIL. I could be wrong, but I am skeptical that Python will ever be the best choice for high performance code. It's always worth improving the speed of code when you can, but more often than not you "get" something for going slower. I hope my worries are wrong and this is actually a free win!
No shared memory. To communicate between processes you usually use sockets, to communicate between threads you mutate variables. This is a huge performance difference.
A tangent but I find it amusing to contrast the perpetual Python GIL debate with all the new computation platforms that claim to be focused on scalability. Those are mostly single threaded or max out at a few virtual CPUs (eg "serverless" platforms) and there people applaud it. There people view the isolation as supporting scalability.
Yeah, I know about that argument but it just doesn't make sense to me. Removing the GIL means that 1) you make your language runtime more complex and 2) you make your app more complex.
Is it truly worth it just to avoid some memory overhead? Or is there some other windows specific thing that I'm missing here?
> Yeah, I know about that argument but it just doesn't make sense to me. Removing the GIL means that 1) you make your language runtime more complex and 2) you make your app more complex.
#2 need not be true; e.g., the approach proposed here is transparent to most Python code and even minimizes the impact on C extensions, still exposing the same GIL hook functions which C code would use in the same circumstances, though they have a slightly different effect.
Well actually, on the types of CPUs that OP refers to (128 threads i.e. AMD Threadripper), L3 cache is only shared within each pair of CCXs that form a CCD. If you launch a program with 32 threads, they may have 1, 2, 3 or 4 distinct L3 caches to work with.
Moreover, unless thread pinning is enforced, a given thread will bounce around between different cores during execution, so the number of distinct L3 caches in action will not be constant.
Of course you have the same story with memory, accessing another thread's memory is slower if that thread is on another CCD.
TL;DR NUMA makes life hard if you want to get consistent performance from parallelism.
I mean is there anything here preventing one from only writing their code to be single threaded tho? This is an addition to the capability and not a detraction.
Say your webapp talks to a database or a cache. It'd be really nice if you could use a single connection to that database instead of 64 connections. Or if you wanted to cache some things on the web server, it would be nice if you could have 1 copy easily accessible vs needing 64 copies and needing to fill those caches 64x as much.
Unfortunately using a single db/RPC connection for many active threads is not done in any multithreaded system I’m aware of for good reasons. Sharing this type of resource across threads is not safe without expensive and performance-destroying mutexes. In practice each thread needs exclusive access to its own database connection while it is active. This is normally achieved using connection pooling which can save a few connections when some threads are idle, but 1 connection for 64 active web worker threads is not a recipe for a performant web app. If you can point to a multithreaded web app server that works this way I’d be very interested to hear about it.
The idea of a process-local cache (or other data) shared among all worker threads is a different story. Along with reduced memory consumption, I see this as one of the bigger advantages of threaded app servers. However, preforking multiprocess servers can always use shmget(2) to share memory directly with a bit more work.
> Unfortunately using a single db/RPC connection for many active threads is not done in any multithreaded system I’m aware of for good reasons. Sharing this type of resource across threads is not safe without expensive and performance-destroying mutexes
lol, you're so deep into python stockholm-syndrome "don't share anything between threads because we don't support that at all even a little bit" that you don't even realize that connection pools exist. Instead of holding a connection open per process, you can have one connection pool with 30 connections that services 200 threads (exact ratio depends on how many are actually using connections, of course). literally everybody "shares a single DB/RPC connection across multiple threads" (or at least shares a number of connections across a number of threads), except python.
and yeah you can turn that into yet another standalone service that you gotta deliver in your docker-compose setup, but everybody else just builds that into the application itself.
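For what it's worth, the pattern is tiny to sketch in pure Python; the numbers and make_conn below are placeholders, and real apps would just use whatever their driver or framework ships (SQLAlchemy's pool, psycopg_pool, HikariCP on the JVM, etc.):

```python
import queue
import threading

class ConnectionPool:
    """Hand out a bounded set of connections to many worker threads."""
    def __init__(self, make_conn, size):
        self._idle = queue.Queue()
        for _ in range(size):
            self._idle.put(make_conn())

    def acquire(self):
        return self._idle.get()   # blocks while all connections are checked out

    def release(self, conn):
        self._idle.put(conn)

# Placeholder factory; real code would pass e.g. lambda: psycopg2.connect(dsn).
pool = ConnectionPool(make_conn=object, size=30)

def handle_request():
    conn = pool.acquire()
    try:
        pass  # run queries on conn here
    finally:
        pool.release(conn)

threads = [threading.Thread(target=handle_request) for _ in range(200)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```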
> that you don't even realize that connection pools exist
The GP mentions connection pooling literally three sentences later.
> literally everybody "shares a single DB/RPC connection across multiple threads" (or at least shares a number of connections across a number of threads), except python.
Right, but multiple ≠ many. You're discussing the former. GP is discussing the latter.
Depending on the structure, it can indeed be many. Both in the case of protocols which support multiplexing of requests, and in situations where you have multiple databases (thus a given thread might not need to be talking to a particular database all the time).
Popularity probably has as much to do with ease of access (or lack of alternatives) as with good design of the language, if not more. PHP is equally popular as Python, if not more so.
I'm not a PHP expert, but I did not know it was also used in data science, game programming, embedded programming and machine learning as Python is. Of course they are both used for web services.
PHP doesn’t ship with an API for creating threads, but PHP can be executed in threads depending on setup. And it does that without using a GIL, instead it internally uses something called Thread-Safe Resource Manager
I don't know much about it, but I've heard here and there about Swoole, a "PHP extension for Async IO, Coroutines and Fibers".
> Swoole is a complete PHP async solution that has built-in support for async programming via fibers/coroutines, a range of multi-threaded I/O modules (HTTP Server, WebSockets, TaskWorkers, Process Pools) and support for popular PHP clients like PDO for MySQL, Redis and CURL.
>again and again, chooses flexibility over performance. It's a good choice! You can see it in how widely Python is used and the diversity of its applications.
What does it mean? how is python different here than Java/C#?
This is my main issue with python. The whole GIL thing is basically necessary because of Python's heavily dynamic model, which is almost always used improperly.
Mypy and other static analysis tools are becoming more common in part because IMO they basically require you to stop and think about your "Pythonic" dynamic patterns (containers of mixed element types, duck typing of function arguments, mutable OOP etc.), and often realize that they are a bad idea.
So in some way we are hampering multithreading to support programming constructs that are mostly used to make python flavoured spaghetti, especially in the hands of beginners and non-programmers who are encouraged to learn Python...
I'm not sure Python is fixable at this point. Oh well.
Assume you have a big dictionary containing tens of millions of records, against which another, say, terabyte of data has to be filtered. In current Python land, you either:
1. Use multiprocessing, but each process needs to create its own copy of the dictionary, or
2. Create an external DB and use the DB's client to retrieve data in a certain way.
This pattern has occurred again and again in my use case, and it is always messy to solve in Python. If Python had true multi-threading, sharing a big but read-only object among real threads would be a possibility, and believe me, a lot of people would be really happy.
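A stripped-down version of the dilemma (sizes shrunk so it runs anywhere; the real dictionary would be orders of magnitude bigger):

```python
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

# Stand-in for the big read-only lookup table.
BIG_INDEX = {i: f"record-{i}" for i in range(1_000_000)}

def filter_chunk(chunk):
    # Keep only items that appear in the big index.
    return [x for x in chunk if x in BIG_INDEX]

chunks = [range(i, i + 100_000) for i in range(0, 1_000_000, 100_000)]

if __name__ == "__main__":
    # Threads: every worker sees the same BIG_INDEX, zero copies -- but under
    # the GIL only one thread actually filters at a time.
    with ThreadPoolExecutor(max_workers=8) as ex:
        thread_results = list(ex.map(filter_chunk, chunks))

    # Processes: the filtering runs in parallel, but every worker process has
    # to materialize its own copy of BIG_INDEX (via fork or re-import).
    with ProcessPoolExecutor(max_workers=8) as ex:
        process_results = list(ex.map(filter_chunk, chunks))
```

With free-threaded Python the first version would give you both: one shared dictionary and real parallelism.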
That's a workaround that is not as ergonomic as writing Python more directly. The person working on this GIL project is one of the major maintainers of a key ML library. They have used the C/C++ binding approach but want to make life easier and be able to multi-thread directly in Python.
It wasn't entirely Python's inborn merits. For example, we don't use the best language for each task today. Instead, we use whatever language our coworkers and other companies use, which we often convince ourselves is the best language for a task.
The timing was critical, and although Python may be the Facebook of languages, one can't discount that it was extraordinarily lucky for Guido to be in the precise time and place to capitalize on good design choices.
Ditto with Ruby and web dev. The language was never designed with that application in mind (and very few people used it for that when I got into Ruby around 2000). Path dependence mostly accounts for the ubiquity of Python in science and Ruby in web dev; it just as easily could have gone the other way around.
Note that Python's poor single-threaded performance compared to single-threaded performance of other languages makes the ability to multi-thread that much more crucial. You can sometimes get away with 10x slower code but not 100x slower.
I've had to rewrite my Python code in another language in 3 different projects already (multi-processing wasn't an option) and I'm not even a heavy user. Removing GIL would be very welcome.
> You can sometimes get away with 10x slower code but not 100x slower.
That Python and other languages in its speed class are still used for new projects in production demonstrates that you can, often, get away with 100× slower code. But, sure, you can get away with 10× slower more often.
Think about the datascience use case: you need to load data from disk or network as fast as possible and compute a lot of CPU-bound operations right after that.
Threads will allow you to split your I/O into multiple procedures, so you can start computations as soon as the data is ready. They will also allow you to massively speed up aggregates without having to create a new process each time (which doesn't allow you to share memory). Threads are a BIG issue when you don't want to rely on asyncio [1].
Note [1]: Asyncio pools are single-threaded because of the GIL. This is already bad enough in practice, but they also perform very badly in CPU-bound contexts. This makes them an absolute no-go when dealing with data science code: a CPU-bound core in an I/O-bound wrapper.
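Concretely, the pattern that almost works today (a self-contained toy; the file names and sizes are placeholders): the loads overlap nicely because file I/O releases the GIL, but the pure-Python crunch step still runs one at a time.

```python
import os
from concurrent.futures import ThreadPoolExecutor, as_completed

# Write some placeholder data files so the sketch is self-contained.
paths = []
for i in range(4):
    p = f"part-{i}.bin"
    with open(p, "wb") as f:
        f.write(os.urandom(1_000_000))
    paths.append(p)

def load(path):
    # I/O-bound: the GIL is released while waiting on the disk,
    # so several loads genuinely overlap.
    with open(path, "rb") as f:
        return f.read()

def crunch(blob):
    # CPU-bound pure-Python step: even if you pushed this into the
    # thread pool too, the GIL would serialize it.
    return sum(blob)

with ThreadPoolExecutor(max_workers=4) as ex:
    futures = [ex.submit(load, p) for p in paths]
    for fut in as_completed(futures):
        print(crunch(fut.result()))   # start crunching as soon as each load lands
```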
Support is partial and uneven. Moreover, you don't generally use a single data science library; it's often a mix of pandas, numpy, scikit-learn, custom algorithms, a custom optimization suite, and arrow. Letting individual libraries release the GIL is nice, but you need deep knowledge of those libraries to know which computations are thread-able and which are not.
In practice, for single-machine workloads, it's currently mostly numpy and/or whatever deep learning framework you use that does the number crunching.
This means that, provided the code is operating on sufficiently large amounts of data (such that each call into numpy is of sufficient duration), the multithreading in BLAS / LAPACK within numpy usually gives you weak scaling with respect to thread count without any tricks.
The issue, however, is that this requires converting everything by hand from arrays of structs into structs of arrays, removing as many iterations from Python as possible, potentially balancing thread usage between Python and numpy, etc. By this point, IMO, your "python" code looks more like Fortran or SQL with better string IO...
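For reference, this is the kind of call where the scaling does come for free, assuming numpy is linked against a multithreaded BLAS (OpenBLAS or MKL, which the usual binary wheels are):

```python
import time
import numpy as np

# One large matrix multiply: the heavy lifting happens inside BLAS,
# which typically spreads it across several cores on its own --
# no Python threads and no GIL involved.
n = 4000
a = np.random.rand(n, n)
b = np.random.rand(n, n)

start = time.perf_counter()
c = a @ b
print(f"{n}x{n} matmul took {time.perf_counter() - start:.2f}s")
```

Re-running with OMP_NUM_THREADS=1 (or MKL_NUM_THREADS=1) usually shows how much of the speed comes from BLAS's own threads rather than from anything Python is doing.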
The number-crunching part is already fast enough; however, the aggregation, parsing, and filtering that come beforehand are really, really slow. This part is often done in pure Python because it's often custom code tailored to the data you're manipulating.
We're not interested in scaling up the parts that are already fast, but the rather mundane, uninteresting work that come before.
Lack of multithreading can easily be a win for a language. A tiny subset of problems really needs it these days, and for everything else it's a potential way to either screw things up or make them way more complicated than they need to be.
Doesn't require, sure. Nothing "requires" multithreading. It may benefit from it, though, since threads are lower overhead (context switching and memory) than processes. If you have any shared data, then that too may be a benefit (but I guess your point is that most web requests don't share data).
A lot of people will argue that it's not worth optimising for, that programmer time is more important and expensive. This may be true at the start, and is probably even worthwhile when you need to get a product out fast to test the waters in a startup, but I've worked in multiple companies that have spent significant developer time trying to reduce their cloud infrastructure costs. At that point, having your language and framework make good use of the available hardware can make a real difference to the overall performance and therefore cost, both hardware cost and the engineering time to optimise it later.
The productivity gap between languages that optimise for development and those that try to at least somewhat optimise for runtime really isn't that large nowadays. Even modern Java is quite productive, compared to 10 or 15 years ago. Outside of a startup trying to find product-market fit and building its MVPs super fast to see what works, I think it's usually worth spending a bit more on up-front development time, a one-off cost, to reduce the recurring infrastructure cost.
Of course, it takes a lot more than just a language that supports multithreading to do this, but everything the language and the libraries/frameworks you use do to help you is helpful. I'd rather have a tool that won't get in my way later, that gives me lots of room to grow when performance starts to become an issue, than one where I need to invest in significant and painful development time (based on personal experiences at least) later. This is one area where Go seems to shine, perhaps Rust too, although I have not tried any web backend dev in Rust yet so don't know how productive it would be.
If it's done without causing complications then sure. I'm highly skeptical that it can be, as no other language ever has been able to do that.
Rust got it pretty good, but they designed a significant part of their language so that multithreading could be done well. Python did almost the opposite.
From my perspective as a huge Python fan, efficient multithreading is simply the only major thing missing from the language. I would still use C/C++/assembly for bleeding edge performance needs, but efficient multithreading in Python would have me reaching for alternatives far less often.
Basically I love peanut butter ice cream (Python) I’d just like it even more with sprinkles.
One does not preclude the other: the language can be flexible and offer higher concurrency than it does now. My workstation has 64 hyperthreads. Python can use one at a time. That's messed up, since I use it as a general-purpose language.