> We tried upgrading a few times. 1.8, 1.9, and 1.10. None of it helped. We made this change in May 2019. Just getting around to the blog post now since we've been busy.
> Another Discord engineer chiming in here. I worked on trying to fix these spikes on the Go service for a couple weeks. We did indeed try moving up to the latest Go at the time (1.10), but this had no effect.
> For a more detailed explanation, it helps to understand what is going on here. It is not the increased CPU utilization that causes the latency. Rather, it's that Go pauses the entire world for the length of the latency spike. During this time, Go has completely suspended all goroutines, preventing them from doing any work, which shows up as latency in requests.
> The specific cause of this seems to be that we used a large free-list-like structure: a very long linked list. The head of the list is maintained as a variable, which means that Go's mark phase must start scanning from the head and then pointer-chase its way through the list. For whatever reason, Go does (did?) this section in a single-threaded manner with a global lock held. As a result, everything must wait until this extremely long pointer chase completes.
> It's possible that 1.12 does fix this, but we had already tried upgrading a few times on releases that promised GC fixes and never saw a fix to this issue. I feel the team made a pragmatic choice to divest from Go after giving the language a fair shot at salvaging the project.
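To make the structure described above concrete, here is a minimal sketch in Go of a singly linked free list whose head is held in a package-level variable (the names are hypothetical, not Discord's actual code). Because that head is a GC root, the mark phase has to follow every `next` pointer in turn before the list is fully scanned:

```go
package main

// node is a hypothetical entry in the free list; freed buffers are pushed
// back onto the list instead of being released to the garbage collector.
type node struct {
	buf  []byte
	next *node // the pointer the mark phase must chase, node by node
}

// freeHead is a package-level root: the collector starts marking here and
// must traverse the entire chain to prove every node is still live.
var freeHead *node

// put returns a buffer to the free list.
func put(n *node) {
	n.next = freeHead
	freeHead = n
}

// get pops a buffer off the free list, or reports that it is empty.
func get() (*node, bool) {
	if freeHead == nil {
		return nil, false
	}
	n := freeHead
	freeHead = n.next
	n.next = nil
	return n, true
}

func main() {
	// With millions of entries, marking this list means millions of
	// dependent pointer loads, all rooted at a single variable.
	for i := 0; i < 1000000; i++ {
		put(&node{buf: make([]byte, 64)})
	}
	_, _ = get()
}
```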
The two main cases when linked lists are better are (a) when you want to guarantee a low cost per insert, since vector insertion is only O(1) amortized, and (b) when you want to insert in the middle, but have somehow already found that middle point without scanning the list.
Anyway, in this case, I guess they're using a free list because (1) it's simpler since you don't need an external collection keeping a list of unused stuff, and (2) reason (a) above.
Even if you need middle inserts but not a B-tree (weird), it’s still better to use a vector in most cases. Time to find the insertion point will dominate.
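To make the vector argument concrete, here is a hedged sketch (hypothetical names, not taken from the article) of the same pooling idea backed by a slice: entries live in one contiguous block and free slots are tracked as indices, so when the entries themselves contain no pointers there is nothing for the collector to chase node by node:

```go
package main

// entry is a hypothetical pooled object; it holds no pointers, so the GC
// does not need to scan inside it.
type entry struct {
	data [64]byte
	used bool
}

// pool is a slice-backed free list: entries live contiguously and free
// slots are tracked by index rather than by a chain of pointers.
type pool struct {
	entries []entry
	free    []int // stack of indices into entries; contains no pointers
}

// get hands out a free slot, growing the pool (amortized O(1)) if needed.
func (p *pool) get() int {
	if n := len(p.free); n > 0 {
		i := p.free[n-1]
		p.free = p.free[:n-1]
		p.entries[i].used = true
		return i
	}
	p.entries = append(p.entries, entry{used: true})
	return len(p.entries) - 1
}

// put returns a slot to the pool by pushing its index onto the free stack.
func (p *pool) put(i int) {
	p.entries[i].used = false
	p.free = append(p.free, i)
}

func main() {
	var p pool
	i := p.get()
	p.put(i)
}
```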
> ... but that statement there doesn’t say anything about the heap size, including the size and count of live objects (i.e., not garbage).
Not sure why you got downvoted; you're actually right and I'm wrong: I misread that and/or assumed one meant the other.
That said, this is a case that should be ideal for generational GC, which Go specifically eschewed at one point. I'm not sure whether that is still the case, however; I have yet to wade through this[1] to update my knowledge here.
This post needed a lot more depth for the reader to really understand what was going on. Statements like
> During garbage collection, Go has to do a lot of work to determine what memory is free, which can slow the program down.
read like blogspam to me (which it is).
For comparison's sake: a similar post from Twitch has a lot more technical detail and generally makes me view their team in a much better light than Discord's after reading both.
Really? My take was that it was. They mention a bunch of cached data that rarely got evicted (so not generating a lot of garbage) but that took a long time to traverse (so when GC DID occur, it took a long time), which implies that it was large.
> It's surprising they didn't test upgrading to 1.13.
It isn't surprising to me. It's stated elsewhere that they tried 4 different versions of Go, up through 1.10 apparently, and had performance problems with all of them. At some point you can't suffer garbage collector nonsense anymore, and since they'd already employed Rust on other services, they tried it here.
It worked on the first try.
That's not surprising either.
What would be surprising is if any of these "but version such-and-such is waaay better and they should just use that" claims actually panned out. The best case would be that the issue just manifests as some other garbage-collector-related performance problem. That's the deal you sign up for when you saddle yourself with a garbage collector.
It's still a huge whoosh. You're starting at 1.9 and testing 4 micro versions up to 1.10... what is the point of that? None of those non-major versions are going to significantly change how the GC works.
They could have tried 1 other version (not 4) and picked either the latest (1.13) or the version that contains the GC improvements (1.12) to test. Usually when you are looking to upgrade something you skim the release notes, so testing 1.12 or 1.13 is obvious, especially when 1.12 seems to specifically address their performance concern.
If upgrading something avoids a service rewrite, that is usually the way to go, unless you were looking for an excuse to rewrite the service in the first place, which may have been the case.
edit:
It turns out they did exactly what my comment stated: they tested the latest version (1.10). It's just that this article was published recently, but the events happened quite a while back.
Except that you literally said it wasn't surprising because GC sucks. You were "not surprised" in response to the assumption that they DIDN'T test the latest version. However, this was just a misunderstanding, and you're recasting your comment to make it seem like you were right all along. If you had known all along that they tested the latest version, then you couldn't have been surprised or unsurprised by something that didn't happen.
It seems like my comment was just an entry point for you to shit on GC, which, ironically, I mostly agree with in this context.
According to another comment, they did this back in May 2019, when 1.10 was the latest. They are only blogging about it now, which I guess is slightly unfortunate, but nevertheless.
Also note this was with Go 1.9. I know GC work was ongoing during that time; I wonder if this type of situation would still happen?