> We tried upgrading a few times. 1.8, 1.9, and 1.10. None of it helped. We made this change in May 2019. Just getting around to the blog post now since we've been busy.
> Another Discord engineer chiming in here. I worked on trying to fix these spikes on the Go service for a couple weeks. We did indeed try moving up to the latest Go at the time (1.10), but this had no effect.
> For a more detailed explanation, it helps to understand what is going on here. It is not the increased CPU utilization that causes the latency. Rather, it's that Go pauses the entire world for the length of the latency spike. During this time, Go has completely suspended all goroutines, preventing them from doing any work, which shows up as latency in requests.
> The specific cause of this seems to be that we used a large free-list-like structure: a very long linked list. The head of the list is maintained as a variable, which means that Go's mark phase must start scanning from the head and then pointer-chase its way through the list. For whatever reason, Go does (did?) this section in a single-threaded manner with a global lock held. As a result, everything must wait until this extremely long pointer chase completes.
> It's possible that 1.12 does fix this, but we had already tried upgrading a few times on releases that promised GC fixes and never saw a fix to this issue. I feel the team made a pragmatic choice to divest from Go after giving the language a fair shot at salvaging the project.
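To make the structure described above concrete, here is a minimal sketch in Go of a singly linked free list whose head is held in a package-level variable (the names are hypothetical, not Discord's actual code). Because that head is a GC root, the mark phase has to follow every `next` pointer in turn before the list is fully scanned:

```go
package main

// node is a hypothetical entry in the free list; freed buffers are pushed
// back onto the list instead of being released to the garbage collector.
type node struct {
	buf  []byte
	next *node // the pointer the mark phase must chase, node by node
}

// freeHead is a package-level root: the collector starts marking here and
// must traverse the entire chain to prove every node is still live.
var freeHead *node

// put returns a buffer to the free list.
func put(n *node) {
	n.next = freeHead
	freeHead = n
}

// get pops a buffer off the free list, or reports that it is empty.
func get() (*node, bool) {
	if freeHead == nil {
		return nil, false
	}
	n := freeHead
	freeHead = n.next
	n.next = nil
	return n, true
}

func main() {
	// With millions of entries, marking this list means millions of
	// dependent pointer loads, all rooted at a single variable.
	for i := 0; i < 1000000; i++ {
		put(&node{buf: make([]byte, 64)})
	}
	_, _ = get()
}
```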
The two main cases when linked lists are better are (a) when you want to guarantee a low cost per insert, since vector insertion is only O(1) amortized, and (b) when you want to insert in the middle, but have somehow already found that middle point without scanning the list.
Anyway, in this case, I guess they're using a free list because (1) it's simpler since you don't need an external collection keeping a list of unused stuff, and (2) reason (a) above.
Even if you need middle inserts but not a B-tree (weird), it’s still better to use a vector in most cases. Time to find the insertion point will dominate.
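To make the vector argument concrete, here is a hedged sketch (hypothetical names, not taken from the article) of the same pooling idea backed by a slice: entries live in one contiguous block and free slots are tracked as indices, so when the entries themselves contain no pointers there is nothing for the collector to chase node by node:

```go
package main

// entry is a hypothetical pooled object; it holds no pointers, so the GC
// does not need to scan inside it.
type entry struct {
	data [64]byte
	used bool
}

// pool is a slice-backed free list: entries live contiguously and free
// slots are tracked by index rather than by a chain of pointers.
type pool struct {
	entries []entry
	free    []int // stack of indices into entries; contains no pointers
}

// get hands out a free slot, growing the pool (amortized O(1)) if needed.
func (p *pool) get() int {
	if n := len(p.free); n > 0 {
		i := p.free[n-1]
		p.free = p.free[:n-1]
		p.entries[i].used = true
		return i
	}
	p.entries = append(p.entries, entry{used: true})
	return len(p.entries) - 1
}

// put returns a slot to the pool by pushing its index onto the free stack.
func (p *pool) put(i int) {
	p.entries[i].used = false
	p.free = append(p.free, i)
}

func main() {
	var p pool
	i := p.get()
	p.put(i)
}
```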
> ... but that statement there doesn’t say anything about the heap size, including the size and count of live objects (i.e., not garbage).
Not sure why you got downvoted; you're actually right and I'm wrong: I misread that and/or assumed one meant the other.
That said, this is a case that should be ideal for generational GC, which Go specifically eschewed at one point. I'm not sure whether that is still the case, however; I have yet to wade through this[1] to update my knowledge here.
This post needed a lot more depth for the reader to really understand what was going on. Statements like
> During garbage collection, Go has to do a lot of work to determine what memory is free, which can slow the program down.
read like blogspam to me (which it is).
For comparison's sake: a similar post from Twitch has a lot more technical detail and generally makes me view their team in a much better light than Discord's after reading both.
Really? My take was that it was. They mention a bunch of cached data that rarely got evicted (so not generating a lot of garbage) but that took a long time to traverse (so when GC DID occur, it took a long time), which implies that it was large.
> It's surprising they didn't test upgrading to 1.13.
It isn't surprising to me. It's stated elsewhere that they tried 4 different versions of Go, up through 1.10 apparently, and had performance problems with all of them. At some point you can't suffer garbage collector nonsense anymore, and since they'd already employed Rust on other services, they tried it here.
It worked on the first try.
That's not surprising either.
What would be surprising is if any of these "but version such-and-such is waaay better and they should just use that" claims actually panned out. The best case would be that the issue just manifests as some other garbage-collector-related performance problem. That's the deal you sign up for when you saddle yourself with a garbage collector.
It's still a huge whoosh. You're starting at 1.9 and testing 4 micro versions up to 1.10... what is the point of that? None of those non-major versions are going to significantly change how the GC works.
They could have tried 1 other version (not 4) and picked either the latest (1.13) or the version that contains the GC improvements (1.12) to test. Usually when you are looking to upgrade something you skim the release notes, so testing 1.12 or 1.13 is obvious, especially when 1.12 seems to specifically address their performance concern.
If upgrading something avoids a service rewrite, that is usually the way to go, unless you were looking for an excuse to rewrite the service in the first place, which may have been the case.
edit:
It turns out they did exactly what my comment stated: they tested the latest version (1.10). It's just that this article was published recently, but the events happened quite a while back.
Except that you literally said it wasn't surprising because GC sucks. You were "not surprised" in response to the assumption that they DIDN'T test the latest version. However, this was just a misunderstanding, and you're recasting your comment to make it seem like you were right all along. If you had known all along that they tested the latest version, then you couldn't have been surprised or unsurprised by something that didn't happen.
It seems like my comment was just an entry point for you to shit on GC, which, ironically, I mostly agree with in this context.
According to another comment, they did this back in May 2019, when 1.10 was the latest. They are only blogging about it now, which I guess is slightly unfortunate, but nevertheless.
Also note this was with Go 1.9. I know GC work was ongoing during that time; I wonder if this type of situation would still happen?