Keeping GC off for a long-running service might become problematic. Also, the steady state might have few allocations, but startup may produce a lot of garbage that you'd want reclaimed. I've never done this, but you can also turn GC off at runtime with debug.SetGCPercent(-1).
I think with that, you could turn off GC after startup, then turn it back on at desired intervals (e.g. once an hour or after X cache misses).
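A minimal sketch of that pattern, assuming you force collections manually with runtime.GC() while the automatic collector is off (warmCaches and serve are placeholders for the real startup and request-handling code):

```go
package main

import (
	"runtime"
	"runtime/debug"
	"time"
)

func main() {
	// Allocation-heavy startup work (cache warming, config parsing, etc.).
	warmCaches()

	// Collect the startup garbage once, then disable the automatic GC.
	runtime.GC()
	debug.SetGCPercent(-1)

	// Force a collection on a schedule so garbage from the steady state
	// (or from an unexpected burst) still gets reclaimed eventually.
	go func() {
		for range time.Tick(time.Hour) {
			runtime.GC()
		}
	}()

	serve()
}

// Placeholders for the real service code.
func warmCaches() {}
func serve()      { select {} }
```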
It's definitely risky though. E.g. if there is a hiccup with the database backend, the client library might suddenly produce more garbage than normal, and all instances might OOM near the same time. When they all restart with cold caches, they might hammer the database again and cause the issue to repeat.
CloudFront, for this reason, allocates heterogeneous fleets in its PoPs which have different RAM sizes and CPUs [0], and even different software versions [1].
> When they all restart with cold caches, they might hammer the database again and cause the issue to repeat.
Reminds me of the DynamoDB outage of 2015 that essentially took out us-east-1 [2]. Also, ELB had a similar outage due to an unending backlog of work [3].
Someone must write a book on design patterns for distributed system outages or something?
Google's SRE book covers some of this (if you aren't cheekily referring to that). E.g. chapters 21 and 22 are "Handling Overload" and "Addressing Cascading Failures". The SRE book also covers mitigation by operators (e.g. manually setting traffic to 0 at the load balancer and ramping back up, or manually adding capacity), but it also talks about engineering the service to withstand overload in the first place.
This is definitely a familiar problem if you rely on caches for throughput (I think caches are most often introduced for latency, but eventually the service gets sized to its traffic with the cache in place and unintentionally comes to depend on it for throughput). You can e.g. pre-warm caches before accepting requests, or load-shed. Load-shedding is really good and more general than pre-warming, so it's probably a great idea to deploy it throughout the service anyway. You can also load-shed on the client, so servers don't even have to accept, shed, and then close a bunch of connections.
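A rough sketch of server-side load shedding, assuming a simple concurrency cap (the limit of 100 and the handler here are illustrative):

```go
package main

import "net/http"

// shedder rejects requests once more than maxInFlight are already being
// handled, so the requests we do accept are served at full speed instead
// of everything queueing up and slowing down together.
func shedder(maxInFlight int, next http.Handler) http.Handler {
	sem := make(chan struct{}, maxInFlight)
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case sem <- struct{}{}:
			defer func() { <-sem }()
			next.ServeHTTP(w, r)
		default:
			// Over capacity: fail fast rather than queue.
			http.Error(w, "overloaded", http.StatusServiceUnavailable)
		}
	})
}

func main() {
	ok := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})
	http.ListenAndServe(":8080", shedder(100, ok))
}
```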
The more general pattern behind load-shedding is to make sure you handle a subset of the requests well instead of degrading all requests equally. E.g. processing incoming requests FIFO means that as queue sizes grow, all requests become slower. Using LIFO lets some requests stay just as fast while the rest time out.
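A rough sketch of that LIFO idea, assuming each request carries a deadline (the request type and its fields are made up for illustration):

```go
package lifo

import (
	"errors"
	"sync"
	"time"
)

type request struct {
	payload  string
	deadline time.Time
}

// lifoQueue hands workers the most recently enqueued request first. Under
// overload, fresh requests are still served quickly while older ones sink
// to the bottom and get dropped once they miss their deadline.
type lifoQueue struct {
	mu    sync.Mutex
	items []request
}

func (q *lifoQueue) push(r request) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.items = append(q.items, r)
}

// pop returns the newest request that hasn't already missed its deadline.
func (q *lifoQueue) pop() (request, error) {
	q.mu.Lock()
	defer q.mu.Unlock()
	for n := len(q.items); n > 0; n = len(q.items) {
		r := q.items[n-1]
		q.items = q.items[:n-1]
		if time.Now().Before(r.deadline) {
			return r, nil
		}
		// Expired: skip it rather than waste work on a caller that gave up.
	}
	return request{}, errors.New("queue empty")
}
```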
I've read the first SRE book, but having worked on large-scale systems myself, I'd say it's impossible to relate to it or internalise the advice and processes it outlines unless you've been burned by scale.
So other comments didn't mention this per se, but Go gives you tools to see which allocations escape the stack and end up heap-allocated. If you work to ensure things stay stack-allocated, they get freed when the stack frame does, and the GC never touches them.
But, per other comments, there isn't any direct malloc/free behavior. Go just gives you tools to help you write code where the compiler can determine the GC isn't needed for some allocations.
More details here: https://golang.org/pkg/runtime/
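For example, building with go build -gcflags='-m' prints the compiler's escape-analysis decisions. A small sketch of a value that stays on the stack versus one that escapes to the heap:

```go
package main

// Build with: go build -gcflags='-m'
// The compiler reports which values escape to the heap.

// stackOnly's buffer never outlives the function, so it can stay on the
// stack and the GC never sees it.
func stackOnly() int {
	var buf [64]byte
	buf[0] = 1
	return int(buf[0])
}

// escapes returns a pointer to its slice, so the allocation must outlive
// the frame and is moved to the heap, where the GC manages it.
func escapes() *[]byte {
	buf := make([]byte, 64)
	return &buf
}

func main() {
	_ = stackOnly()
	_ = escapes()
}
```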