We run migrations across 2,800 microservices (monzo.com)
29 points by willsewell on Aug 27, 2024 | 12 comments


Or as I call it "death by a thousand microcuts".

I always wonder why some (most) banks are proud of being reckless... oh well, it keeps me well paid.

Also, Monzo decided to remove the "dark mode" option back in the day. When I wrote to them about it ("please return it as optional - as it already was"), they responded with a polite "nope, suck it up". My next message to them was to close my account. Well... "nope, suck it up" right back at you.


If you see microservices as (a domain boundary + an atomic infrastructural unit), maybe the latter is debatable, but doesn't the application domain need 2,800 boundaries anyway? Especially for such a large company, and especially for a company operating in financial services.


I'm 'old-school': sometimes (and in some industries where regulations/audits will ride you HARD) it is better to have a 'core banking system' (with all its pros and cons) instead of 2,800 microservices. Good luck auditing those! (And since Monzo is a bank, they've either got 1,000 internal auditors doing IT audits, or they are faking everything they do - and I won't even get into the discussion of their external auditors.)

Anyway, since I don't like them, one may say that I'm negatively biased, but still: what does their Audit Universe look like?


I like Monzo as a bank, I think what they are doing is pretty cool.

But it all still feels very amateur-ish, especially for a bank. Something as simple as generating a proof-of-payment receipt for a bank transfer - why is this not possible? It feels incredibly unprofessional to send a screenshot of a mobile app to a company because your bank doesn't let you properly export a PDF for a single transaction.


What constitutes a microservice when you have 2,800 of them? Are these individual lambdas for each endpoint and background task, or something?


I think we've gone back and forth on this over the years. The rate of new services is decreasing, so I think we have shifted from lots of very small services (low thousands of lines of code) to bigger ones that are more like an entire product (e.g. 100k+ LoC). But I wouldn't be surprised if the pendulum swings back again in the future - there are downsides to the larger services, like greater contention with other engineers.


2,800 is not even that many depending on the company size. At my current employer I'd guess we have somewhere between 2x and 3x that, and those are more complex services than a simple lambda + endpoint each. Just my user, with its groups, gives me ownership of some 200+ components.

And those are only backend microservices, not counting data pipelines and other supporting applications.

What constitutes a microservice is a philosophical question, closely tied to the company's culture: some companies will prefer very small, single-use-case services, while others will develop microservices that support a whole isolated piece of functionality (with business logic) to be re-used across the stack. There's no single definition applicable to such a variety of architectures.


That's what I've seen elsewhere, at least.


There was a previous discussion of our microservices architecture here: https://news.ycombinator.com/item?id=22725989.


> it would require a lot of effort to update all call sites, and in some cases the benefit of the new API was minimal. By wrapping the old library it meant we could choose to keep the interface similar to the old library in these cases, making it easier to update call sites.

Doesn't wrapping the old library require a lot of effort to update all call sites?

If this is supposed to be general advice about libraries... does this mean wrap all libraries? Does not sound like a good idea to me.


I think this depends on the library. There are some libraries that are just used in a handful of services, and in those cases I don't think that wrapping is worth the overhead.

However, for some of the core platform functionality depended on by all services, we've seen a lot of value in wrapping, because:

- We can provide a more opinionated interface (a third-party library is typically more flexible because it has vastly more variety in its use cases)

- We can hook in our own instrumentation easily (e.g. metrics, logging) or change functionality with config/feature flags

- We can transparently change the implementation in the future
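
For what it's worth, here's a minimal sketch of that pattern in Go. Everything in it (the kvwrap package, the Store interface, the stub clients, the useNewLibrary flag) is hypothetical, not Monzo's actual code. The point is the shape: a narrow in-house interface, an instrumentation decorator, and a constructor that can swap implementations without touching call sites.

    package kvwrap

    import (
        "context"
        "fmt"
        "log"
        "time"
    )

    // Store is the narrow, opinionated interface we expose to services.
    // It is deliberately smaller than the library underneath, so call
    // sites never notice when the implementation changes.
    type Store interface {
        Get(ctx context.Context, key string) (string, error)
        Put(ctx context.Context, key, value string) error
    }

    // instrumented decorates any Store with in-house metrics/logging.
    type instrumented struct{ inner Store }

    func (s instrumented) Get(ctx context.Context, key string) (string, error) {
        start := time.Now()
        v, err := s.inner.Get(ctx, key)
        log.Printf("kv.get key=%q dur=%v err=%v", key, time.Since(start), err)
        return v, err
    }

    func (s instrumented) Put(ctx context.Context, key, value string) error {
        start := time.Now()
        err := s.inner.Put(ctx, key, value)
        log.Printf("kv.put key=%q dur=%v err=%v", key, time.Since(start), err)
        return err
    }

    // oldClient stands in for the old third-party library; in reality it
    // would delegate to the real client.
    type oldClient struct{ data map[string]string }

    func (c *oldClient) Get(ctx context.Context, key string) (string, error) {
        v, ok := c.data[key]
        if !ok {
            return "", fmt.Errorf("key %q not found", key)
        }
        return v, nil
    }

    func (c *oldClient) Put(ctx context.Context, key, value string) error {
        c.data[key] = value
        return nil
    }

    // newClient stands in for the replacement library; here it just
    // reuses the stub behaviour.
    type newClient struct{ oldClient }

    // New picks the implementation from config, so the swap is invisible
    // to every call site.
    func New(useNewLibrary bool) Store {
        if useNewLibrary {
            return instrumented{inner: &newClient{oldClient{data: map[string]string{}}}}
        }
        return instrumented{inner: &oldClient{data: map[string]string{}}}
    }

A service then calls kvwrap.New(...) once at startup and programs against Store everywhere else, so replacing the underlying library becomes a config change rather than a fleet-wide call-site migration.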


The whole idea of (reliably) deploying and rolling back without downtime doesn't get nearly enough attention on HN, meme-worthy or otherwise. It's quite complicated, and it depends entirely on a number of variables (specifically: how you do everything). I once wrote an internal paper, probably 30 pages, just to explain why we couldn't do automatic rollbacks.

The most important parts of such a system (the ones mentioned in this post, anyway) don't get nearly enough attention:

- "centrally driven migrations": In any distributed service architecture, there are always too many interdependent pieces. You can't reliably touch thing A without also touching things B, C, D, etc. If you want any chance of automation or responding to failure without downtime, you must have a system which is aware of the changing state of everything and can change all the parts at a whim.

- "database migrations": This is again very complicated and depends on how your code and database are architected. You literally can't do migrations if your code and schema aren't set up right, and if you don't make the right kind of changes. How do you do this? Time to write a book...

- "wrap the old library": I can't remember what this is called, but it has a name. Anyway, the idea is hiding any change behind what is effectively a feature flag wrapper allows you to deploy the change without it being enabled, use the feature flag to test the change in production (on only one rest, on a percentage of requests, on one whole node/pod, etc), and then delete the old code eventually. This isn't just for features; you can replace entire interfaces, software stacks, whole systems this way, either piecemeal or entirely. Very powerful, but again, requires a specific approach not only in implementation but in use.

- "use automated rollback checks": What kind of checks? Checking what? In what way? At what time/stage? What happens when one fails? Do you do them in series or parallel? Can you do them in series or parallel? etc

- "deploy least critical services first": With enough interdependent services, you're going to hit cases where you have to upgrade parts B and C effectively simultaneously before you can upgrade A, etc. So for "no downtime", it will take a lot of coordination, and very explicit linkage and checking of specific new services, etc. There are ways to do this, but it's specific to your implementation and services, so this is another example of how you have to know exactly what's going on, and then set up the deployment to account for your specific dependency tree and how they react when they're run.

So many people I've run into don't think about any of these things. They literally say things like "automated rollbacks are easy, we did it at XYZ place", as if none of the above matters at all. They stick their heads in the sand because they want to believe that it should be easy. But any engineer worth their salt will tell you that doing it correctly and reliably is bloody complicated.




