
That’s ok until you reach the logical IO and compute capacity of off-the-shelf PC hardware. Also, 20 years of insane coupling and complexity makes it pretty difficult to move away from that architecture, which is where most companies seem to end up, in my experience.

Also it’s a complete fallacy that “if we’re successful or grow large enough, we’ll just rewrite it”.

And then there are the logistical problems of managing complexity, distributing work to development teams, and even things like determining change impact, which become terribly difficult.

I say we should design with distribution in mind but deploy without it. Assuming you can scale up forever is an expensive and stupid mistake.

Incidentally, it turns out that even adding some deferred processing (in our case SQS + workers) to a simple LOB system brings nasty problems like clock skew and QoS into the picture, which are very much distributed-systems pains :)
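
To make that concrete, here's a rough sketch of the kind of worker where this shows up (the queue URL and the staleness rule are hypothetical; assumes boto3 against a standard SQS queue). The moment you compare a timestamp stamped on another machine against the worker's local clock, you've inherited a clock-skew problem:

    import time
    import boto3

    sqs = boto3.client("sqs", region_name="us-east-1")
    # Hypothetical queue URL, purely for illustration:
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/example-queue"
    STALE_AFTER_SECONDS = 30  # assumed QoS rule: skip work older than this

    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL,
            AttributeNames=["SentTimestamp"],
            MaxNumberOfMessages=10,
            WaitTimeSeconds=20,
        )
        for msg in resp.get("Messages", []):
            # SentTimestamp was stamped when the message was enqueued, by a
            # machine whose clock is not this worker's clock.
            sent_ms = int(msg["Attributes"]["SentTimestamp"])
            age_seconds = time.time() - sent_ms / 1000.0  # local clock here
            # A few seconds of skew either way and fresh jobs look stale (and
            # get dropped) or stale jobs look fresh (and get processed anyway).
            if age_seconds <= STALE_AFTER_SECONDS:
                pass  # do the actual work here
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])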



> 20 years of insane coupling and complexity makes it pretty difficult to move away from that architecture, which is where most companies seem to end up, in my experience.

Pretty much. I don't want to generalise about legacy contexts, however.

> I say we should design with distribution in mind but deploy without it.

Oh yes, absolutely. You can pretty reasonably predict when your compute needs will scale up such that you genuinely need to make the shift, and plan for it with a modular architecture.
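
One loose sketch of what that modular approach can look like (all the names here are hypothetical, not anything from the thread): hide task dispatch behind a small interface, ship the in-process implementation on day one, and swap in a queue-backed one only when the numbers say you actually need the distribution.

    from typing import Callable, Dict, Protocol

    class TaskDispatcher(Protocol):
        def submit(self, task_name: str, payload: dict) -> None: ...

    class InProcessDispatcher:
        """Day-one deployment: run the handler right here, in the same process."""

        def __init__(self, handlers: Dict[str, Callable[[dict], None]]) -> None:
            self._handlers = handlers

        def submit(self, task_name: str, payload: dict) -> None:
            self._handlers[task_name](payload)

    class QueueBackedDispatcher:
        """Later deployment: hand the task to a broker (SQS, RabbitMQ, ...).
        Callers never change; only the wiring at startup does."""

        def __init__(self, enqueue: Callable[[str, dict], None]) -> None:
            self._enqueue = enqueue

        def submit(self, task_name: str, payload: dict) -> None:
            self._enqueue(task_name, payload)

    # Application code depends only on the interface:
    def place_order(dispatcher: TaskDispatcher, order: dict) -> None:
        # ... persist the order ...
        dispatcher.submit("send_confirmation_email", {"order_id": order["id"]})

The point isn't this particular pattern, just that the day you outgrow one box, the change is confined to the wiring rather than to twenty years of call sites.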

> Assuming you can scale up forever is an expensive and stupid mistake.

See here's the thing. I think it has been, in the past.

I'm sure there will be applications in the future that will scale past a single-leader, multi-node setup.

What I've seen in the last few years is that we just never had those demands (except for deep learning).

I wasn't working for a small business either, but at a genuine mega-corp.

The total production data inflows and egresses of my mega-corp never peaked past, I think, 200 MB/s in any year (for business-critical systems; there was some user-facing video that was jettisonable). The daily peak was far lower than that.

All production needs, bandwidth and compute alike, were DWARFED by employees on Zoom and YouTube.

The sum total of all proprietary OLTP data across the company? And we had roughly 14 different legacy proprietary business systems from acquisitions. We had COBOL, we had DB2, we had it all.

The architect of our consolidation put it at less than 5 TB (excluding photographs, backups, duplication, etc.).

40+ years of business critical OLTP data. And all the analytics was done on trailing views of the replicas.

Given the direction of Moore's law, I can safely say that for my former multi-billion-dollar employer, single-machine hardware growth is going to outpace any of our business requirements.

(Except for photographs. But all our deep learning was being done by a spin-out.)


One size doesn’t fit all. Wait until you’re bought by another multi-billion-dollar employer and expected to scale out to their workload…


Oh definitely. I'm not saying it doesn't happen.


If you have engineers and manpower good enough for distributed systems, you can also use them to optimize your systems enough to fit on a single machine again, if you are not a billion-dollar company.

Distributed systems are exciting and interesting, but currently not worth the investment in 99% of cases. Things might change in a few decades, though.


Generally speaking, the second you're running tricks to "jam" something that's overly large into a single machine, yeah you should just give up.

The issue is when you reach a scale of compute needs such that you cannot "jam" all your requirements into a single-leader, multi-node cluster.

The amount of data you'll need for THAT is becoming unimaginable.

64-core/128-thread leader nodes have outrageous potential to scale.

That said, that doesn't quite apply to my Postgres example (barring some nifty DBA tricks and third party extensions).

But you can appreciate how much data you're going to need to blow past what 128 threads can do for you.
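
Back-of-envelope, with made-up but fairly pessimistic numbers (assumes every query pins a thread for its whole duration and ignores contention and I/O waits), just to show the order of magnitude:

    THREADS = 128
    AVG_QUERY_MS = 5  # assumed: a fairly heavy OLTP query
    qps_per_thread = 1000 / AVG_QUERY_MS   # 200 queries/sec per thread
    total_qps = THREADS * qps_per_thread   # 25,600 queries/sec
    per_day = total_qps * 86_400           # ~2.2 billion queries/day
    print(f"{total_qps:,.0f} queries/sec, ~{per_day:,.0f} queries/day")

Even at 5 ms a query that's on the order of two billion queries a day before you run out of threads, which is why a single box carries most businesses much further than people expect.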


Actually, 20 years of shitty code built on assumptions like this can break an AWS 96-core instance running SQL Server. That’s my life.

It does, however, shift 10k shitty queries per second, which is impressive.


Been there done that. That gave us two years and cost a year of developer time.



