So you did automation in a broken way. Here's one way to avoid the issues you described on bare metal:
- Only get servers with IPMI so you can remotely reboot or power cycle them (see the IPMI sketch after this list).
- Have said servers netboot so they always run the newest OS image.
- Make sure said OS image has a sane config so you don't run out of inodes and logs get rotated (see the inode check below).
- Have the OS image include journalbeat to ship logs.
- Have your health checks trigger a recovery script that restarts or moves containers using one of a myriad of tools (sketch after the list); monitoring isn't exactly a new discipline.
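For illustration, a remote power cycle via IPMI is little more than a wrapper around ipmitool (Python here; the BMC address and credentials are obviously placeholders):

```python
# Power cycle a wedged server through its BMC, bypassing the (possibly hung) OS.
# Assumes ipmitool is installed; host and credentials are made-up placeholders.
import subprocess

def power_cycle(bmc_host: str, user: str, password: str) -> None:
    subprocess.run(
        ["ipmitool", "-I", "lanplus",
         "-H", bmc_host, "-U", user, "-P", password,
         "chassis", "power", "cycle"],
        check=True,
    )

power_cycle("10.0.0.42", "admin", "secret")
```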
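Catching full inodes before they bite is a few lines of statvfs (the 90% threshold is just an assumption for the sketch):

```python
# Warn before the inode table fills up; a full inode table breaks services
# in confusing ways even when df shows free space.
import os

def inode_usage(path: str = "/") -> float:
    st = os.statvfs(path)
    used = st.f_files - st.f_ffree  # total inodes minus free inodes
    return used / st.f_files

if inode_usage("/") > 0.9:
    print("WARNING: >90% of inodes in use on /")
```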
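And the recovery piece doesn't need to be clever either. A minimal sketch, assuming Docker and a /healthz endpoint (both names made up for the example); the same shape works with systemd units or any other runtime:

```python
# Poll an HTTP health endpoint; restart the container when it stops answering.
import subprocess
import time
import urllib.request

HEALTH_URL = "http://10.0.0.42:8080/healthz"  # hypothetical endpoint
CONTAINER = "api"                             # hypothetical container name

def healthy() -> bool:
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False

while True:
    if not healthy():
        subprocess.run(["docker", "restart", CONTAINER], check=False)
    time.sleep(30)
```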
Yes, it means you need a build process for OS images. Yes, it means you need to pick a monitoring system. And yes, it means you need to decide on a scheduling policy.
I wrote an orchestrator pre-K8S that was fewer lines of code than the YAML config for my home test K8S cluster. Writing a custom orchestrator is often not hard, depending on your workload; writing a generic one is.
K8S provides one opinionated version of what people build manually, and when it's a good fit, it's great. When it isn't, I all too often see people spend more time trying to figure out how to make it work for them than it would've taken them to do it from scratch.
I ran 1000+ VMs on a self-developed orchestration mechanism for many years and it was trivial. This isn't a hard problem to solve, though many of the solutions will end up looking similar to some of the decisions made for K8S. E.g. pre-K8S we ran with an overlay network like K8S, service discovery like K8S, and an Nginx-based ingress like many K8S installs. There's certainly a reason why K8S looks the way it does, but K8S also has to be generic, where you can often reasonably make other choices when you know your specific workload. To make the "not a hard problem" point concrete, the core of such an orchestrator is just a reconcile loop (sketch below).
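A toy version of that loop: compare desired state to what's running, fix the difference. The hosts, images, and ssh+docker transport are all assumptions for the sketch; a real version adds placement, locking, and backoff, but the core stays this small:

```python
# Minimal reconcile loop: ensure each named container runs on its assigned host.
import subprocess
import time

# Desired state: container name -> (host, image). All values are placeholders.
DESIRED = {
    "web-1": ("host-a", "registry.example/web:latest"),
    "web-2": ("host-b", "registry.example/web:latest"),
}

def running(host: str) -> set[str]:
    # List container names on the host over ssh.
    out = subprocess.run(
        ["ssh", host, "docker", "ps", "--format", "{{.Names}}"],
        capture_output=True, text=True, check=True,
    ).stdout
    return set(out.split())

def start(host: str, name: str, image: str) -> None:
    subprocess.run(
        ["ssh", host, "docker", "run", "-d", "--restart=always",
         "--name", name, image],
        check=False,
    )

while True:
    for name, (host, image) in DESIRED.items():
        if name not in running(host):
            start(host, name, image)
    time.sleep(15)
```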
And you don't think k8s made your life much easier?
For me it's now much more about proper platform engineering and giving teams more flexibility again, knowing that the k8s platform is significantly more stable than anything I have seen before.
No, I don't, for that setup. Trying to deploy K8S across what was a hybrid deployment spanning four countries and on-prem, colo, managed services, and VMs would've been far more effort than our custom system was (and the hardware complexity was dictated by cost; running on AWS would've bankrupted that company).