We use this, pgbouncer, and a bash script to link the two, for completely automated failover.
When the DB goes down, queries issued through pgbouncer just pause, as if they were really, really slow; then, once pglookout performs the failover, the bash script switches pgbouncer's config and those pending queries are sent on immediately.
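Our actual glue script is a bit more involved, but conceptually the hook does something like the following; the paths, the port and the way the new primary's address gets passed in are all placeholders for illustration:

    #!/usr/bin/env bash
    # Minimal sketch of a failover hook: repoint pgbouncer at the promoted
    # standby and tell it to re-read its config. Queries that were paused
    # while the old primary was unreachable then flow to the new one.
    set -euo pipefail

    NEW_PRIMARY="$1"                              # address of the promoted standby
    PGBOUNCER_INI="/etc/pgbouncer/pgbouncer.ini"  # example path

    # Rewrite the host= part of the database definition pgbouncer proxies to.
    sed -i "s/host=[^ ]*/host=${NEW_PRIMARY}/" "$PGBOUNCER_INI"

    # pgbouncer's admin console (the special "pgbouncer" database, default port 6432)
    # accepts RELOAD, which makes it pick up the new config without dropping clients.
    psql -h 127.0.0.1 -p 6432 -U pgbouncer pgbouncer -c "RELOAD;"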
It's impossible to comment on the specific circumstances of the report, but we do generally monitor and warn our users when they're running low on resources, be it storage, CPU or memory. I admit the alerting is by no means foolproof; we may miss rapid or sudden changes in usage patterns, but it works quite well for steadier, more common workloads.
Should the worst-case scenario happen and the storage run out, there's always the option of upgrading to the next, larger resource tier. That restores the DB state up to the latest successfully recorded transaction.
We're using Kafka as a log delivery platform and are quite happy with it. Kafka is highly available by nature and scales quite trivially with the log volume by adding new cluster nodes.
We've decided to use journald for storing all of our application data. We pump the entries from journald to Kafka by using a tool that we open sourced: https://github.com/aiven/journalpump.
From Kafka, we export the logs to Elasticsearch for viewing and analysis. Some specific logs are also stored in S3 for long-term retention, e.g. for audit purposes.
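journalpump does the real work on that first leg (batching, retries, extra fields and so on), but the basic shape of journald-to-Kafka is roughly this; kafkacat here is purely for illustration, it's not what journalpump uses, and the broker and topic names are made up:

    # Conceptual sketch only: tail the journal as JSON and produce each entry
    # to a Kafka topic.
    journalctl -o json -f | kafkacat -P -b kafka-1.example.com:9092 -t logs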
Google recently added a beta version of Postgres to Cloud SQL as well (https://cloud.google.com/sql/docs/postgres/). Of course, AWS RDS and Heroku have been on the market for some time.
I'd claim the managed Postgres market is in pretty good shape.
While not Microsoft offerings, there are some options available: my company Aiven (https://aiven.io/postgresql) offers a managed PostgreSQL service on Azure as well. ElephantSQL (https://www.elephantsql.com/) is another provider with hosted Postgres on Azure.
Well, you can get hosted Kafka as a service as well. My company Aiven offers one such service on AWS, Azure, Google Cloud Platform, DigitalOcean and UpCloud (https://aiven.io/kafka). CloudKarafka is another vendor with a managed offering, and Heroku just launched a Kafka service as well.
I think Kinesis is an excellent service, but if you have an existing stack or a preference for open-source solutions, it's good to have a choice.
There are some indications that Google will support Postgres as part of Cloud SQL in the future. But there are already multiple DB-as-a-service offerings that provide PostgreSQL on Google Cloud. My company, Aiven (https://aiven.io), is one of the providers, and I believe Compose, DatabaseLabs and ElephantSQL also provide managed PG on Google Cloud.
Heroku is cool, but there are other options for managed Kafka too. Check out https://aiven.io/kafka and https://www.cloudkafka.com/ for two providers, starting at 3 brokers/90 GB for $200/mo and 1 broker/20 GB for $99/mo respectively.
It continues to run beautifully. Since we rolled it out back in 2015, we've had zero issues with real-time logging. I have particularly fond memories of the first week after rollout; it felt like a vacation, and I could finally get some sleep.
I'm in a similar boat. I'm hoping to propose Kafka to help with some data replication and consolidation tasks, but it has to be both on-premise and as low maintenance as possible (low maintenance in the sense of how much work local developers would have to do).
To anyone reading this with Kafka experience, do you have any tips/advice when it comes to maintaining a Kafka service?
Use configuration management such as Chef so you can quickly build new nodes and roll out changes across the cluster; you will need to make tweaks. The Chef Kafka cookbook that is the top result on Google has a means of coordinating broker restarts across the cluster; use Consul as the locking mechanism for this (see the sketch below). You could use ZooKeeper instead, but Consul works well for automatic DNS registration and service discovery.
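For example, a rolling restart wrapped in a Consul lock looks roughly like this; the lock prefix and service name are whatever fits your setup:

    # Only one broker at a time can hold the lock, so restarts roll through
    # the cluster instead of taking several brokers down at once.
    consul lock kafka/rolling-restart "systemctl restart kafka"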
Use Yahoo's kafka-manager app to manage the cluster and see what is going on.
Don't use the Kafka default of storing data under /tmp/; your OS will periodically clean it out.
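In other words, point log.dirs in server.properties somewhere persistent; the path below is just an example:

    # server.properties -- the default puts data under /tmp, which gets cleaned up
    log.dirs=/var/lib/kafka/data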
Five nodes kind of kills performance compared to three, and doesn't map well onto AWS, where you generally have 3 or 4 AZs. I tend to go with three, but make sure you've got fully automated responses to failures.
Zab [the distributed consensus algorithm that powers ZK] shares some similarities with Paxos, and requires a quorum of nodes to be online.
If you want highly available ZK, your choices are 3, 5, 7... nodes, which can tolerate 1, 2, or 3 nodes offline at any one time respectively.
If one node is fully down in a 3-node cluster and there is even a tiny network blip or partition (as often happens in cloud environments), then you are down.
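The arithmetic behind those numbers: a quorum is a strict majority, so an ensemble of n nodes needs floor(n/2)+1 of them up and can tolerate losing the rest. A quick sketch:

    # Quorum size and tolerated failures for common ZooKeeper ensemble sizes.
    for n in 3 5 7; do
      quorum=$(( n / 2 + 1 ))
      echo "$n nodes: quorum of $quorum, tolerates $(( n - quorum )) down"
    done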