Re. handling failure, we leave that as an application/framework-layer decision. When the backend is used for program state, the common approach is an auto-save loop that asynchronously persists state to external storage at a regular interval. If the backend is only used in a read-only way, the approach is to simply recreate it on failure with the same parameters.
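A minimal sketch of that auto-save pattern in TypeScript, assuming a hypothetical `persistSnapshot` function and storage URL (the external store could be S3, a database, whatever the application already uses):

```typescript
// Periodically serialize in-memory state and persist it outside the backend.
type Snapshot = { version: number; payload: string };

// Hypothetical persistence call; replace with a real upload to external storage.
async function persistSnapshot(snapshot: Snapshot): Promise<void> {
  await fetch("https://storage.example.com/snapshots", {
    method: "PUT",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(snapshot),
  });
}

function startAutoSave(getState: () => unknown, intervalMs = 5_000) {
  let version = 0;
  let saving = false;

  const timer = setInterval(async () => {
    if (saving) return; // skip this tick if the previous save is still in flight
    saving = true;
    try {
      await persistSnapshot({ version: ++version, payload: JSON.stringify(getState()) });
    } catch (err) {
      console.error("auto-save failed; will retry on next tick", err);
    } finally {
      saving = false;
    }
  }, intervalMs);

  return () => clearInterval(timer); // call on shutdown to stop the loop
}
```

On failure, the replacement backend would load the most recent snapshot before accepting traffic again.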
In general, Plane backends are meant to be used with thick clients, so there’s also the option to treat clients as nodes in a distributed system for the purposes of failure recovery. If the server goes down and is replaced, the clients can buffer messages during the outage and replay any that may have been lost once the replacement comes up. Over time, as we see patterns emerge, we may create frameworks from them (like aper.dev) to abstract that burden away from the application layer.
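A rough sketch of what that client-side buffer-and-replay could look like, assuming a WebSocket transport and an application-level ack message; the message shape and ack protocol here are assumptions for illustration, not Plane’s API:

```typescript
// Keep every outgoing message in an outbox until the server acknowledges it;
// on (re)connect, replay whatever is still unacknowledged.
interface OutgoingMessage { seq: number; body: unknown }

class ReplayingClient {
  private outbox: OutgoingMessage[] = [];
  private seq = 0;
  private ws!: WebSocket;

  constructor(private url: string) { this.connect(); }

  private connect() {
    this.ws = new WebSocket(this.url);
    // On connect, resend everything the server has not acked yet.
    this.ws.onopen = () => this.outbox.forEach(m => this.ws.send(JSON.stringify(m)));
    this.ws.onmessage = (ev) => {
      const msg = JSON.parse(ev.data);
      if (msg.type === "ack") {
        // Drop acknowledged messages from the buffer.
        this.outbox = this.outbox.filter(m => m.seq > msg.seq);
      }
    };
    // Simple fixed-delay reconnect; real code would back off.
    this.ws.onclose = () => setTimeout(() => this.connect(), 1_000);
  }

  send(body: unknown) {
    const msg = { seq: ++this.seq, body };
    this.outbox.push(msg);
    if (this.ws.readyState === WebSocket.OPEN) this.ws.send(JSON.stringify(msg));
  }
}
```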
Time series metrics are exposed through Docker’s API, and collectors for it already exist for various sinks. We will soon be sending some time series metrics over NATS to use internally for scheduling, but the Docker API will be better for external consumption because the collector ecosystem there is already robust.
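For example, the per-container stats those collectors scrape can be read straight from the Docker Engine API; a quick Node/TypeScript sketch against the local socket (the container ID is a placeholder):

```typescript
// Pull one point-in-time stats sample for a container from the Docker Engine API
// over the local Unix socket. CONTAINER_ID is a placeholder.
import * as http from "node:http";

const CONTAINER_ID = "abc123";

const req = http.request(
  {
    socketPath: "/var/run/docker.sock",
    path: `/containers/${CONTAINER_ID}/stats?stream=false`,
    method: "GET",
  },
  (res) => {
    let body = "";
    res.on("data", (chunk) => (body += chunk));
    res.on("end", () => {
      const stats = JSON.parse(body);
      // memory_stats and cpu_stats are part of Docker's documented stats payload.
      console.log("memory usage (bytes):", stats.memory_stats.usage);
      console.log("total CPU usage (ns):", stats.cpu_stats.cpu_usage.total_usage);
    });
  }
);
req.end();
```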
Resource caps can be defined at “spawn time”. Instances are not expected to have similar consumption, but the scheduler is not yet very smart; our current approach is admittedly to overprovision. The scheduler is a big Q4 priority for us.
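Since backends run as Docker containers (per the metrics answer above), spawn-time caps presumably bottom out in the resource limits the Docker Engine API accepts at container-create time. A sketch of that layer only, not Plane’s own spawn request (image name and values are placeholders):

```typescript
// Illustrative only: HostConfig.Memory, NanoCpus, and PidsLimit are documented
// Docker Engine API fields for capping a container at create time.
import * as http from "node:http";

const createBody = JSON.stringify({
  Image: "my-session-backend:latest",   // placeholder image name
  HostConfig: {
    Memory: 512 * 1024 * 1024,          // hard memory cap: 512 MiB
    NanoCpus: 1_000_000_000,            // 1.0 CPU, in units of 1e-9 CPUs
    PidsLimit: 256,                     // cap on processes inside the container
  },
});

const req = http.request(
  {
    socketPath: "/var/run/docker.sock",
    path: "/containers/create?name=session-backend-example",
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "Content-Length": Buffer.byteLength(createBody),
    },
  },
  (res) => {
    let body = "";
    res.on("data", (c) => (body += c));
    res.on("end", () => console.log(res.statusCode, body));
  }
);
req.write(createBody);
req.end();
```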
Draining currently involves terminating the “agent” process on the node, which stops the drone from advertising itself to the controller, so no new backends get scheduled on it; traffic still gets routed to backends already running on that drone. We have an open issue[1] to implement a message to do this automatically.
> We will soon be sending some time series metrics over NATS to use internally for scheduling.
For what purpose?
> Re. handling failure
There are several operations that should be near-seamless and very well thought out (or at least handled reasonably), including for the pieces of Plane itself:
Push new code
Roll back code
Push Canary code
shutdown -r now
add machine
remove machine
cluster wide restart
And for persistent data:
replace master
add replica
backup / restore from backup
Just about any product is going to have to implement some version of those operations, so there should be a very well-thought-out story for each of them under whatever conditions prevent the standard architectures from being used.
A single point of failure is an extremely convenient architecture, but it is also a brittle, pain-causing one that resists scaling, and the cleanness of these operations is probably the best window through which to assess that.
As far as the architecture itself goes, why choose DNS rather than header-based information/cookies? Why let the client choose the backend rather than hiding that as an infrastructure-side implementation detail?
Does the service expose time series metrics?
How would I detect and remedy a hot shard?
Are resource caps well defined beforehand, and are all instances expected to have similar resource consumption?
What would administratively draining an instance look like?