This is a trend we noticed early last year, so we started building a single console for all these clouds at https://shadeform.ai.
It has been amazing to watch this industry explode, and we believe it is great for consumers. The same instance types are often around 3x more expensive on Amazon than on these alternative providers.
NVIDIA and many hardware providers are leaning into this trend. As clouds become more and more vertically integrated, AMD, NVIDIA, and others will benefit from spreading their hardware to more clouds.
Knowing that these models will not be running in 3 easily controlled clouds may also benefit us in the long run as each provider will have different levels of comfort with models of varying capabilities.
If you’re looking for a new GPU cloud because of this, we’ve put all these providers into one place, one console, and one API so you can find the right one for your needs at https://shadeform.ai.
I'm not GP but have been in this boat. We tried a number of different approaches, and kubernetes was by far the most successful. Terraform to provision the k8s clusters, workloads deployed to k8s. With OpenShift it gets even better, though I left the project before we finished implementation so I can't say how it went in prod. Early tests were very good though. If you're feeling bold you can do a "stretch" cluster which has nodes in different data centers (some on-prem for example, some in the cloud, or all in the cloud but different zones). The latency between the masters and nodes can cause problems though so I wouldn't separate them geographically very far.
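For the "workloads deployed to k8s" half of that setup, here's roughly what pushing one Deployment spec to several clusters looks like with the Python kubernetes client. This is a minimal sketch under my own assumptions: the kubeconfig context names, namespace, and image are placeholders for whatever your Terraform-provisioned clusters actually register.

    # Sketch: apply the same GPU Deployment to multiple clusters, one kubeconfig context each.
    # Context names, namespace, and image below are placeholders, not anything from the original setup.
    from kubernetes import client, config

    CONTEXTS = ["onprem", "gpu-cloud-a"]  # one kubeconfig context per cluster

    def gpu_deployment(name: str = "trainer") -> client.V1Deployment:
        container = client.V1Container(
            name=name,
            image="registry.example.com/trainer:latest",  # placeholder image
            resources=client.V1ResourceRequirements(limits={"nvidia.com/gpu": "1"}),
        )
        return client.V1Deployment(
            api_version="apps/v1",
            kind="Deployment",
            metadata=client.V1ObjectMeta(name=name),
            spec=client.V1DeploymentSpec(
                replicas=1,
                selector=client.V1LabelSelector(match_labels={"app": name}),
                template=client.V1PodTemplateSpec(
                    metadata=client.V1ObjectMeta(labels={"app": name}),
                    spec=client.V1PodSpec(containers=[container]),
                ),
            ),
        )

    for ctx in CONTEXTS:
        # new_client_from_config returns an ApiClient bound to one kubeconfig context
        api_client = config.new_client_from_config(context=ctx)
        apps = client.AppsV1Api(api_client=api_client)
        apps.create_namespaced_deployment(namespace="default", body=gpu_deployment())
        print(f"deployed to {ctx}")

In practice we kept the cluster provisioning in Terraform and the workload specs in version control, so the per-cluster differences stayed in kubeconfig/contexts rather than in the manifests themselves.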
We have aggregated most cloud GPU providers into a single platform, so it is easier to find the machines you need. You can check live availability and prices at https://shadeform.ai
Feel free to email me at ed at shadeform dot ai.