I'd still consider it as "performance issue", not "reliability issue". There is no service unavailability here. It just takes your system a minute longer until the target GPU capacity is available. Until then it runs on fewer GPU resources, which makes it slower. Hence performance.
The errors might be considered a reliability issue, but then again, errors are a very common thing in large distributed systems, and any orchestrator/autoscaler would just re-try the instance creation and succeed. Again, a performance impact (since it takes longer until your target capacity is reached) but reliability? not really
Iād like to see a breakdown of the cost differences. If the costs are nearly equal, why would I not choose the one that has a faster startup time and fewer errors?
With GCP you can right-size the CPU and memory of the VM the GPU is attached to, unlike the fixed GPU AWS instances, so there is the potential for cost savings there.
The errors might be considered a reliability issue, but then again, errors are a very common thing in large distributed systems, and any orchestrator/autoscaler would just re-try the instance creation and succeed. Again, a performance impact (since it takes longer until your target capacity is reached) but reliability? not really