
I'd still consider it a "performance issue", not a "reliability issue". There is no service unavailability here. It just takes your system a minute longer to reach the target GPU capacity. Until then it runs on fewer GPUs, which makes it slower. Hence performance.

The errors might be considered a reliability issue, but then again, errors are very common in large distributed systems, and any orchestrator/autoscaler would just retry the instance creation and succeed. Again, that is a performance impact (since it takes longer until your target capacity is reached), but a reliability one? Not really.
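
To make the retry point concrete, here is a rough sketch of what I mean, not any real orchestrator's code. `create_instance`, `CapacityError`, and the backoff numbers are all made up for illustration; a real autoscaler would call its cloud provider's API and handle that provider's actual error types.

```python
import time


class CapacityError(Exception):
    """Hypothetical error standing in for a failed instance-creation attempt."""


def create_with_retry(create_instance, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call create_instance() until it succeeds or attempts run out.

    Returns (instance, attempts_used). The delay doubles after each failure,
    so a transient error costs time but does not fail the scale-up.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return create_instance(), attempt
        except CapacityError:
            if attempt == max_attempts:
                raise  # only now does it surface as a hard failure
            sleep(base_delay * 2 ** (attempt - 1))
```

As long as one of the attempts succeeds, the caller only sees a later arrival of capacity, which is why I'd file it under performance.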



I’d like to see a breakdown of the cost differences. If the costs are nearly equal, why would I not choose the one that has a faster startup time and fewer errors?


With GCP you can right-size the CPU and memory of the VM the GPU is attached to, unlike AWS's fixed GPU instance types, so there is potential for cost savings there.



