Given the architecture demands and shortcomings noted in the article, it would likely be more efficient to cut out the overhead invoked by Cloud Run and AI Platform and just run everything in Kubernetes Engine (backed by Knative for Cloud Run-esque autoscaling). This would also solve the latency and scale-to-minimum size issues, and likely be cheaper in the long term at the cost of a bit more configuration to get it started.
Thanks for the tips. Having everything colocated in a k8s cluster will definitely help with latency and probably overall infra cost, but it will be at the expense of engineering time spent running a k8s cluster in prod.
Fingers crossed that we keep growing, which would mean we can justify working on a v2 architecture.
It probably would - however we built this out with a team of 1-2 engineers, so paying a bit more and dealing with maybe 20 seconds of extra latency a day is a price we are willing to pay for expedited/easier deployment.
Thanks for the suggestion though! I'm sure we will consider that in the future :)
Can you elaborate on what changed from one architecture to the other regarding Custom Models? It feels like it would still be easier to do that on AWS, since you control more of the stack. Isn't that the case?
While AWS does allow more hands-on control of the model stack, our "Custom Models" are more just different sets of weights using the exact same methodology. Basically, each customer that creates their own custom model is plugging into our existing framework, but with a different configuration.
Because of this, GCP's AI Platform allows us a more micro-services-style approach to interacting with the ML models themselves - as opposed to our previous deployment strategy on AWS, which put all of the models into one big bucket on every instance that was serving requests.
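Roughly speaking, the calling pattern could look like the sketch below (illustrative only; the project, model names, and input fields are placeholders, not their actual setup). The point is that every customer's "custom model" is just a differently named deployment hit through the same AI Platform online prediction call:

```python
# Hypothetical sketch: each customer's "custom model" is the same framework
# with different weights/config, addressed by name on AI Platform.
from googleapiclient import discovery

PROJECT = "my-gcp-project"  # placeholder

def predict(model_name, instances, version=None):
    """Call the AI Platform online prediction endpoint for one model."""
    service = discovery.build("ml", "v1")
    name = f"projects/{PROJECT}/models/{model_name}"
    if version is not None:
        name = f"{name}/versions/{version}"
    response = (
        service.projects()
        .predict(name=name, body={"instances": instances})
        .execute()
    )
    if "error" in response:
        raise RuntimeError(response["error"])
    return response["predictions"]

# Each customer maps to their own model/version, but the calling code is identical:
# predict("customer_acme_model", [{"input": [0.1, 0.2, 0.3]}])
```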
Setting aside the scaling issue with the v0 AWS architecture for a moment: would it be right to say that v0 issues like excessive load times were solved by decoupling the models from the Flask app (e.g. making batch prediction calls to each necessary model for the current request), rather than by changing the v0 architecture itself?
Was it that hard to make batch prediction calls to each necessary model for the current request on AWS?
Making the calls to the corresponding model from Flask was actually easier on AWS, since the models were loaded into memory. Unfortunately, the scaling issues/excessive load times were a big enough problem that we had to make the switch, as our number of hosted models continues to grow.
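For context, the v0-style setup is conceptually something like this sketch (illustrative only; model names and paths are placeholders): every model lives in memory inside the Flask process, so per-request calls are cheap, but startup time and instance size grow with the number of models.

```python
import numpy as np
import tensorflow as tf
from flask import Flask, jsonify, request

app = Flask(__name__)

# All models loaded at boot -- this is what makes cold starts slow and
# the instance big as the model count grows.
MODELS = {
    "model_a": tf.keras.models.load_model("models/model_a"),
    "model_b": tf.keras.models.load_model("models/model_b"),
    # ... dozens more in the scenario described above
}

@app.route("/predict/<model_name>", methods=["POST"])
def predict(model_name):
    model = MODELS[model_name]
    instances = np.array(request.get_json()["instances"])
    preds = model.predict(instances).tolist()
    return jsonify({"predictions": preds})
```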
We supported batch API calls in v0. But as those API calls increased, a new instance would get spun up, and its boot time was long. To get around it, we would have to keep more instances running all the time, which obviously costs more money.
I am assuming you are talking about our deployment on GCP Cloud Run? We have thought about sending a heartbeat API call. If we notice any user experience friction because of this lag, then we will definitely do that. As we said in the blog, this has not been a major pain point as of today.
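A heartbeat like that could be as simple as the sketch below (the service URL, endpoint, and interval are placeholders; a Cloud Scheduler job hitting a lightweight endpoint would work just as well as a script):

```python
# Hypothetical keep-warm ping for a Cloud Run service to avoid cold starts.
import time
import requests

SERVICE_URL = "https://my-service-abc123-uc.a.run.app/healthz"  # placeholder

def keep_warm(interval_seconds=300):
    while True:
        try:
            resp = requests.get(SERVICE_URL, timeout=10)
            print(f"heartbeat: {resp.status_code}")
        except requests.RequestException as exc:
            print(f"heartbeat failed: {exc}")
        time.sleep(interval_seconds)

if __name__ == "__main__":
    keep_warm()
```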
Totally. We did not have a need for custom models initially. We could load all our models on one VM so there was no need. We were tempted to get on the bandwagon. ;-)
Not inherently, but how we're using it makes it cheaper. Our stack has ~50 different ML models being served live, and GCP makes it easy to treat each model as a micro-service and give each one its own auto-scaling.
This is in contrast to the easiest way we found to deploy the same architecture on AWS using Elastic Beanstalk, which involved one really big instance (that was constantly growing as we added more models), and the costs that come with that.
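To make the contrast concrete, here is an illustrative-only sketch (placeholder project, bucket, model names, and scaling settings) of what "one micro-service per model" can look like with the AI Platform API: each model gets its own model/version resource with its own scaling floor, instead of one ever-growing instance holding everything.

```python
# Sketch of deploying each model as its own AI Platform model + version.
from googleapiclient import discovery

PROJECT = "my-gcp-project"                            # placeholder
MODELS = ["customer_a", "customer_b", "customer_c"]   # ~50 of these in practice

service = discovery.build("ml", "v1")

for model_name in MODELS:
    # One model resource per customer model...
    service.projects().models().create(
        parent=f"projects/{PROJECT}",
        body={"name": model_name, "regions": ["us-central1"]},
    ).execute()
    # ...and one version with its own machine type and autoscaling settings.
    service.projects().models().versions().create(
        parent=f"projects/{PROJECT}/models/{model_name}",
        body={
            "name": "v1",
            "deploymentUri": f"gs://my-models-bucket/{model_name}/",  # placeholder
            "runtimeVersion": "2.3",
            "framework": "TENSORFLOW",
            "pythonVersion": "3.7",
            "machineType": "n1-standard-2",
            "autoScaling": {"minNodes": 1},
        },
    ).execute()
```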
Because we were loading all the models, startup time was long, which meant the server would return 5xx errors, which created more instability. We would have had to do some engineering around it with a mix of config and code changes.
The bigger issue was that we had to use a bigger machine as we added more custom ML models for our customers. The new architecture gives us huge $$ savings and more visibility into the performance of each model.
A bit of both. Flask was obviously not designed with serving TensorFlow models locally in mind, which is how we had it set up in v0. Towards the beginning we had to debug some weird threading issues, but towards the end, once it was more stable (as a result of some hacky fixes), the timeouts were the main issue.
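One common pattern for that kind of threading problem is to lazy-load each model behind a lock so concurrent requests don't initialize the same model twice, roughly like this sketch (illustrative only; not necessarily the exact fix applied in v0):

```python
import threading
import tensorflow as tf

_models = {}
_lock = threading.Lock()

def get_model(name, path):
    """Return a cached model, loading it at most once even under concurrency."""
    model = _models.get(name)
    if model is None:
        with _lock:
            model = _models.get(name)  # re-check after acquiring the lock
            if model is None:
                model = tf.keras.models.load_model(path)
                _models[name] = model
    return model
```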