How has Fly's reliability been as of late? From past discussions I got the impression that despite being quite polished on the surface, behind the scenes their operations are a bit of a mess.
From personal experience, still not great. Sometimes things just don't deploy.
I host my little mobile braai simulator on it (braaisim.com) because it's an art project and it's easy to add new domains / certificates. It's also convenient to be able to scale up easily if it ever went viral in ZA, and it's nice that they support the Johannesburg region (jnb), which is close to me.
You were downvoted, but as a user, I’ve experienced the same inconsistency with deploys.
There’s also a hard container size limit I’ve run into multiple times. If you add a dependency and go over the size limit, your app won’t deploy unless you switch to a GPU instance, which is substantially more expensive.
Don't switch to GPU instances just to get around the container size limit!
What are you trying to do and what's the size of your container? My instinctive response here is that you're not holding it the way we expect you to: that some of what you're sticking in your container, we'd expect you to keep in a volume.
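To make that concrete, here's a minimal sketch of the volume approach; the app name, volume name, size, and region are just placeholders, not anything specific to your setup. You create the volume once with `fly volumes create`, then mount it in `fly.toml` so the bulky files live on the volume instead of in the image:

```toml
# fly.toml -- mount a persistent volume instead of baking large files into the image
# (volume created beforehand with: fly volumes create data --region jnb --size 3)
app = "my-app"              # placeholder app name

[mounts]
  source = "data"           # name of the volume created above
  destination = "/data"     # path inside the container where it appears
```

Anything the app writes or reads under `/data` then lives on the volume, and the container image itself stays small.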
I see two sibling comments with reports of patchy reliability, so I'll note that it's been rock-solid for me (multiple daily deploys every weekday) since I last commented on it (about a year ago?).
There have been a bunch of mini outages, and the main site + dashboard were out for maybe an hour last Thursday.
My production app has had one outage, where Stripe webhooks weren't delivering for 45 minutes, which is a huge issue for sure.
I mitigated it by polling Stripe in addition to relying on the webhook events, and it hasn't had a problem since. Their explanation was a third-party networking issue at or near one of their data centers.
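For anyone curious, the mitigation is roughly a periodic reconciliation poll running alongside the webhook handler. A minimal sketch in Python using the official `stripe` library; `already_processed` and `handle_event` are hypothetical stand-ins for whatever idempotency check and event handler the webhook route already uses:

```python
import os
import time

import stripe

stripe.api_key = os.environ["STRIPE_SECRET_KEY"]

def reconcile_recent_events(window_seconds: int = 3600) -> None:
    """Poll Stripe for recent events and reprocess anything the webhook endpoint missed."""
    cutoff = int(time.time()) - window_seconds
    events = stripe.Event.list(created={"gte": cutoff}, limit=100)
    for event in events.auto_paging_iter():
        if not already_processed(event.id):  # hypothetical idempotency check
            handle_event(event)              # same handler the webhook route uses

# run this from a scheduled job every few minutes as a safety net
```

If delivery breaks again, the poller picks the events up within one cycle instead of them being lost until someone notices.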
Bad enough that we had to migrate our key customer-facing platform off of it because of the impact.
I love the idea and will still use it for many less-business-critical apps, but after multiple late-night, weird, un-debuggable database outages we had to switch.
To provide at least one positive example, I'm using fly.io to host a website I made for a friend (for her college capstone project), so I'm on the now-legacy hobby plan paying nothing, and I have had zero problems.
I use it to power CI builds (a lot of them!) and have had very few issues with it.
Basically I'm just using the API to spin up machines, which do some work and shut down. There are a few extra machines per build job, like database containers or a headless browser for testing. Pretty smooth in my experience.
I think the only occasional issue I hit is the internal DNS being a few seconds behind reality.
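In case it's useful to anyone, the spin-up part is basically one call to the Machines API. A rough sketch in Python; the app name, image, command, and token env var are placeholders for whatever your setup uses, and `auto_destroy` makes the machine delete itself once its process exits:

```python
import os

import requests

FLY_API = "https://api.machines.dev/v1"
APP = "my-ci-runners"                  # placeholder app name
TOKEN = os.environ["FLY_API_TOKEN"]    # placeholder env var holding a Fly API token

def launch_runner(image: str, command: list[str]) -> dict:
    """Create a throwaway Machine that runs one CI job and destroys itself afterwards."""
    resp = requests.post(
        f"{FLY_API}/apps/{APP}/machines",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={
            "config": {
                "image": image,
                "init": {"cmd": command},
                "auto_destroy": True,  # machine is deleted once its process exits
                "guest": {"cpu_kind": "shared", "cpus": 2, "memory_mb": 2048},
            }
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

machine = launch_runner("ghcr.io/example/ci-runner:latest", ["./run-build.sh"])
print(machine["id"], machine["state"])
```

The database containers and headless browsers are just more machines launched the same way, then torn down with the job.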
Semantics aside, if an incident/outage/<other term> affects devs/users in any way, it really should be on the status page.
I found it impossible to distinguish between user error and platform outage. Too often it was a problem on Fly's end, yet the status page showed nothing (perplexed, I'd rerun the deploy a few hours later and it would work).
Can't stress it enough: if fly's services aren't working for any reason, big or small, put it on the status page. Devs need to know when something's not in their control so they can inform their team and customers and go to bed or take a coffee break. They shouldn't be up till 5am thinking they borked something when the problem is 100% on fly's end.
We have the same semantics. The incidents that are on the infra-log and not on the status page are things that didn't affect devs/users in any way. A good example: we have clusters of edge servers in every region we serve. They're servers, part of their job is to occasionally throw a rod. When that happens, we pull them out of service and stop advertising them. That's an internal incident, but it's not a status page incident. Preempting a response here: no, I don't think users would be better off if we status paged stuff like this.
Every week, we go through the incident management system, and all our incidents, whether or not they had user impact, get written up here.
One thing you're probably running into, which is not on you but rather on us, that we are aware of, and that we are grinding steadily away at: as the platform has stabilized over the last year, an increasing share of problems users run into aren't "platform" issues (in the sense of: things going wrong in our API or on our cloud servers) but rather bugs in `flyctl`, our CLI.
We have two big reliability projects happening:
* We've finally landed our "white whale" of being able to move workloads that have attached volumes between different physicals. I'm writing a big long blog post about how that works now. We've been doing it for about 9 months, but we're at the point where we can do it automatically and at scale. Really hard to express how much not having this capability increased our degree of difficulty.
* We're taming `flyctl`; in the immediate term, that means doing a better job of reporting errors (it's a Go program, we love Go, but showing "context deadline exceeded" to a user is, obviously, a bug, albeit one we didn't recognize as such until recently).
I don't have a way of addressing these kinds of concerns, which are valid, without just being clear about what we're up to and how we're thinking. I wrote a sort of "once-and-for-all" thing about this mentality here:
I get the impression that the operations are a bit of a mess too, but that support is empowered to actually make people whole, which is a nice change of pace.
Database server connections just die and don't heal.
Apps get scaled down to 0 despite a minimum number of running machines being set.
Our logshipper just stopped working for a week.
Our apps occasionally just stop working.
The worst part is that I could cope with the above if they communicated the issues.
They don't. They make major changes ("improvements"), break stuff regularly, acknowledge in the forums that they broke stuff, but don't communicate with those affected.
I am seriously considering going elsewhere, and I really don't want to. The intention is good, but I am not convinced that they are stable enough for people to gamble their businesses on it just yet...
e.g. https://news.ycombinator.com/item?id=36808296
https://news.ycombinator.com/item?id=39365735