Semantics aside, if an incident/outage/<other term> affects devs/users in any way, it really should be on the status page.
I found it impossible to distinguish between user error and platform outage. Too often it was a problem on fly's end yet the status page gave nothing (perplexed, I'd rerun the deploy a few hours later and it would work).
Can't stress it enough: if fly's services aren't working for any reason, big or small, put it on the status page. Devs need to know when something's not in their control so they can inform their team and customers and go to bed or take a coffee break. They shouldn't be up till 5am thinking they borked something when the problem is 100% on fly's end.
We have the same semantics. The incidents that are on the infra-log and not on the status page are things that didn't affect devs/users in any way. A good example: we have clusters of edge servers in every region we serve. They're servers; part of their job is to occasionally throw a rod. When that happens, we pull them out of service and stop advertising them. That's an internal incident, but it's not a status page incident. Preempting a response here: no, I don't think users would be better off if we status paged stuff like this.
Every week, we go through the incident management system, and all our incidents, whether or not they had user impact, get written up here.
One thing you're probably running into, which is not on you but rather on us, that we are aware of, and that we are grinding steadily away at: as the platform has stabilized over the last year, an increasing share of problems users run into aren't "platform" issues (in the sense of: things going wrong in our API or on our cloud servers) but rather bugs in `flyctl`, our CLI.
We have two big reliability projects happening:
* We've finally landed our "white whale" of being able to move workloads that have attached volumes between different physicals. I'm writing a big long blog post about how that works now. We've been doing it for about 9 months, but we're at the point where we can do it automatically and at scale. Really hard to express how much not having this capability increased our degree of difficulty.
* We're taming `flyctl`; in the immediate term, that just means doing a better job of reporting errors (it's a Go program, we love Go, but showing "context deadline exceeded" to a user is, obviously, a bug, albeit one we didn't recognize as such until recently).
I don't have a way of addressing these kinds of concerns, which are valid, without just being clear about what we're up to and how we're thinking. I wrote a sort of "once-and-for-all" thing about this mentality here: