Semantics aside, if an incident/outage/<other term> affects devs/users in *any* ...

tptacek · on July 4, 2024

We have the same semantics. The incidents that are on the infra-log and not on the status page are things that didn't affect devs/users in any way. A good example: we have clusters of edge servers in every region we serve. They're servers; part of their job is to occasionally throw a rod. When that happens, we pull them out of service and stop advertising them. That's an internal incident, but it's not a status page incident. Preempting a response here: no, I don't think users would be better off if we status paged stuff like this.

Every week, we go through the incident management system, and all our incidents, whether or not they had user impact, get written up here.

One thing you're probably running into, which is not on you but rather on us, which we are aware of, and which we are grinding steadily away at: as the platform has stabilized over the last year, an increasing share of problems users run into aren't "platform" issues (in the sense of: things going wrong in our API or on our cloud servers) but rather bugs in `flyctl`, our CLI.

We have two big reliability projects happening:

* We've finally landed our "white whale" of being able to move workloads that have attached volumes between different physicals. I'm currently writing a big long blog post about how that works. We've been doing it for about 9 months, but we're at the point where we can do it automatically and at scale. Really hard to express how much not having this capability increased our degree of difficulty.

* We're taming `flyctl`; in the immediate term, just doing a better job of reporting errors (it's a Go program, we love Go, but showing "context deadline exceeded" to a user is, obviously, a bug, albeit one we didn't recognize as such until recently).

I don't have a way of addressing these kinds of concerns, which are valid, without just being clear about what we're up to and how we're thinking. I wrote a sort of "once-and-for-all" thing about this mentality here:

https://news.ycombinator.com/item?id=39373476