Haupz Blog

... still a totally disordered mix

The Alert That Cried "Problem"

2024-02-05 — Michael Haupt

Getting alerts right is hard. If they’re over-zealous, on-call folks will be woken up in the middle of the night for no good reason. If they’re too hesitant, things go awry for too long before anyone notices.

It's easy to fall into the trap of configuring too many alerts "just to be sure". In such cases, some on-call folks implement workarounds that wake them up only if the alert hasn't cleared after 5 minutes.
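To make that workaround concrete, here is a minimal sketch of what such a delay filter might look like. It is an illustration only: the function name `should_page`, its parameters, and the `GRACE_PERIOD` constant are hypothetical and not taken from any particular paging tool.

```python
from datetime import datetime, timedelta
from typing import Optional

# Assumption: the 5-minute delay mentioned in the text.
GRACE_PERIOD = timedelta(minutes=5)


def should_page(
    fired_at: datetime,
    resolved_at: Optional[datetime],
    now: datetime,
    grace: timedelta = GRACE_PERIOD,
) -> bool:
    """Return True only if the alert is still firing after the grace period."""
    if resolved_at is not None and resolved_at <= now:
        # The alert cleared on its own, so nobody gets woken up.
        return False
    # Still firing: page only once the grace period has fully elapsed.
    return now - fired_at >= grace
```

Note what the filter costs: a genuine outage goes unpaged for the entire grace period.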

But that kind of thing shouldn’t be necessary. In case of a serious outage, 5 minutes can be a lot of time. Every alert that fires should be serious enough to warrant immediate attention.

If that's not the case, there is a glitch in the system that needs to be sorted out. If the alerts are bogus, they should be changed, and if there’s an issue in the software or infrastructure, that needs to be fixed. Either way, there has to be a solid understanding of why the alert fires before any measures are taken.

Here’s a set of pragmatic guidelines for alerts “that don’t suck” (it came to me via my former eBay buddy Mitch Wyle). It may be a good starting point.

Tags: work