All writing
Notes on building

The Outage That Wasn't

At quarter to six in the morning, every system I own looked dead at once. The most useful instrument on the bench turned out to be the clock.

June 12, 2026 · 5 min

At 5:46 this morning, my operation looked like a crime scene.

The morning brief had not been written. The signals job that runs at dawn had nothing to show. The sales ingest had not posted yesterday's revenue. The watchdog log, the one file that should always be busy, ended yesterday. Every overnight system, missing at once. On paper: a dead morning.

I have been trained — by my own systems, by my own scars — to treat that kind of quiet as guilty until proven innocent. I once lost seventeen days to a pipeline that returned success on every run and wrote nothing at all. Since then the house rule has been simple: silence is a failure mode. When something that should have spoken says nothing, you go and look.

So the machine went and looked. And it very nearly wrote the incident report.

The cron doctor — silence is a failure mode Four cron jobs write heartbeat ticks into a ledger; one row has a gap, and the cron doctor flags the silent job and messages the operator. PRINCIPLE — THE CRON DOCTOR Watch the watchers Every job writes a heartbeat. A doctor reads them twice a day — silence rings a bell. The jobs Sales ingest7:15 Email labelerhourly Watchdog*/5 Backup02:00 Heartbeat ledger The doctor cron-doctor runs 7:00 + 19:00 Backup: silent 26 h iMessage to operator Silence is a failure mode — instrument for it

Two questions before the time of death

The discipline that stopped it is two questions, asked in order, before any verdict.

First: is the heartbeat alive? There is a watchdog that runs every five minutes, and its log had a fresh line — written sixty seconds earlier. Whatever was wrong, the patient had a pulse.

Second: what time is it?

It was 5:46 in the morning. The first job of the day was scheduled for 6:00. The brief writes at 7:30. The revenue ingest fires at 7:15.

Nothing was dead. Nothing was even late. The jobs had not failed to arrive; I had arrived before them. The crime scene was an empty restaurant photographed an hour before it opens.

The mirror image of the zombie

I have written before about the zombie signal — the alarm that outlives its truth, still flashing red months after the thing it watched was fixed. Zombies erode trust from one direction: they keep shouting after the evidence is gone.

This morning was the mirror image. A verdict that arrives before its evidence. Call it the premature autopsy: every fact was technically true — no brief, no rows, no signals — and the conclusion built from those facts was completely wrong, because a snapshot cannot tell the difference between absent and not yet.

Both bugs live on the same axis, and it is not the data axis. It is time. An alarm is only meaningful relative to a schedule, and a schedule has two failure modes: reading it after it stopped being true, and reading it before it starts being true.

You do this to people too

The reason this is worth a thousand words and not a line in a log file is that operators run the same broken query against everything, all day.

The new ad campaign is “dead” on day two — before the bidding system has finished learning. The vendor is “ghosting” because the quote did not come back by nine — he checks email at noon and four, like he told you. The new hire is “slow” in week one — of a job that takes a month to hold in your head. The empty dining room at 5:50 is “a bad night” — ten minutes before the dinner wave that has arrived, on schedule, for thirty years.

Half of judgment is sampling at the right moment. Absent and not-yet-due are pixel-identical in a single frame; they only separate when you know the schedule. “Late” is a fact about an appointment, not a fact about a heartbeat — and if you cannot say when something was due, you have no business saying it failed to show.

The cheap fix

The fix costs almost nothing, which is my favorite kind.

Every check in my morning sweep now answers in three states, not two: healthy, late, or not yet due. Not-yet-due is a first-class status with its own appointment time printed next to it. The dashboard I carry in my pocket stamps every number with “as of” and refuses to pretend a stale value is a fresh one — and the same stamp, read the other direction, refuses to pretend an early look is a missed deadline.

And the habit underneath the tooling: before declaring any outage — in a pipeline, in a person, in a Tuesday — read the clock first. Ask what time the thing was actually due. The answer is free, it is sitting right there, and at quarter to six this morning it was the only instrument on the bench that knew what was going on.

The systems were fine. The schedule was fine. The only thing that had fired early was the judgment.

/ar/