ADR-0030 — Status Page State Machine: canonical states, transitions, and actors

Status: Accepted
Date: 2026-04-28 UTC
Refs: #601, #581, status-raxx-app.md

Context

The console dashboard's internal state vocabulary (HEALTHY, DEGRADED, FAILED, DOWN, STALE, UNKNOWN) was designed for an operator audience and for internal probe data. The public status page needs a state machine that:
- Uses customer-legible language
- Maps cleanly to the three classes of surface (owned, upstream 3P, downstream 3P)
- Defines unambiguous transition rules so that automated probers, FreeScout webhooks, and manual overrides all produce consistent state

The question is: which states are canonical, what triggers each transition, and how do conflicts resolve when multiple actors drive state simultaneously?

Decision

Six canonical public states:

State	Public label	Meaning
`OPERATIONAL`	Operational	Surface is fully functional; probe passing; no open tickets
`DEGRADED`	Degraded performance	Surface is up but slower than normal, partially failing, or experiencing elevated errors
`PARTIAL`	Partial outage	A subset of functionality is unavailable (e.g., one region, one flow)
`DOWN`	Service disruption	Surface is not responding or is fully non-functional
`MAINTENANCE`	Scheduled maintenance	Planned downtime window; operator-initiated
`UNKNOWN`	Status unknown	No probe data yet; initial state; poller gap > threshold

Six states total (the issue brief mentioned five; MAINTENANCE is added as a distinct state because it requires a different public communication posture than DOWN).
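The state table above can be captured as a small enumeration. This is a minimal sketch, not the project's actual code: the class name `PublicState` and the helper `public_label` are hypothetical, while the state names and public labels are taken directly from the table.

```python
from enum import Enum


class PublicState(Enum):
    """Canonical public states; each value is the customer-facing label."""

    OPERATIONAL = "Operational"
    DEGRADED = "Degraded performance"
    PARTIAL = "Partial outage"
    DOWN = "Service disruption"
    MAINTENANCE = "Scheduled maintenance"
    UNKNOWN = "Status unknown"


def public_label(state: PublicState) -> str:
    """Return the label shown on the public status page (hypothetical helper)."""
    return state.value


if __name__ == "__main__":
    for state in PublicState:
        print(f"{state.name:<12} {public_label(state)}")
```

Keeping the label as the enum value means the state name and its public wording cannot drift apart in separate lookup tables.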

Transition table (actor → from → to):

Actor	Trigger	From	To
Automated prober	Consecutive probe successes (N=3)	any	OPERATIONAL
Automated prober	Single probe failure	OPERATIONAL	DEGRADED
Automated prober	Consecutive probe failures (N=3)	DEGRADED	DOWN
Automated prober	Latency > P95 threshold	OPERATIONAL	DEGRADED
Automated prober	Probe gap > 90s	any healthy	UNKNOWN
FreeScout webhook	ticket opened with component_tag	OPERATIONAL	DEGRADED (or PARTIAL/DOWN per ticket severity field)
FreeScout webhook	ticket closed	any	OPERATIONAL (unless prober disagrees — see conflict rule)
3P poller	Partner reports incident	OPERATIONAL	DEGRADED or DOWN (mapped from partner's status)
3P poller	Partner reports resolution	DEGRADED/DOWN	OPERATIONAL
Operator manual override	Console action	any	any (requires superadmin + audit log entry)
Operator schedule	Maintenance window start	any	MAINTENANCE
Operator schedule	Maintenance window end	MAINTENANCE	OPERATIONAL (prober re-checks within 60s)
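The automated prober rows of the table can be sketched as a small streak counter. This is illustrative only and rests on several assumptions the ADR does not state: the class name `ProberTracker` is hypothetical, the first failure counts toward the three-failure streak, a slow but successful probe resets the success streak without counting as a failure, and any combination the table does not list leaves the state unchanged. The probe-gap rule is omitted.

```python
from dataclasses import dataclass

N = 3  # consecutive results required, per the transition table


@dataclass
class ProberTracker:
    """Tracks the prober's view of one owned surface (hypothetical sketch)."""

    state: str = "UNKNOWN"  # initial state per the state table
    successes: int = 0
    failures: int = 0

    def observe(self, ok: bool, slow: bool = False) -> str:
        """Apply one probe result and return the resulting prober state."""
        if ok and not slow:
            self.successes += 1
            self.failures = 0
            if self.successes >= N:
                self.state = "OPERATIONAL"  # N successes: any -> OPERATIONAL
        elif ok and slow:
            # Latency over threshold: OPERATIONAL -> DEGRADED.
            # Assumption: a slow success does not count toward recovery.
            self.successes = 0
            self.failures = 0
            if self.state == "OPERATIONAL":
                self.state = "DEGRADED"
        else:
            self.failures += 1
            self.successes = 0
            if self.state == "OPERATIONAL":
                self.state = "DEGRADED"  # single failure
            elif self.state == "DEGRADED" and self.failures >= N:
                self.state = "DOWN"  # N consecutive failures
        return self.state
```

With these assumptions, a surface that starts at UNKNOWN reaches OPERATIONAL on the third clean probe, drops to DEGRADED on the first failure, and reaches DOWN on the third consecutive failure.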

Conflict resolution rule (multiple actors drive state simultaneously):

Priority order (highest wins): manual override > MAINTENANCE schedule > DOWN > PARTIAL > DEGRADED > UNKNOWN > OPERATIONAL
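The priority order above amounts to taking the highest-ranked claim, with a manual override short-circuiting everything else. The following is a minimal sketch under stated assumptions: the function name `resolve` and the `STATE_RANK` table are hypothetical, and returning UNKNOWN when no actor has reported anything is an assumption made here (the ADR gives 3P surfaces a different default).

```python
from typing import Iterable, Optional

# Higher number wins; order copied from the priority rule.
STATE_RANK = {
    "OPERATIONAL": 0,
    "UNKNOWN": 1,
    "DEGRADED": 2,
    "PARTIAL": 3,
    "DOWN": 4,
    "MAINTENANCE": 5,
}


def resolve(claims: Iterable[str], manual_override: Optional[str] = None) -> str:
    """Pick the effective public state from the states claimed by each actor."""
    if manual_override is not None:
        return manual_override  # manual override outranks every other actor
    # Assumption: with no claims at all, fall back to UNKNOWN.
    return max(claims, key=STATE_RANK.__getitem__, default="UNKNOWN")


if __name__ == "__main__":
    print(resolve(["OPERATIONAL", "DOWN", "DEGRADED"]))
    print(resolve(["DOWN", "MAINTENANCE"]))
    print(resolve(["DOWN"], manual_override="OPERATIONAL"))
```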

A FreeScout ticket close cannot transition a surface to OPERATIONAL while the automated prober is reporting DEGRADED or DOWN. The prober must confirm recovery before the state clears. Specifically:
- On ticket close: if prober state is OPERATIONAL or UNKNOWN → transition to OPERATIONAL.
- On ticket close: if prober state is DEGRADED/DOWN → transition to DEGRADED (the ticket cleared, but the probe issue persists; the public note is updated to reflect probe-only degradation).

This prevents a closed ticket from masking an ongoing outage.
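The ticket-close rule can be written as a single function of the prober's current state. This sketch follows the rule exactly as worded above, including the DEGRADED result when the prober reports DOWN. The function name is hypothetical, and raising on any other prober state is an assumption, since the rule only covers these four.

```python
def state_after_ticket_close(prober_state: str) -> str:
    """Return the surface state once its FreeScout ticket is closed."""
    if prober_state in ("OPERATIONAL", "UNKNOWN"):
        # Prober does not contradict the close, so the surface clears.
        return "OPERATIONAL"
    if prober_state in ("DEGRADED", "DOWN"):
        # Ticket cleared but the probe issue persists: probe-only degradation.
        return "DEGRADED"
    # Assumption: other values are not valid prober states in this sketch.
    raise ValueError(f"unexpected prober state: {prober_state!r}")


if __name__ == "__main__":
    for s in ("OPERATIONAL", "UNKNOWN", "DEGRADED", "DOWN"):
        print(s, "->", state_after_ticket_close(s))
```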

State for 3P surfaces:

3P surfaces (upstream and downstream) have no probe URL. Their state is driven by:
- 3P poller (upstream only, per the #608 evaluation)
- FreeScout webhook (operator can manually open a 3P tracking ticket)
- Manual override

3P surfaces default to OPERATIONAL and transition only when the 3P poller or an operator ticket says otherwise. They never go UNKNOWN from a probe gap (there is no probe).
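The 3P rules above can be sketched as a lookup of permitted drivers plus a default state. All identifiers here (`THREE_P_DRIVERS`, the surface-class strings, the actor strings) are hypothetical names chosen for illustration. The UNKNOWN default for owned surfaces is taken from the state table's "initial state" wording.

```python
# Actors allowed to drive each 3P surface class, per the list above.
THREE_P_DRIVERS = {
    "upstream_3p": frozenset({"3p_poller", "freescout_webhook", "manual_override"}),
    "downstream_3p": frozenset({"freescout_webhook", "manual_override"}),
}


def initial_state(surface_class: str) -> str:
    """3P surfaces start OPERATIONAL; owned surfaces start UNKNOWN."""
    return "OPERATIONAL" if surface_class in THREE_P_DRIVERS else "UNKNOWN"


def three_p_driver_allowed(surface_class: str, actor: str) -> bool:
    """Return True if the actor may change state on the given 3P surface class."""
    if surface_class not in THREE_P_DRIVERS:
        raise ValueError(f"not a 3P surface class: {surface_class!r}")
    return actor in THREE_P_DRIVERS[surface_class]


def probe_gap_applies(surface_class: str) -> bool:
    """Only probed (owned) surfaces can go UNKNOWN from a probe gap."""
    return surface_class not in THREE_P_DRIVERS
```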

Severity field on FreeScout tickets:

A third FreeScout custom field, `incident_severity` (options: degraded, partial, down), determines which non-OPERATIONAL state a surface enters when a ticket is opened. If the field is absent, the severity defaults to degraded. This is documented in the #605 runbook.
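The severity lookup can be sketched as follows. The `custom_fields` dictionary stands in for an already-parsed set of ticket custom fields. Its shape is an assumption, not the FreeScout webhook format, and the function name is hypothetical. Rejecting unrecognised values is also an assumption; the ADR only specifies the default when the field is absent.

```python
SEVERITY_TO_STATE = {
    "degraded": "DEGRADED",
    "partial": "PARTIAL",
    "down": "DOWN",
}


def state_for_opened_ticket(custom_fields: dict) -> str:
    """Map a ticket's incident_severity to the public state it should trigger."""
    # Absent (or empty) field defaults to "degraded", per the ADR.
    severity = custom_fields.get("incident_severity") or "degraded"
    try:
        return SEVERITY_TO_STATE[severity]
    except KeyError:
        # Assumption: unknown values are rejected rather than silently defaulted.
        raise ValueError(f"unknown incident_severity: {severity!r}") from None


if __name__ == "__main__":
    print(state_for_opened_ticket({"incident_severity": "partial"}))
    print(state_for_opened_ticket({}))
```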

Consequences

Positive:
- MAINTENANCE as a distinct state prevents customers from reading planned downtime as an unplanned incident. This is an important communication distinction.
- The conflict resolution rule ensures that automated probe state cannot be silenced by a ticket close. A ticket close alone cannot hide an ongoing outage; only an audited superadmin override outranks the prober.
- The prober's N=3 hysteresis on both transition directions (failure → wait 3 cycles before escalating; success → wait 3 cycles before clearing) prevents flapping.
- Mapping from the console's internal badge vocabulary to public states is deterministic: HEALTHY → OPERATIONAL, DEGRADED → DEGRADED, FAILED/DOWN → DOWN, STALE/UNKNOWN → UNKNOWN.
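The internal-to-public mapping listed above is a plain lookup table. In this minimal sketch the names `INTERNAL_TO_PUBLIC` and `to_public_state` are hypothetical; the pairs themselves come from the mapping stated in the list.

```python
INTERNAL_TO_PUBLIC = {
    "HEALTHY": "OPERATIONAL",
    "DEGRADED": "DEGRADED",
    "FAILED": "DOWN",
    "DOWN": "DOWN",
    "STALE": "UNKNOWN",
    "UNKNOWN": "UNKNOWN",
}


def to_public_state(internal_badge: str) -> str:
    """Translate a console badge into its public status page state."""
    try:
        return INTERNAL_TO_PUBLIC[internal_badge]
    except KeyError:
        raise ValueError(f"unknown internal badge: {internal_badge!r}") from None


if __name__ == "__main__":
    for badge in INTERNAL_TO_PUBLIC:
        print(f"{badge:<8} -> {to_public_state(badge)}")
```

PARTIAL and MAINTENANCE have no internal badge in this table; per the transition table they are reached through tickets, schedules, or overrides rather than from probe data.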

Negative:
- Six states are slightly more complex than five. PARTIAL and DEGRADED require operators to choose between them when opening tickets. The runbook must define the distinction clearly: PARTIAL = a feature is completely absent; DEGRADED = all features are present but slower or less reliable.
- The conflict resolution priority means a DOWN prober state blocks an OPERATIONAL transition from ticket close. If the prober is wrong (e.g., the probe URL changed but the config wasn't updated), the state stays stuck at DOWN until the prober is fixed or an operator applies a manual override. This is correct behavior: a misconfigured probe is itself a signal that something needs attention.

Alternatives Rejected:

Ticket-wins-over-prober on close: Rejected because it allows a ticket close to mask a real outage. The prober must corroborate.
Three states only (OK / DEGRADED / DOWN): Rejected because MAINTENANCE is not optional — communicating planned downtime as an outage degrades customer trust. PARTIAL is also operationally meaningful for multi-feature services.
Prober drives all state, no manual override: Rejected because 3P surfaces have no probe, and some failure modes are not detectable by a simple HTTP probe (e.g., payment processing backlog, email deliverability issues). Human judgment must be expressible.