Raxx · internal docs


ADR-0030 — Status Page State Machine: canonical states, transitions, and actors

Status: Accepted
Date: 2026-04-28 UTC
Refs: #601, #581, status-raxx-app.md


Context

The console dashboard's internal state vocabulary (HEALTHY, DEGRADED, FAILED, DOWN, STALE, UNKNOWN) was designed for an operator audience and for internal probe data. The public status page needs a state machine that:

- Uses customer-legible language
- Maps cleanly to the three classes of surface (owned, upstream 3P, downstream 3P)
- Defines unambiguous transition rules so that automated probers, FreeScout webhooks, and manual overrides all produce consistent state

The question is: which states are canonical, what triggers each transition, and how do conflicts resolve when multiple actors drive state simultaneously?


Decision

Six canonical public states:

| State | Public label | Meaning |
|---|---|---|
| OPERATIONAL | Operational | Surface is fully functional; probe passing; no open tickets |
| DEGRADED | Degraded performance | Surface is up but slower than normal, partially failing, or experiencing elevated errors |
| PARTIAL | Partial outage | A subset of functionality is unavailable (e.g., one region, one flow) |
| DOWN | Service disruption | Surface is not responding or is fully non-functional |
| MAINTENANCE | Scheduled maintenance | Planned downtime window; operator-initiated |
| UNKNOWN | Status unknown | No probe data yet; initial state; poller gap > threshold |

Six states total (the issue brief mentioned five; MAINTENANCE is added as a distinct state because it requires a different public communication posture than DOWN).
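As an illustration only, the state table above could be captured as an enum whose values are the public labels. The names `PublicState` and `INITIAL_PROBED_STATE` are hypothetical and do not come from the Raxx codebase.

```python
from enum import Enum


class PublicState(Enum):
    """The six canonical public states, keyed to their customer-facing labels."""

    OPERATIONAL = "Operational"
    DEGRADED = "Degraded performance"
    PARTIAL = "Partial outage"
    DOWN = "Service disruption"
    MAINTENANCE = "Scheduled maintenance"
    UNKNOWN = "Status unknown"


# Per the table, a probed surface with no probe data yet starts in UNKNOWN.
# (Hypothetical constant name; 3P surfaces are handled separately.)
INITIAL_PROBED_STATE = PublicState.UNKNOWN

if __name__ == "__main__":
    for state in PublicState:
        print(f"{state.name:<12} {state.value}")
```

Keeping the label as the enum value means the rendering layer never needs a second lookup table, which is one way to stop state names and labels drifting apart.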

Transition table (actor → from → to):

| Actor | Trigger | From | To |
|---|---|---|---|
| Automated prober | Consecutive probe successes (N=3) | any | OPERATIONAL |
| Automated prober | Single probe failure | OPERATIONAL | DEGRADED |
| Automated prober | Consecutive probe failures (N=3) | DEGRADED | DOWN |
| Automated prober | Latency > P95 threshold | OPERATIONAL | DEGRADED |
| Automated prober | Probe gap > 90s | any healthy | UNKNOWN |
| FreeScout webhook | Ticket opened with component_tag | OPERATIONAL | DEGRADED (or PARTIAL/DOWN per ticket severity field) |
| FreeScout webhook | Ticket closed | any | OPERATIONAL (unless prober disagrees; see conflict rule) |
| 3P poller | Partner reports incident | OPERATIONAL | DEGRADED or DOWN (mapped from partner's status) |
| 3P poller | Partner reports resolution | DEGRADED/DOWN | OPERATIONAL |
| Operator manual override | Console action | any | any (requires superadmin + audit log entry) |
| Operator schedule | Maintenance window start | any | MAINTENANCE |
| Operator schedule | Maintenance window end | MAINTENANCE | OPERATIONAL (prober re-checks within 60s) |
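The prober's pass/fail rows can be sketched as a small counter-based tracker. This is a minimal illustration, not the actual prober: `ProberState` and `observe` are hypothetical names, the latency and probe-gap rows are left out, and it assumes the failure streak includes the first failure that moved the surface to DEGRADED (the table does not say whether that failure counts toward N=3).

```python
from dataclasses import dataclass

N = 3  # consecutive results required, per the transition table


@dataclass
class ProberState:
    state: str = "UNKNOWN"  # no probe data yet
    successes: int = 0      # current run of consecutive passing probes
    failures: int = 0       # current run of consecutive failing probes

    def observe(self, ok: bool) -> str:
        """Fold one probe result into the tracker and return the new state."""
        if ok:
            self.successes += 1
            self.failures = 0
            if self.successes >= N:
                # N consecutive successes clear any state to OPERATIONAL.
                self.state = "OPERATIONAL"
        else:
            self.failures += 1
            self.successes = 0
            if self.state == "OPERATIONAL":
                # A single failure is enough to leave OPERATIONAL.
                self.state = "DEGRADED"
            elif self.state == "DEGRADED" and self.failures >= N:
                # ASSUMPTION: the streak includes the failure that caused DEGRADED.
                self.state = "DOWN"
        return self.state


if __name__ == "__main__":
    prober = ProberState()
    for result in [True, True, True, False, False, False, True, True, True]:
        print(result, "->", prober.observe(result))
```

A failure observed while the tracker is still UNKNOWN leaves the state unchanged here, because the table defines no prober row for that case.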

Conflict resolution rule (multiple actors drive state simultaneously):

Priority order (highest wins): manual override > MAINTENANCE schedule > DOWN > PARTIAL > DEGRADED > UNKNOWN > OPERATIONAL
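One way to read the priority order is as a ranking over the states that actors are currently asserting, with the manual override short-circuiting everything else. The sketch below assumes each actor contributes one claimed state; `resolve` and its parameters are hypothetical names.

```python
from typing import Iterable, Optional

# Lowest to highest priority, copied from the rule above.
# The manual override is handled separately because it outranks every state.
PRIORITY = ["OPERATIONAL", "UNKNOWN", "DEGRADED", "PARTIAL", "DOWN", "MAINTENANCE"]


def resolve(claims: Iterable[str], manual_override: Optional[str] = None) -> str:
    """Return the winning public state among simultaneous actor claims.

    `claims` holds the state each actor currently asserts and must contain
    at least one entry (an assumption of this sketch).
    """
    if manual_override is not None:
        return manual_override  # highest priority: wins regardless of claims
    return max(claims, key=PRIORITY.index)


if __name__ == "__main__":
    print(resolve(["OPERATIONAL", "DOWN", "DEGRADED"]))
    print(resolve(["DOWN", "MAINTENANCE"]))
    print(resolve(["DOWN"], manual_override="OPERATIONAL"))
```

Note that UNKNOWN outranks OPERATIONAL, so a surface with a probe gap is not shown as healthy just because another actor reports it as fine.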

A FreeScout ticket close cannot transition a surface to OPERATIONAL if the automated prober is reporting DOWN. The prober must confirm recovery before the state clears. Specifically:

- On ticket close: if prober state is OPERATIONAL or UNKNOWN → transition to OPERATIONAL.
- On ticket close: if prober state is DEGRADED/DOWN → transition to DEGRADED (the ticket cleared, but the probe issue persists; the public note is updated to reflect probe-only degradation).

This prevents a closed ticket from masking an ongoing outage.
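The two ticket-close cases can be written as a single function of the prober's current state. The sketch follows the wording of the rule literally, including mapping a DOWN prober state to DEGRADED on close; `state_after_ticket_close` is a hypothetical name, and rejecting any other prober state is an assumption of this example.

```python
def state_after_ticket_close(prober_state: str) -> str:
    """Public state a surface enters when its FreeScout ticket is closed."""
    if prober_state in ("OPERATIONAL", "UNKNOWN"):
        # The prober does not contradict the close, so the surface clears.
        return "OPERATIONAL"
    if prober_state in ("DEGRADED", "DOWN"):
        # The ticket cleared but the probe issue persists (probe-only degradation).
        return "DEGRADED"
    # ASSUMPTION: the prober only ever reports the four states handled above.
    raise ValueError(f"unexpected prober state: {prober_state!r}")


if __name__ == "__main__":
    for s in ("OPERATIONAL", "UNKNOWN", "DEGRADED", "DOWN"):
        print(s, "->", state_after_ticket_close(s))
```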

State for 3P surfaces:

3P surfaces (upstream and downstream) have no probe URL. Their state is driven by:

- 3P poller (upstream only, per #608 evaluation)
- FreeScout webhook (operator can manually open a 3P tracking ticket)
- Manual override

3P surfaces default to OPERATIONAL and transition only when the 3P poller or an operator ticket says otherwise. They never go UNKNOWN from a probe gap (there is no probe).
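The 3P rules above amount to a different default state plus a restricted set of actors per surface class. A minimal sketch, with hypothetical identifiers for the surface classes and actors:

```python
# Actors allowed to drive each 3P surface class, per the list above.
# The string identifiers are illustrative, not taken from the Raxx config.
THIRD_PARTY_ACTORS = {
    "upstream_3p": frozenset({"3p_poller", "freescout_webhook", "manual_override"}),
    "downstream_3p": frozenset({"freescout_webhook", "manual_override"}),
}


def initial_state(surface_class: str) -> str:
    """3P surfaces default to OPERATIONAL; probed surfaces start UNKNOWN."""
    return "OPERATIONAL" if surface_class in THIRD_PARTY_ACTORS else "UNKNOWN"


def may_drive(surface_class: str, actor: str) -> bool:
    """True if `actor` may change the state of the given 3P surface class.

    Raises KeyError for non-3P classes, which this sketch does not model.
    """
    return actor in THIRD_PARTY_ACTORS[surface_class]


if __name__ == "__main__":
    print(initial_state("upstream_3p"), initial_state("owned"))
    print(may_drive("downstream_3p", "3p_poller"))
```

Because the prober never appears in either actor set, a probe gap has no path to move a 3P surface to UNKNOWN.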

Severity field on FreeScout tickets:

A third FreeScout custom field incident_severity (options: degraded, partial, down) drives which non-OPERATIONAL state a surface enters when a ticket is opened. If absent, defaults to degraded. This is documented in the #605 runbook.
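The severity lookup, including the default for a missing field, might look like the following. The shape of `custom_fields` (a plain dict keyed by field name) is an assumption for illustration and is not FreeScout's actual webhook payload; `state_for_opened_ticket` is a hypothetical name, and raising on an unrecognized value is a choice of this sketch that the decision does not specify.

```python
SEVERITY_TO_STATE = {
    "degraded": "DEGRADED",
    "partial": "PARTIAL",
    "down": "DOWN",
}


def state_for_opened_ticket(custom_fields: dict) -> str:
    """Map a ticket's incident_severity to the state the surface enters."""
    # An absent (or empty) field defaults to "degraded", as stated above.
    severity = custom_fields.get("incident_severity") or "degraded"
    try:
        return SEVERITY_TO_STATE[severity]
    except KeyError:
        # ASSUMPTION: unknown values are rejected rather than silently defaulted.
        raise ValueError(f"unknown incident_severity: {severity!r}") from None


if __name__ == "__main__":
    print(state_for_opened_ticket({"incident_severity": "partial"}))
    print(state_for_opened_ticket({}))
```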


Consequences

Positive:

- MAINTENANCE as a distinct state prevents customers from reading planned downtime as an unplanned incident, which is an important communication distinction.
- The conflict resolution rule ensures that a ticket close cannot silence automated probe state, so closing a ticket cannot hide an ongoing outage. Only an audited superadmin override outranks the prober.
- The prober's N=3 hysteresis on both transition directions (failure → wait 3 cycles before escalating; success → wait 3 cycles before clearing) prevents flapping.
- Mapping from the console's internal badge vocabulary to public states is deterministic: HEALTHY → OPERATIONAL, DEGRADED → DEGRADED, FAILED/DOWN → DOWN, STALE/UNKNOWN → UNKNOWN.

Negative:

- Six states are slightly more complex than five. PARTIAL and DEGRADED require operators to choose between them when opening tickets. The runbook must define the distinction clearly: PARTIAL = a feature is completely absent; DEGRADED = all features are present but slower or less reliable.
- The conflict resolution priority means a DOWN prober state blocks an OPERATIONAL transition from ticket close. If the prober is wrong (e.g., the probe URL changed but the config wasn't updated), the state is stuck DOWN until the prober is fixed. This is correct behavior: a misconfigured probe is itself a signal that something needs attention.
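The badge-to-public-state mapping listed under the positive consequences is small enough to express as a lookup table. A minimal sketch, with hypothetical names `BADGE_TO_PUBLIC` and `public_state`:

```python
# Internal console badge -> canonical public state, as listed in this ADR.
BADGE_TO_PUBLIC = {
    "HEALTHY": "OPERATIONAL",
    "DEGRADED": "DEGRADED",
    "FAILED": "DOWN",
    "DOWN": "DOWN",
    "STALE": "UNKNOWN",
    "UNKNOWN": "UNKNOWN",
}


def public_state(badge: str) -> str:
    """Translate an internal badge; raises KeyError for an unlisted badge."""
    return BADGE_TO_PUBLIC[badge]


if __name__ == "__main__":
    for badge in BADGE_TO_PUBLIC:
        print(badge, "->", public_state(badge))
```

PARTIAL and MAINTENANCE have no internal badge in this mapping, so under this sketch they can only come from tickets, the maintenance schedule, or a manual override.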

Alternatives Rejected: