Status: Accepted

Date: 2026-04-28 UTC

Refs: #601, #581, status-raxx-app.md
The console dashboard's internal state vocabulary (HEALTHY, DEGRADED, FAILED, DOWN, STALE, UNKNOWN) was designed for an operator audience and for internal probe data. The public status page needs a state machine that:
- Uses customer-legible language
- Maps cleanly to the three classes of surface (owned, upstream 3P, downstream 3P)
- Defines unambiguous transition rules so that automated probers, FreeScout webhooks, and manual overrides all produce consistent state
The question is: which states are canonical, what triggers each transition, and how do conflicts resolve when multiple actors drive state simultaneously?
Six canonical public states:
| State | Public label | Meaning |
|---|---|---|
| OPERATIONAL | Operational | Surface is fully functional; probe passing; no open tickets |
| DEGRADED | Degraded performance | Surface is up but slower than normal, partially failing, or experiencing elevated errors |
| PARTIAL | Partial outage | A subset of functionality is unavailable (e.g., one region, one flow) |
| DOWN | Service disruption | Surface is not responding or is fully non-functional |
| MAINTENANCE | Scheduled maintenance | Planned downtime window; operator-initiated |
| UNKNOWN | Status unknown | No probe data yet; initial state; poller gap > threshold |
The issue brief mentioned five states; MAINTENANCE is added as a sixth, distinct state because it requires a different public communication posture than DOWN.
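As a minimal sketch, the six states and their public labels could be held in a single enumeration so that probers, webhooks, and the page renderer share one vocabulary. The names `PublicState` and `INITIAL_STATE` are illustrative and do not come from the existing codebase.

```python
from enum import Enum


class PublicState(Enum):
    """Canonical public states; each value is the customer-facing label."""
    OPERATIONAL = "Operational"
    DEGRADED = "Degraded performance"
    PARTIAL = "Partial outage"
    DOWN = "Service disruption"
    MAINTENANCE = "Scheduled maintenance"
    UNKNOWN = "Status unknown"


# A probed surface starts in UNKNOWN until probe data arrives.
INITIAL_STATE = PublicState.UNKNOWN
```

Rendering code can then read `PublicState.DOWN.value` instead of keeping a separate label table.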
Transition table (actor, trigger, from, to):
| Actor | Trigger | From | To |
|---|---|---|---|
| Automated prober | Consecutive probe successes (N=3) | any | OPERATIONAL |
| Automated prober | Single probe failure | OPERATIONAL | DEGRADED |
| Automated prober | Consecutive probe failures (N=3) | DEGRADED | DOWN |
| Automated prober | Latency > P95 threshold | OPERATIONAL | DEGRADED |
| Automated prober | Probe gap > 90s | any healthy | UNKNOWN |
| FreeScout webhook | ticket opened with component_tag | OPERATIONAL | DEGRADED (or PARTIAL/DOWN per ticket severity field) |
| FreeScout webhook | ticket closed | any | OPERATIONAL (unless prober disagrees — see conflict rule) |
| 3P poller | Partner reports incident | OPERATIONAL | DEGRADED or DOWN (mapped from partner's status) |
| 3P poller | Partner reports resolution | DEGRADED/DOWN | OPERATIONAL |
| Operator manual override | Console action | any | any (requires superadmin + audit log entry) |
| Operator schedule | Maintenance window start | any | MAINTENANCE |
| Operator schedule | Maintenance window end | MAINTENANCE | OPERATIONAL (prober re-checks within 60s) |
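The automated prober rows above can be sketched as a small state machine. This is an illustration under stated assumptions, not the prober's actual implementation: the class and method names are hypothetical, the failure that moves a surface to DEGRADED is assumed to count toward the run of three that moves it to DOWN, a slow but successful probe is assumed to break the success run without counting as a failure, and "any healthy" is read as OPERATIONAL or DEGRADED.

```python
from dataclasses import dataclass

N_CONSECUTIVE = 3        # N from the transition table
PROBE_GAP_SECONDS = 90   # probe gap threshold from the transition table


@dataclass
class ProberStateMachine:
    state: str = "UNKNOWN"   # no probe data yet
    successes: int = 0       # current run of consecutive healthy probes
    failures: int = 0        # current run of consecutive failed probes

    def on_probe(self, ok: bool, latency_breach: bool = False) -> str:
        """Apply one probe result and return the resulting prober state."""
        if ok and not latency_breach:
            self.successes += 1
            self.failures = 0
            if self.successes >= N_CONSECUTIVE:
                self.state = "OPERATIONAL"   # N successes: any -> OPERATIONAL
        elif ok:
            # Assumption: a slow success breaks the success run but is
            # not counted as a failure.
            self.successes = 0
            self.failures = 0
            if self.state == "OPERATIONAL":
                self.state = "DEGRADED"      # latency over threshold
        else:
            self.failures += 1
            self.successes = 0
            if self.state == "OPERATIONAL":
                self.state = "DEGRADED"      # single failure
            elif self.state == "DEGRADED" and self.failures >= N_CONSECUTIVE:
                self.state = "DOWN"          # N consecutive failures
        return self.state

    def on_gap(self, seconds_since_last_probe: float) -> str:
        """Move a healthy surface to UNKNOWN when probe data is too old."""
        # Assumption: "any healthy" means OPERATIONAL or DEGRADED.
        if (seconds_since_last_probe > PROBE_GAP_SECONDS
                and self.state in ("OPERATIONAL", "DEGRADED")):
            self.state = "UNKNOWN"
            self.successes = 0
            self.failures = 0
        return self.state
```

Under these assumptions a single failure shows up publicly as DEGRADED right away, while DOWN and the return to OPERATIONAL each need three consecutive results.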
Conflict resolution rule (multiple actors drive state simultaneously):
Priority order (highest wins): manual override > MAINTENANCE schedule > DOWN > PARTIAL > DEGRADED > UNKNOWN > OPERATIONAL
A FreeScout ticket close cannot transition a surface to OPERATIONAL if the automated prober is reporting DOWN. The prober must confirm recovery before the state clears. Specifically:
- On ticket close: if prober state is OPERATIONAL or UNKNOWN → transition to OPERATIONAL.
- On ticket close: if prober state is DEGRADED/DOWN → transition to DEGRADED (the ticket cleared, but the probe issue persists; the public note is updated to reflect probe-only degradation).
This prevents a closed ticket from masking an ongoing outage.
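The priority order can be sketched as a resolver that takes the states currently reported by each actor and returns the winner. It covers only the priority ordering, not the ticket-close special case. The function name `resolve_state` and its parameters are hypothetical.

```python
from typing import Iterable, Optional

# Severity order from the conflict rule, highest priority first.
SEVERITY_ORDER = ["DOWN", "PARTIAL", "DEGRADED", "UNKNOWN", "OPERATIONAL"]


def resolve_state(
    reported: Iterable[str],
    maintenance_active: bool = False,
    manual_override: Optional[str] = None,
) -> str:
    """Pick the winning public state when several actors report at once.

    `reported` must be non-empty and contain only states in SEVERITY_ORDER.
    """
    if manual_override is not None:
        return manual_override      # manual override outranks everything
    if maintenance_active:
        return "MAINTENANCE"        # schedule outranks reported states
    # Among reported states, the most severe one wins.
    return min(reported, key=SEVERITY_ORDER.index)


print(resolve_state(["OPERATIONAL", "DEGRADED", "DOWN"]))    # DOWN
print(resolve_state(["DOWN"], maintenance_active=True))      # MAINTENANCE
print(resolve_state(["DOWN"], maintenance_active=True,
                    manual_override="OPERATIONAL"))          # OPERATIONAL
```

Keeping the order in one list means a change to the priority rule is a one-line edit rather than a rewrite of branching logic.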
State for 3P surfaces:
3P surfaces (upstream and downstream) have no probe URL. Their state is driven by:
- 3P poller (upstream only, per #608 evaluation)
- FreeScout webhook (operator can manually open a 3P tracking ticket)
- Manual override
3P surfaces default to OPERATIONAL and transition only when the 3P poller, an operator ticket, or a manual override says otherwise. Because there is no probe, they never go UNKNOWN from a probe gap.
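A short sketch of how the three surface classes differ in default state and in which actors may drive them. All identifiers (`SurfaceClass`, the driver strings, the function names) are illustrative assumptions, not names from the codebase.

```python
from enum import Enum


class SurfaceClass(Enum):
    OWNED = "owned"
    UPSTREAM_3P = "upstream_3p"
    DOWNSTREAM_3P = "downstream_3p"


def initial_state(surface: SurfaceClass) -> str:
    """Owned surfaces wait for probe data; 3P surfaces start OPERATIONAL."""
    return "UNKNOWN" if surface is SurfaceClass.OWNED else "OPERATIONAL"


def third_party_drivers(surface: SurfaceClass) -> set:
    """Actors allowed to change the state of a 3P surface."""
    if surface is SurfaceClass.OWNED:
        raise ValueError("owned surfaces are driven by the prober instead")
    drivers = {"freescout_webhook", "manual_override"}
    if surface is SurfaceClass.UPSTREAM_3P:
        drivers.add("3p_poller")    # the poller covers upstream only
    return drivers


def probe_gap_applies(surface: SurfaceClass) -> bool:
    """Only probed (owned) surfaces can go UNKNOWN from a probe gap."""
    return surface is SurfaceClass.OWNED
```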
Severity field on FreeScout tickets:
A third FreeScout custom field incident_severity (options: degraded, partial, down) drives which non-OPERATIONAL state a surface enters when a ticket is opened. If absent, defaults to degraded. This is documented in the #605 runbook.
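The severity mapping, including the default, might look like the sketch below. The shape of `custom_fields` is an assumption about how the webhook handler exposes ticket fields, not FreeScout's documented payload. Rejecting unrecognized values is also an assumption, since the source defines only the absent case.

```python
SEVERITY_TO_STATE = {
    "degraded": "DEGRADED",
    "partial": "PARTIAL",
    "down": "DOWN",
}


def state_for_opened_ticket(custom_fields: dict) -> str:
    """Map the incident_severity custom field to a public state."""
    # An absent or empty field defaults to "degraded".
    severity = custom_fields.get("incident_severity") or "degraded"
    key = severity.strip().lower()
    if key not in SEVERITY_TO_STATE:
        # Assumption: unknown values are rejected rather than guessed.
        raise ValueError(f"unrecognized incident_severity: {severity!r}")
    return SEVERITY_TO_STATE[key]
```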
Positive:
- MAINTENANCE as a distinct state prevents customers from reading a planned downtime as an unplanned incident. It is an important communication distinction.
- The conflict resolution rule ensures that automated probe state cannot be silenced by a ticket close, so closing a ticket cannot hide an ongoing outage. Only an audited superadmin override outranks the prober.
- The prober's N=3 hysteresis in both directions (three consecutive failures before escalating to DOWN; three consecutive successes before clearing to OPERATIONAL) prevents flapping.
- Mapping from the console's internal badge vocabulary to public states is deterministic: HEALTHY → OPERATIONAL, DEGRADED → DEGRADED, FAILED/DOWN → DOWN, STALE/UNKNOWN → UNKNOWN.
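The badge-to-public-state mapping is small enough to express as a lookup table. A minimal sketch, with `BADGE_TO_PUBLIC` and `to_public_state` as illustrative names:

```python
# Internal console badge -> public state, as listed above.
BADGE_TO_PUBLIC = {
    "HEALTHY": "OPERATIONAL",
    "DEGRADED": "DEGRADED",
    "FAILED": "DOWN",
    "DOWN": "DOWN",
    "STALE": "UNKNOWN",
    "UNKNOWN": "UNKNOWN",
}


def to_public_state(badge: str) -> str:
    """Translate an internal badge; raises KeyError for unknown badges."""
    return BADGE_TO_PUBLIC[badge]
```

PARTIAL and MAINTENANCE have no internal badge equivalent in this mapping, so they can only be set by tickets, schedules, or overrides.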
Negative:
- Six states add slightly more complexity than five. Operators must choose between PARTIAL and DEGRADED when opening tickets, so the runbook must define the distinction clearly: PARTIAL = a feature is completely absent; DEGRADED = all features are present but slower or less reliable.
- The conflict resolution priority means a DOWN prober state blocks an OPERATIONAL transition from ticket close. If the prober is wrong (e.g., probe URL changed but config wasn't updated), the state is stuck DOWN until the prober is fixed. This is correct behavior — a misconfigured probe is itself a signal that something needs attention.
Alternatives Rejected: