ADR-0121 — Status Surface Registry: where and how is the surface list stored?
Status: Accepted
Date: 2026-04-28 UTC
Refs: #601, #581, status-raxx-app.md
Context
The surface registry is the canonical list of services tracked on status.raxx.app. Every tile on the status page maps to one registry entry. The registry drives:
- Which surfaces the poller checks
- What FreeScout component_tag values are valid
- What the public API enumerates
Four storage approaches were considered:
- Hardcoded Python list: a list of dicts (or dataclasses) in a module like status/surfaces.py. Adding a surface = a code commit + deploy.
- Postgres table: a status_surfaces table, rows editable via admin UI or migration.
- YAML/JSON config file in repo: config/status-surfaces.yaml committed to the repo. Adding a surface = a code commit (but no deploy of Python logic, just config).
- Pulled from FreeScout component tags: the surface list is derived from the component_tag field options in FreeScout.
Decision
YAML config file in repo (config/status-surfaces.yaml).
The surface list changes on a deployment cadence, not an operational cadence. Adding a new owned service (e.g., a new Heroku app) or removing a 3P integration is a deliberate platform decision that belongs in version control with a PR and review, not a runtime database mutation available to anyone with DB access.
YAML gives:
- Auditability: every surface addition or removal is a commit with an author, timestamp, and context. This is part of the audit trail requirement for any change that affects the public-facing status page.
- Simplicity: no migration required to add a field to the surface schema. The YAML schema evolves with the codebase.
- Operator workflow: operators add a surface by opening a PR to config/status-surfaces.yaml, which also triggers review and ensures the FreeScout runbook is updated in the same PR.
- Bootstrap: the config is available at server start without a DB query. The poller reads it at process start (and on SIGHUP for hot-reload without redeploy).
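The start-time load and SIGHUP reload described in the Bootstrap bullet could be sketched as follows. This is a minimal sketch under stated assumptions, not the poller's actual code: SurfaceRegistry, make_reload_handler, and install_sighup_reload are hypothetical names, and json.loads stands in for a YAML parser (PyYAML's yaml.safe_load would be the natural choice for the real file) so the example stays stdlib-only.

```python
import json
import signal
from pathlib import Path


class SurfaceRegistry:
    """Hypothetical in-memory registry loaded from a config file."""

    def __init__(self, path, loader=json.loads):
        self._path = Path(path)
        self._loader = loader      # callable: file text -> list of entry dicts
        self._surfaces = {}
        self.reload()              # synchronous load; a failure here aborts startup

    def reload(self):
        entries = self._loader(self._path.read_text(encoding="utf-8"))
        # Build the new mapping completely, then swap it in with one assignment,
        # so readers never observe a half-loaded registry.
        self._surfaces = {entry["id"]: entry for entry in entries}

    def ids(self):
        return set(self._surfaces)


def make_reload_handler(registry):
    """Return a signal handler that reloads and keeps the old list on failure."""

    def _handler(signum, frame):
        try:
            registry.reload()
        except Exception as exc:   # bad file: keep serving the previous registry
            print(f"surface registry reload failed, keeping previous list: {exc}")

    return _handler


def install_sighup_reload(registry):
    # SIGHUP does not exist on Windows, hence the guard.
    if hasattr(signal, "SIGHUP"):
        signal.signal(signal.SIGHUP, make_reload_handler(registry))
```

Parsing before swapping means a malformed edit that somehow reaches a running host leaves the previous surface list in place instead of emptying the registry.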
The YAML file is the single source of truth. The Postgres status_state table stores runtime state (current status, last probe result, ticket_pending) keyed by surface_id. The YAML is authoritative for what surfaces exist; Postgres is authoritative for what state they are in.
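The split of authority above (YAML decides which surfaces exist, Postgres decides what state they are in) can be illustrated with a small merge function. This is a sketch only: SurfaceState, build_status_view, and the "unknown" default status are assumptions made for illustration, and the state rows are plain objects rather than real database results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SurfaceState:
    """Hypothetical shape of one status_state row."""
    surface_id: str
    status: str
    ticket_pending: bool


def build_status_view(registry_ids, state_rows, default_status="unknown"):
    """Merge the registry with runtime state; the registry drives membership."""
    by_id = {row.surface_id: row for row in state_rows}
    view = []
    for surface_id in registry_ids:          # registry order drives the output
        row = by_id.get(surface_id)
        view.append({
            "id": surface_id,
            # A surface with no state row yet still appears, with a default status.
            "status": row.status if row else default_status,
            "ticket_pending": row.ticket_pending if row else False,
        })
    # State rows whose surface_id is not in the registry are ignored.
    return view
```

Under this sketch, removing an entry from the YAML removes the tile even if a stale state row remains, and a newly added surface shows up before its first probe has written any state.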
Consequences
Positive:
- Adding a surface requires only a YAML edit + PR — no migration, no DB admin console needed.
- The file is readable as documentation. Engineers and operators can see the full surface taxonomy by reading one file.
- Bootstrap is synchronous — server start always has the full registry without a cold-start DB read.
- The registry is version-controlled; rollback of a bad surface addition is git revert.
Negative:
- Adding a surface still requires a commit and a deploy of the config file, even though no Python logic changes. For a platform that changes surfaces infrequently (estimated: once per quarter), this is an acceptable constraint.
- FreeScout's component_tag dropdown options must be kept in sync with the YAML manually (the #605 runbook covers this). There is no automatic sync. This is a documented operational dependency.
- If the surface list grows very large (>100 entries), YAML readability degrades. This is not a near-term concern.
Alternatives Rejected:
- Hardcoded Python list: Rejected. It shares the version-control benefits that make YAML acceptable but is worse on two counts: a Python list change requires a logic deploy rather than just a config change, and the diff is harder to review.
- Postgres table: Rejected because runtime mutability of the surface registry creates an ops risk. An accidental delete or a rogue API call could blank the status page. The registry is not operational data; it is configuration. Configuration belongs in code.
- Pulled from FreeScout: Rejected because it inverts the dependency. FreeScout should derive its component_tag options from the canonical registry, not the other way around. The registry must be authoritative for the status system; FreeScout is a downstream consumer of it.
Schema
See status-raxx-app.md §3 for the full YAML field specification and example entries.
Key fields per entry: id, display_name, category (owned | upstream_3p | downstream_3p), probe_url (nullable), partner_status_url (nullable), public_description, partner_name (for 3P only — the partner's own product name, used in "tracking [partner]'s incident" copy).
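As an illustration of the key fields listed above, one entry could be modeled and validated like this. It is a sketch, not the schema in status-raxx-app.md §3: the Surface class and parse_surface helper are hypothetical, the rule that owned surfaces carry no partner_name is an assumption read from "for 3P only", and the public field comes from the amendment recorded later in this ADR.

```python
from dataclasses import dataclass
from typing import Optional

CATEGORIES = {"owned", "upstream_3p", "downstream_3p"}


@dataclass(frozen=True)
class Surface:
    """Hypothetical typed view of one registry entry."""
    id: str
    display_name: str
    category: str
    public_description: str
    probe_url: Optional[str] = None            # nullable
    partner_status_url: Optional[str] = None   # nullable
    partner_name: Optional[str] = None         # 3P surfaces only
    public: bool = True                        # absent means visible to customers

    def __post_init__(self):
        if self.category not in CATEGORIES:
            raise ValueError(f"{self.id}: unknown category {self.category!r}")
        # Assumed rule: partner_name is meaningful only for 3P categories.
        if self.category == "owned" and self.partner_name is not None:
            raise ValueError(f"{self.id}: owned surfaces must not set partner_name")


def parse_surface(raw: dict) -> Surface:
    # Unknown or missing required keys raise TypeError from the constructor.
    return Surface(**raw)


# Illustrative entry only; the id, name, and URL are made up.
example = parse_surface({
    "id": "example-api",
    "display_name": "Example API",
    "category": "owned",
    "public_description": "Public HTTP API",
    "probe_url": "https://example.com/health",
})
```

Validating at load time keeps a malformed entry from reaching the poller or the public API.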
Amendment — public field (2026-05-05, #1149)
A public: bool field (default true) was added to the surface entry schema following a security audit that identified four operator-internal surfaces leaking on the customer status page.
Semantics
| public value | Customer status page | Operator console | Probes + 3P poller |
|---|---|---|---|
| true (or absent) | Visible | Visible | Active |
| false | Hidden | Visible | Active |
public: false means hidden from customers, not removed from monitoring. Probes and the 3P poller continue to write state for public: false surfaces. The operator console and /api/internal/status/* endpoints always see all surfaces.
Enforcement
- Backend (status_registry.py): get_public_surface_registry() filters to public != false. All /api/status/public/* routes must call this function, not get_surface_registry(). The full get_surface_registry() is used by the prober, the 3P poller, and the webhook receiver so they accept state updates for all surfaces.
- Worker (frontend/status-worker/src/index.ts): INTERNAL_ONLY_SURFACE_IDS (a Set<string>) holds the IDs of public: false surfaces. handleGetSurfaces filters D1 rows through this set before building the response. The set must be kept in sync with the YAML whenever the public flag changes.
- is_known_surface(): Always checks the full registry; internal surfaces must be accepted by the webhook receiver.
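The backend rules above could look roughly like the following. The function names match the ones this ADR mentions, but the bodies are an assumed sketch rather than the contents of status_registry.py, and the two sample entries (including their display names) are illustrative data.

```python
# Illustrative registry contents; the real list is loaded from the YAML file.
_REGISTRY = [
    {"id": "example-api", "display_name": "Example API"},
    {"id": "vault-raxx-app", "display_name": "Secrets Vault", "public": False},
]


def get_surface_registry():
    """Full registry: used by the prober, the 3P poller, and the webhook receiver."""
    return list(_REGISTRY)


def get_public_surface_registry():
    """Customer-facing view: keeps entries whose public flag is true or absent."""
    return [s for s in _REGISTRY if s.get("public", True) is not False]


def is_known_surface(surface_id):
    """Checks the full registry so internal surfaces still accept state updates."""
    return any(s["id"] == surface_id for s in _REGISTRY)
```

With this data, get_public_surface_registry() returns only the example-api entry, while is_known_surface("vault-raxx-app") is still true, which is what lets probes keep writing state for hidden surfaces.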
Surfaces currently marked public: false
| ID | Reason |
|---|---|
| console-raxx-app | Exposes existence and uptime of admin panel |
| internal-docs-raxx-app | Name announces it is internal; no customer relevance |
| vault-raxx-app | Exposes secrets infrastructure to attackers; "Secrets Vault" is an enumeration vector |
| ci-runners | Internal build infrastructure; no customer relevance |
Adding or changing public: false surfaces
- Update config/status-surfaces.yaml and backend_v2/status/surface_registry.yaml (add/remove public: false).
- Update INTERNAL_ONLY_SURFACE_IDS in frontend/status-worker/src/index.ts.
- Update INTERNAL_ONLY_IDS in backend_v2/tests/unit/test_status_surface_registry.py and the mirror set in frontend/status-worker/test/worker.test.ts.
- Open a PR; the test suite enforces the contract so CI will catch drift.