- Status: Accepted (updated 2026-04-30 UTC; D1 migration per #646)
- Date: 2026-04-28 UTC
- Owner: software-architect
- Parent epic: #581
- Design card: #601
- ADRs: 0028 · 0029 · 0030
- Downstream sub-cards: #602 (infra/DNS) · #603 (public API backend) · #604 (React app) · #605 (FreeScout fields) · #607 (webhook receiver) · #608 (3P poller eval) · #646 (D1 migration)
D1 UPDATE (2026-04-30, #646): The runtime state tables (status_state, status_incidents, status_audit_log) have migrated from the shared Heroku Postgres instance to Cloudflare D1 (raxx-status-db), served by a CF Worker (raxx-status-worker) at status.raxx.app/api/*. All references to "Postgres" in this document refer to the superseded architecture. The current schema is SQLite-compatible and identical in structure. Writers (3P poller, FreeScout webhook) POST to the Worker's internal HTTP API instead of writing to a local DB. See docs/ops/runbooks/status-d1.md for operational details.
status.raxx.app is a customer-facing page showing the live health of every service Raxx operates or depends on. It is the communication channel between the ops team and users during incidents. It must be reachable when things are broken, must never expose operator-internal data, and must reflect ticket updates within 5 minutes.
This document covers the foundational layer: surface registry, state machine, data model, FreeScout webhook contract, and public API shape. All downstream sub-cards build on these definitions.
All platform invariants apply. Status-specific constraints:
- status_audit_log with actor, timestamp, previous state, new state, and source (prober | freescout | manual | schedule).
- ticket_pending: true/false, public_note (operator-written summary, max 280 chars), and resolved_at on historical entries.

config/status-surfaces.yaml

# config/status-surfaces.yaml
# One entry per status-page surface.
# id must be a stable slug — never change after first deploy.
# Changing id requires a migration of status_state rows and FreeScout dropdown options.
surfaces:
- id: raxx-app
display_name: "raxx.app (marketing site)"
category: owned # owned | upstream_3p | downstream_3p
probe_url: "https://raxx.app/health"
partner_status_url: null
public_description: "Raxx marketing and public website."
partner_name: null
- id: getraxx-com
display_name: "getraxx.com"
category: owned
probe_url: "https://getraxx.com/"
partner_status_url: null
public_description: "Raxx marketing domain redirect."
partner_name: null
- id: app-raxx-app
display_name: "Raxx App"
category: owned
probe_url: "https://app.raxx.app/health"
partner_status_url: null
public_description: "The main Raxx trading application."
partner_name: null
- id: demo-raxx-app
display_name: "Demo"
category: owned
probe_url: "https://demo.raxx.app/health"
partner_status_url: null
public_description: "Raxx demo environment."
partner_name: null
- id: docs-raxx-app
display_name: "Documentation"
category: owned
probe_url: "https://docs.raxx.app/"
partner_status_url: null
public_description: "Raxx public documentation."
partner_name: null
- id: internal-docs-raxx-app
display_name: "Internal Docs"
category: owned
probe_url: "https://internal-docs.raxx.app/"
partner_status_url: null
public_description: "Internal team documentation."
partner_name: null
- id: console-raxx-app
display_name: "Operator Console"
category: owned
probe_url: "https://console.raxx.app/health"
partner_status_url: null
public_description: "Internal operator console."
partner_name: null
- id: tickets-raxx-app
display_name: "Support Tickets"
category: owned
probe_url: "https://tickets.raxx.app/"
partner_status_url: null
public_description: "Customer support ticketing system."
partner_name: null
- id: status-raxx-app
display_name: "Status Page"
category: owned
probe_url: "https://status.raxx.app/"
partner_status_url: null
public_description: "This status page."
partner_name: null
# Email flows
- id: email-outbound-transactional
display_name: "Outbound Email (Transactional)"
category: owned
probe_url: null # no HTTP probe; state driven by FreeScout ticket or 3P poller
partner_status_url: null
public_description: "Account confirmations, notifications, and password-reset emails from no-reply@raxx.app."
partner_name: null
- id: email-inbound-support
display_name: "Support Inbox"
category: owned
probe_url: null
partner_status_url: null
public_description: "Inbound email to support@raxx.app and ticket creation."
partner_name: null
# Internal services
- id: api-service
display_name: "API Service"
category: owned
probe_url: "https://api.raxx.app/health"
partner_status_url: null
public_description: "Core Raxx API (data, backtesting, account management)."
partner_name: null
- id: vault-raxx-app
display_name: "Secrets Vault"
category: owned
probe_url: "https://vault.raxx.app/api/v1/health"
partner_status_url: null
public_description: "Internal credential and secrets management. Degradation does not directly affect trading."
partner_name: null
- id: ci-runners
display_name: "CI / Deployment Pipeline"
category: owned
probe_url: null
partner_status_url: null
public_description: "GitHub Actions CI and deployment workflows."
partner_name: null
# Upstream 3P (we depend on them for our own operation)
- id: cloudflare
display_name: "Cloudflare"
category: upstream_3p
probe_url: null
partner_status_url: "https://www.cloudflarestatus.com/api/v2/summary.json"
public_description: "CDN, DDoS protection, and access management. Cloudflare incidents may affect all Raxx-owned sites."
partner_name: "Cloudflare"
- id: heroku-platform
display_name: "Hosting Platform"
category: upstream_3p
probe_url: null
partner_status_url: "https://status.heroku.com/api/v4/current-status"
public_description: "Cloud hosting for Raxx application services. Platform incidents may affect app availability."
partner_name: "Heroku"
- id: email-delivery
display_name: "Email Delivery"
category: upstream_3p
probe_url: null
partner_status_url: "https://status.postmarkapp.com/api/v2/summary.json"
public_description: "Email delivery infrastructure. Incidents may delay transactional emails."
partner_name: "Postmark"
- id: billing-payment
display_name: "Billing & Payments"
category: upstream_3p
probe_url: null
partner_status_url: "https://status.stripe.com/api/v2/summary.json"
public_description: "Payment processing. Billing incidents may affect subscription management."
partner_name: "Stripe"
- id: workspace
display_name: "Team Workspace"
category: upstream_3p
probe_url: null
partner_status_url: "https://www.google.com/appsstatus/rss/en"
public_description: "Collaboration tools used by the Raxx team. Workspace incidents do not affect trading."
partner_name: "Google Workspace"
- id: secrets-management
display_name: "Secrets Management"
category: upstream_3p
probe_url: null
partner_status_url: null # Infisical Cloud status URL — evaluate in #608
public_description: "Managed secrets infrastructure. Incidents may affect credential rotation, not active trading."
partner_name: "Infisical"
# Downstream 3P (user-facing; we route to them; generic names per invariant)
- id: broker-connectivity
display_name: "Broker Connectivity"
category: downstream_3p
probe_url: null
partner_status_url: null # no single partner URL — per BYOB hybrid; evaluate per #608
public_description: "Connection to your brokerage for order routing and account data. Your trades are unaffected during Raxx-side degradations; broker-side incidents are noted here."
partner_name: null # intentionally null — generic copy only
- id: market-data-feed
display_name: "Market Data Feed"
category: downstream_3p
probe_url: null
partner_status_url: null
public_description: "Real-time and historical market data. Feed disruptions may delay chart updates and backtest data."
partner_name: null
- id: trade-execution
display_name: "Trade Execution Path"
category: downstream_3p
probe_url: null
partner_status_url: null
public_description: "Order submission and execution. Disruptions are noted here with status from your connected broker."
partner_name: null
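The registry rules stated in the comments above (stable slug ids, three allowed categories, no partner name on downstream surfaces) lend themselves to a mechanical check. The following is a minimal sketch, assuming the YAML has already been parsed into a list of dicts. The function name `validate_surfaces` and the exact slug pattern are hypothetical and not part of this design.

```python
import re

ALLOWED_CATEGORIES = {"owned", "upstream_3p", "downstream_3p"}
# Assumed slug shape: lowercase letters and digits separated by single hyphens.
SLUG_RE = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")


def validate_surfaces(surfaces):
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    seen = set()
    for s in surfaces:
        sid = s.get("id", "")
        if not SLUG_RE.match(sid):
            problems.append(f"{sid!r}: id is not a lowercase slug")
        if sid in seen:
            problems.append(f"{sid!r}: duplicate id")
        seen.add(sid)
        if s.get("category") not in ALLOWED_CATEGORIES:
            problems.append(f"{sid!r}: unknown category {s.get('category')!r}")
        # Downstream surfaces use generic copy only, so partner_name stays null.
        if s.get("category") == "downstream_3p" and s.get("partner_name") is not None:
            problems.append(f"{sid!r}: downstream_3p must not set partner_name")
    return problems
```

A check like this could run in CI before deploy, so that a renamed or duplicated id is caught before it orphans a status_state row.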
status_state

Postgres table. One row per surface_id. Written by: probe worker, FreeScout webhook receiver, 3P poller, manual override endpoint.
CREATE TABLE status_state (
surface_id TEXT PRIMARY KEY, -- FK to YAML registry id
state TEXT NOT NULL -- OPERATIONAL|DEGRADED|PARTIAL|DOWN|MAINTENANCE|UNKNOWN
CHECK (state IN ('OPERATIONAL','DEGRADED','PARTIAL','DOWN','MAINTENANCE','UNKNOWN')),
state_since TIMESTAMPTZ NOT NULL DEFAULT NOW(),
state_source TEXT NOT NULL -- prober|freescout|3p_poller|manual|schedule
CHECK (state_source IN ('prober','freescout','3p_poller','manual','schedule')),
ticket_pending BOOLEAN NOT NULL DEFAULT FALSE,
public_note TEXT, -- operator-written, max 280 chars; NULL if no active note
last_probe_ok BOOLEAN, -- NULL for surfaces with no probe_url
last_probe_at TIMESTAMPTZ,
last_probe_latency_ms INTEGER,
maintenance_until TIMESTAMPTZ, -- non-null only when state=MAINTENANCE
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE TABLE status_audit_log (
id BIGSERIAL PRIMARY KEY,
surface_id TEXT NOT NULL,
actor TEXT NOT NULL, -- 'prober'|'freescout'|'3p_poller'|operator_id
previous_state TEXT,
new_state TEXT NOT NULL,
source TEXT NOT NULL,
note TEXT, -- human-readable reason (optional)
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX ON status_audit_log (surface_id, created_at DESC);
CREATE TABLE status_incidents (
id BIGSERIAL PRIMARY KEY,
surface_id TEXT NOT NULL,
opened_at TIMESTAMPTZ NOT NULL,
resolved_at TIMESTAMPTZ, -- NULL = still open
public_note TEXT, -- final operator-written note at close
freescout_ticket_id BIGINT UNIQUE, -- internal reference only, never exposed publicly; UNIQUE is required by the webhook's ON CONFLICT upsert
severity TEXT CHECK (severity IN ('degraded','partial','down','maintenance')),
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX ON status_incidents (surface_id, resolved_at DESC);
-- Incidents older than 30 days are moved to cold storage or deleted by a nightly job.
-- The 30-day window is a product requirement from #581.
| State | Public label | Badge color | Description |
|---|---|---|---|
| OPERATIONAL | Operational | Green | Fully functional |
| DEGRADED | Degraded performance | Amber | Functional but impaired |
| PARTIAL | Partial outage | Orange | Subset of functionality unavailable |
| DOWN | Service disruption | Red | Non-functional |
| MAINTENANCE | Scheduled maintenance | Blue/slate | Planned downtime |
| UNKNOWN | Status unknown | Slate | No data yet or probe gap |
stateDiagram-v2
[*] --> UNKNOWN : surface registered
UNKNOWN --> OPERATIONAL : prober(N=3 ok)
UNKNOWN --> DEGRADED : freescout(ticket open, severity=degraded)
UNKNOWN --> DOWN : freescout(ticket open, severity=down)
OPERATIONAL --> DEGRADED : prober(1 failure) | freescout(ticket open, severity=degraded) | 3p_poller(partner degraded)
OPERATIONAL --> PARTIAL : freescout(ticket open, severity=partial)
OPERATIONAL --> DOWN : prober(N=3 failures) | freescout(ticket open, severity=down) | 3p_poller(partner down)
OPERATIONAL --> MAINTENANCE : schedule(window start) | manual
OPERATIONAL --> UNKNOWN : prober(gap > 90s)
DEGRADED --> OPERATIONAL : prober(N=3 ok) AND no open ticket
DEGRADED --> DOWN : prober(N=3 failures) | freescout(severity escalated to down)
DEGRADED --> MAINTENANCE : schedule(window start) | manual
DEGRADED --> UNKNOWN : prober(gap > 90s, no open ticket)
PARTIAL --> OPERATIONAL : freescout(ticket close) AND prober(ok or no probe)
PARTIAL --> DOWN : freescout(severity escalated)
PARTIAL --> MAINTENANCE : manual
DOWN --> DEGRADED : prober(N=3 ok) AND open ticket exists
DOWN --> OPERATIONAL : prober(N=3 ok) AND no open ticket
DOWN --> MAINTENANCE : schedule | manual
MAINTENANCE --> OPERATIONAL : schedule(window end) — prober confirms within 60s
MAINTENANCE --> DOWN : prober(failure at window end)
note right of OPERATIONAL
Conflict rule: manual override > MAINTENANCE > DOWN > PARTIAL > DEGRADED > UNKNOWN > OPERATIONAL
end note
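The conflict rule in the diagram note can be read as a precedence list. The sketch below is one possible reading, assuming each source proposes a state and an operator override, if present, wins outright. The function name `resolve_conflict` and its signature are hypothetical.

```python
# Precedence from the conflict rule, highest first (manual override is handled separately).
PRECEDENCE = ["MAINTENANCE", "DOWN", "PARTIAL", "DEGRADED", "UNKNOWN", "OPERATIONAL"]


def resolve_conflict(candidates, manual_override=None):
    """Pick the winning state among the states proposed by different sources.

    candidates: iterable of state strings, for example one per source.
    manual_override: a state set by an operator; wins over everything if given.
    """
    if manual_override is not None:
        return manual_override
    proposed = set(candidates)
    for state in PRECEDENCE:
        if state in proposed:
            return state
    # No source has reported anything yet.
    return "UNKNOWN"
```

For example, `resolve_conflict(["OPERATIONAL", "DEGRADED"])` yields `"DEGRADED"`, because OPERATIONAL sits at the bottom of the precedence list.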
The prober applies asymmetric hysteresis to prevent flapping: a single failure is enough to degrade a surface, but reaching DOWN or recovering requires N=3 consecutive results.
- A surface transitions from OPERATIONAL to DEGRADED after 1 probe failure (fast detection).
- A surface transitions from DEGRADED to DOWN after 3 consecutive probe failures.
- A surface transitions from DOWN/DEGRADED to OPERATIONAL after 3 consecutive probe successes.

Probes run every 60 seconds. Maximum detection latency from failure onset to DEGRADED: 60s. Maximum recovery latency from the last failed probe to OPERATIONAL: 3 minutes.
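The hysteresis rules can be expressed as a small streak counter. This is a sketch of the prober signal only, under two assumptions: tickets, maintenance windows, and probe gaps are ignored, and the failure that caused DEGRADED counts toward the three needed for DOWN (which matches the OPERATIONAL to DOWN edge in the diagram). The class name `ProbeHysteresis` is hypothetical.

```python
class ProbeHysteresis:
    """Tracks consecutive probe results for one surface (prober signal only)."""

    RECOVER_AFTER = 3  # consecutive successes needed to return to OPERATIONAL
    DOWN_AFTER = 3     # consecutive failures needed to reach DOWN

    def __init__(self, state="UNKNOWN"):
        self.state = state
        self.ok_streak = 0
        self.fail_streak = 0

    def record(self, ok):
        """Feed one probe result and return the resulting state."""
        if ok:
            self.ok_streak += 1
            self.fail_streak = 0
            recoverable = self.state in ("UNKNOWN", "DEGRADED", "DOWN")
            if recoverable and self.ok_streak >= self.RECOVER_AFTER:
                self.state = "OPERATIONAL"
        else:
            self.fail_streak += 1
            self.ok_streak = 0
            if self.state == "OPERATIONAL":
                # Fast detection: one failure degrades an operational surface.
                self.state = "DEGRADED"
            elif self.state == "DEGRADED" and self.fail_streak >= self.DOWN_AFTER:
                self.state = "DOWN"
        return self.state
```

Starting from OPERATIONAL, three failed probes walk the surface through DEGRADED, DEGRADED, DOWN; three successful probes afterwards bring it back to OPERATIONAL only on the third.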
FreeScout fires a webhook on ticket create, update, and close. The webhook receiver (POST /api/webhooks/freescout) on raxx-api-prod processes the event, updates status_state, writes to status_incidents, and then signals the CF Worker cache invalidation endpoint.
FreeScout webhook version: v1 (standard JSON POST). FreeScout supports HMAC-SHA256 signature verification; use this (not plaintext shared secret). The signature is in the X-FreeScout-Signature header as sha256=<hex>.
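Signature verification along these lines might look like the sketch below. It assumes the HMAC is computed over the raw request body with the shared webhook secret, which this document does not state explicitly; the function name `verify_freescout_signature` is hypothetical.

```python
import hashlib
import hmac


def verify_freescout_signature(secret: bytes, raw_body: bytes, header_value: str) -> bool:
    """Return True if header_value ("sha256=<hex>") matches the body's HMAC-SHA256."""
    prefix = "sha256="
    if not header_value or not header_value.startswith(prefix):
        return False
    provided = header_value[len(prefix):].lower()
    expected = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    # compare_digest runs in constant time, so the match position is not leaked.
    return hmac.compare_digest(expected.encode("utf-8"), provided.encode("utf-8"))
```

The check has to run on the bytes exactly as received, before any JSON parsing, since re-serialising the payload would change the digest.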
Per #605, two custom fields are required on FreeScout tickets (and a third added by ADR-0030):
| FreeScout field slug | Type | Purpose |
|---|---|---|
| component_tag | Select (dropdown) | Maps ticket to a surface_id from the registry |
| public_status | Text (max 280 chars) | Operator-written public note mirrored to status page |
| incident_severity | Select | degraded / partial / down; drives state transition target |
The field slugs above are the slugs the webhook receiver reads. They must match exactly. The #605 runbook documents how to create these fields and verify their slugs via FreeScout API.
FreeScout POSTs a JSON body. The receiver only reads the fields below; all others are ignored.
{
"event": "ticket.created", // ticket.created | ticket.updated | ticket.status_changed
"ticket": {
"id": 4821, // FreeScout numeric ticket ID — stored in status_incidents, never exposed publicly
"status": "active", // active | pending | closed
"customFields": [
{
"id": 12, // FreeScout field ID — not used; slug is canonical
"name": "Component Tag",
"slug": "component_tag",
"value": "app-raxx-app" // must match a surface_id in the YAML registry
},
{
"id": 13,
"name": "Public Status",
"slug": "public_status",
"value": "Login flow is intermittently failing. We are investigating."
},
{
"id": 14,
"name": "Incident Severity",
"slug": "incident_severity",
"value": "degraded" // degraded | partial | down
}
]
}
}
function handle_freescout_webhook(payload):
1. Verify HMAC-SHA256 signature → HTTP 401 if invalid
2. Extract surface_id = customFields["component_tag"].value
3. If surface_id not in registry → log warning, return HTTP 200 (graceful no-op)
4. Extract public_note = customFields["public_status"].value (truncate to 280 chars)
5. Extract severity = customFields["incident_severity"].value (default "degraded")
6. Extract ticket_status = payload.ticket.status
if ticket_status in ("active", "pending"):
7a. Upsert status_state:
ticket_pending = true
public_note = public_note
state = resolve_ticket_state(current_prober_state, severity)
state_source = "freescout"
updated_at = now()
7b. Upsert status_incidents:
opened_at = now() if new; leave if existing
severity = severity
public_note = public_note
7c. Write status_audit_log row
7d. Trigger CF cache invalidation for the surface
if ticket_status == "closed":
8a. Determine new state:
if last_probe_ok is true or (probe_url is null): new_state = "OPERATIONAL"
else: new_state = "DEGRADED" (probe still failing — ticket closed but issue persists)
8b. Update status_state:
ticket_pending = false
public_note = null
state = new_state
state_source = "freescout"
updated_at = now()
8c. Close status_incidents row:
resolved_at = now()
public_note = public_note (the final note at close)
8d. Write status_audit_log row
8e. Trigger CF cache invalidation
function resolve_ticket_state(prober_state, severity):
if prober_state == "DOWN": return "DOWN" // probe wins; don't downgrade
if severity == "down": return "DOWN"
if severity == "partial": return "PARTIAL"
return "DEGRADED" // default for severity=degraded or unset
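The open-ticket branch of the pseudocode above can be sketched in Python as follows. It covers only steps 2 to 7a and leaves out signature checking, persistence, and the closed-ticket branch. Because customFields arrives as a list, values are looked up by slug. The helper names `custom_field` and `plan_update`, and the fallback for unrecognised severity values, are assumptions made for illustration.

```python
MAX_NOTE_CHARS = 280
VALID_SEVERITIES = {"degraded", "partial", "down"}


def custom_field(ticket, slug, default=None):
    """customFields is a list of objects, so look the value up by slug."""
    for field in ticket.get("customFields", []):
        if field.get("slug") == slug:
            return field.get("value", default)
    return default


def resolve_ticket_state(prober_state, severity):
    """Map ticket severity to a state; a DOWN probe result is never downgraded."""
    if prober_state == "DOWN" or severity == "down":
        return "DOWN"
    if severity == "partial":
        return "PARTIAL"
    return "DEGRADED"  # severity is "degraded" or unset


def plan_update(payload, known_surfaces, prober_state):
    """Return the status_state fields to write for an open ticket, or None for a no-op."""
    ticket = payload["ticket"]
    surface_id = custom_field(ticket, "component_tag")
    if surface_id not in known_surfaces:
        return None  # unknown surface: caller logs a warning and returns HTTP 200
    if ticket.get("status") not in ("active", "pending"):
        return None  # the closed-ticket branch is omitted from this sketch
    severity = custom_field(ticket, "incident_severity") or "degraded"
    if severity not in VALID_SEVERITIES:
        severity = "degraded"  # assumption: unknown values fall back to the default
    note = custom_field(ticket, "public_status") or ""
    return {
        "surface_id": surface_id,
        "state": resolve_ticket_state(prober_state, severity),
        "ticket_pending": True,
        "public_note": note[:MAX_NOTE_CHARS] or None,
        "state_source": "freescout",
    }
```

Keeping the decision logic free of I/O, as here, makes it straightforward to unit-test against recorded webhook payloads.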
The FreeScout webhook fires on every ticket update, including field edits. The receiver must be idempotent:
- status_state rows are upserted on surface_id. Multiple webhook calls for the same ticket and same state are no-ops at the DB level (no duplicate rows).
- status_incidents rows use freescout_ticket_id as a unique key — INSERT ... ON CONFLICT (freescout_ticket_id) DO UPDATE prevents duplicate incident rows.
- Cache invalidation is idempotent (CF purge by URL is safe to call multiple times).
If FreeScout retries a webhook (e.g., after a timeout), the second call produces identical state. No deduplication queue is needed.
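The idempotency claim for status_incidents can be demonstrated with an in-memory SQLite database, which is close to what D1 runs. The sketch below uses a trimmed-down table and assumes a UNIQUE constraint on freescout_ticket_id, which ON CONFLICT needs in order to detect the duplicate. The helper name `record_incident` is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE status_incidents (
        id INTEGER PRIMARY KEY,
        surface_id TEXT NOT NULL,
        opened_at TEXT NOT NULL,
        public_note TEXT,
        freescout_ticket_id INTEGER UNIQUE,  -- ON CONFLICT needs this constraint
        severity TEXT
    )
    """
)

UPSERT = """
INSERT INTO status_incidents
    (surface_id, opened_at, public_note, freescout_ticket_id, severity)
VALUES (?, ?, ?, ?, ?)
ON CONFLICT (freescout_ticket_id) DO UPDATE SET
    public_note = excluded.public_note,
    severity = excluded.severity
"""


def record_incident(surface_id, now_iso, note, ticket_id, severity):
    # opened_at is set only on first insert; the update clause leaves it untouched.
    with conn:
        conn.execute(UPSERT, (surface_id, now_iso, note, ticket_id, severity))


# Deliver the same webhook twice, as a FreeScout retry would.
record_incident("app-raxx-app", "2026-04-28T13:55:00Z", "Investigating.", 4821, "degraded")
record_incident("app-raxx-app", "2026-04-28T13:56:00Z", "Investigating.", 4821, "degraded")

rows = conn.execute(
    "SELECT opened_at, public_note FROM status_incidents"
).fetchall()
```

After both calls the table holds a single row whose opened_at is still the timestamp of the first delivery, which is the behaviour step 7b describes ("opened_at = now() if new; leave if existing").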
FreeScout retries failed webhook deliveries (non-2xx response) up to 3 times with exponential backoff. The receiver must return HTTP 200 for all successfully processed events, including graceful no-ops (unknown surface_id). Any non-2xx response, HTTP 5xx in particular, causes a retry.
A separate "webhook retry queue" sub-card is not needed at this scale. If FreeScout's built-in retry is insufficient (e.g., if the receiver itself is down for an extended period), the operator can manually re-trigger by updating any field on the affected ticket.
All endpoints are served by the Cloudflare Worker (per ADR-0028). The Worker fetches fresh data from GET /internal/status/snapshot on raxx-api-prod and caches the response.
GET /api/status/public/surfaces
Returns the current state of all surfaces. Cache: 60s.
{
"generated_at": "2026-04-28T14:30:00Z", // ISO 8601 UTC
"overall_status": "DEGRADED", // worst non-OPERATIONAL state across owned surfaces
"surfaces": [
{
"id": "app-raxx-app",
"display_name": "Raxx App",
"category": "owned", // owned | upstream_3p | downstream_3p
"state": "DEGRADED", // OPERATIONAL|DEGRADED|PARTIAL|DOWN|MAINTENANCE|UNKNOWN
"state_since": "2026-04-28T13:55:00Z",
"ticket_pending": true,
"public_note": "Login flow is intermittently failing. We are investigating.",
"public_description": "The main Raxx trading application.",
"last_checked_at": "2026-04-28T14:29:45Z", // null for surfaces with no probe_url
"maintenance_until": null // ISO 8601 UTC or null
},
{
"id": "cloudflare",
"display_name": "Cloudflare",
"category": "upstream_3p",
"state": "OPERATIONAL",
"state_since": "2026-04-01T00:00:00Z",
"ticket_pending": false,
"public_note": null,
"public_description": "CDN, DDoS protection, and access management. Cloudflare incidents may affect all Raxx-owned sites.",
"last_checked_at": null,
"maintenance_until": null
},
{
"id": "broker-connectivity",
"display_name": "Broker Connectivity",
"category": "downstream_3p",
"state": "OPERATIONAL",
"state_since": "2026-04-01T00:00:00Z",
"ticket_pending": false,
"public_note": null,
"public_description": "Connection to your brokerage for order routing and account data.",
"last_checked_at": null,
"maintenance_until": null
}
]
}
Fields that are never present in the public response: probe_url, partner_status_url, last_probe_latency_ms, freescout_ticket_id, partner_name for downstream_3p (generic copy only), any internal hostname or error string from the probe.
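One way to enforce the exclusion list is to build the public object from an allow-list instead of deleting known-sensitive keys. The sketch below assumes the internal record is a dict that already carries the public key names (for instance, that last_probe_at has been mapped to last_checked_at upstream). The names `PUBLIC_FIELDS` and `to_public_surface` are hypothetical.

```python
PUBLIC_FIELDS = (
    "id", "display_name", "category", "state", "state_since",
    "ticket_pending", "public_note", "public_description",
    "last_checked_at", "maintenance_until",
)


def to_public_surface(internal):
    """Build the public view of a surface from an internal record.

    Only keys in PUBLIC_FIELDS are copied, so an internal field added later
    stays private unless it is explicitly listed here. Missing keys become None.
    """
    return {key: internal.get(key) for key in PUBLIC_FIELDS}
```

With this shape, probe_url, freescout_ticket_id, and last_probe_latency_ms never reach the response even if they are present on the input record.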
GET /api/status/public/incidents
Returns resolved incidents from the last 30 days. Cache: 300s.
{
"generated_at": "2026-04-28T14:30:00Z",
"incidents": [
{
"id": "inc_00421", // opaque identifier — not the FreeScout ticket ID
"surface_id": "app-raxx-app",
"surface_display_name": "Raxx App",
"severity": "degraded",
"opened_at": "2026-04-25T10:15:00Z",
"resolved_at": "2026-04-25T11:42:00Z",
"public_note": "Login flow intermittently failed. Root cause identified and resolved."
}
]
}
Active (unresolved) incidents are represented by ticket_pending: true on the surface in /surfaces, not in the /incidents list. The /incidents list is historical only.
GET /api/status/public/widgets/market-time
Returns current market time data for the MarketTimeWidget. Cache: 30s.
{
"generated_at": "2026-04-28T14:30:00Z",
"utc_now": "2026-04-28T14:30:00Z",
"et_now": "2026-04-28T10:30:00-04:00",
"market_state": "open", // open | closed | pre-market | after-hours | holiday
"next_event": {
"event": "market_close",
"at": "2026-04-28T20:00:00Z"
},
"holiday": null // or {"name": "Independence Day", "date": "2026-07-04"}
}
The internal /api/status/ endpoints (operator-only, CF Access gated) include additional fields not present in the public response: probe_url, last_probe_latency_ms, last_probe_status_code, last_probe_error, freescout_ticket_id, sentry_errors_24h, build_status, last_deploy, state_audit_log.
The Worker emits:
Access-Control-Allow-Origin: https://status.raxx.app
Access-Control-Allow-Methods: GET
Access-Control-Max-Age: 86400
| Layer | Interval | Notes |
|---|---|---|
| Probe worker (owned surfaces with probe_url) | 60s | Runs in Raptor (raxx-api-prod background thread or cron); writes status_state |
| 3P poller (upstream partners per #608) | 300s | Separate background job; reads partner status APIs; writes status_state |
| CF Worker cache (surfaces) | 60s max-age | Invalidated on state change by webhook receiver |
| CF Worker cache (incidents) | 300s max-age | Invalidated on incident close |
| CF Worker cache (market-time) | 30s max-age | Time-sensitive; short TTL |
| Status page React app polling | 120s | Client-side interval; reads the Worker endpoint |
The status page frontend polls the CF Worker every 120 seconds. Combined with the 60s Worker cache, the maximum latency for a probe-detected state change to reach a browser is 180s (3 minutes), measured from the moment the prober writes the new state. FreeScout-driven changes trigger cache invalidation, which cuts the cache wait to the Worker's propagation time (typically <5s); the browser then picks up the change on its next poll, at most about 125s after the webhook, well within the 5-minute SLA.
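The latency budget can be checked with simple arithmetic. The sketch below uses the intervals from the table above; the 5s invalidation figure is the "typically <5s" estimate from the text, not a measured bound, and the constant names are chosen for illustration.

```python
PROBE_INTERVAL_S = 60   # probe worker cadence
CACHE_TTL_S = 60        # CF Worker cache max-age for /surfaces
POLL_INTERVAL_S = 120   # browser polling interval
INVALIDATION_S = 5      # assumed upper end of "typically <5s" after a purge
SLA_S = 5 * 60          # ticket updates must be visible within 5 minutes

# Probe path, measured from the moment the prober writes the new state.
probe_write_to_browser = CACHE_TTL_S + POLL_INTERVAL_S

# Same path measured from failure onset: add up to one probe interval.
probe_onset_to_browser = PROBE_INTERVAL_S + probe_write_to_browser

# FreeScout path: the purge replaces the cache wait, polling still applies.
ticket_to_browser = INVALIDATION_S + POLL_INTERVAL_S

print(probe_write_to_browser, probe_onset_to_browser, ticket_to_browser)
# prints: 180 240 125
```

All three worst-case figures stay under the 300s SLA, with the FreeScout path leaving the most headroom.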
sequenceDiagram
actor Operator
participant FreeScout
participant WebhookReceiver as Raptor<br/>POST /api/webhooks/freescout
participant DB as Postgres<br/>status_state
participant AuditLog as Postgres<br/>status_audit_log
participant CFWorker as CF Worker<br/>/api/status/public
participant Browser as Customer Browser<br/>status.raxx.app
Operator->>FreeScout: Updates public_status field on ticket
FreeScout->>WebhookReceiver: POST webhook (ticket.updated)
WebhookReceiver->>WebhookReceiver: Verify HMAC signature
WebhookReceiver->>DB: Upsert status_state (ticket_pending=true, public_note=...)
WebhookReceiver->>AuditLog: INSERT audit row (actor=freescout, new_state=DEGRADED)
WebhookReceiver->>CFWorker: POST /internal/cache/invalidate?surface=app-raxx-app
WebhookReceiver-->>FreeScout: HTTP 200
Browser->>CFWorker: GET /api/status/public/surfaces (polling interval)
CFWorker->>WebhookReceiver: GET /internal/status/snapshot (cache miss)
CFWorker-->>Browser: JSON with ticket_pending=true, public_note="..."
Note over Browser: Status page shows "Pending issue" badge<br/>within one polling cycle (≤120s after cache invalidation)
sequenceDiagram
participant PartnerAPI as Partner Status API<br/>(e.g., status.heroku.com)
participant Poller3P as Raptor<br/>3P Poller (5-min cron)
participant DB as Postgres<br/>status_state
participant AuditLog as Postgres<br/>status_audit_log
participant CFWorker as CF Worker
participant Browser as Customer Browser
loop Every 5 minutes
Poller3P->>PartnerAPI: GET partner status JSON
PartnerAPI-->>Poller3P: {status: "major_outage"}
end
Poller3P->>DB: UPDATE status_state SET state=DOWN, state_source=3p_poller WHERE surface_id=heroku-platform
Poller3P->>AuditLog: INSERT audit row (source=3p_poller, new_state=DOWN)
Poller3P->>CFWorker: POST /internal/cache/invalidate?surface=heroku-platform
Browser->>CFWorker: GET /api/status/public/surfaces
CFWorker-->>Browser: JSON with heroku-platform.state=DOWN,<br/>public_description="We're tracking Heroku's incident — our service resumes when they recover."
Note over Browser: 3P tile shows "Service disruption"<br/>with partner-incident copy
sequenceDiagram
actor Operator
participant FreeScout
participant WebhookReceiver as Raptor<br/>POST /api/webhooks/freescout
participant DB as Postgres<br/>status_state
participant Incidents as Postgres<br/>status_incidents
participant AuditLog as Postgres<br/>status_audit_log
participant CFWorker as CF Worker
participant Browser as Customer Browser
Operator->>FreeScout: Closes ticket (marks resolved)
FreeScout->>WebhookReceiver: POST webhook (ticket.status_changed, status=closed)
WebhookReceiver->>WebhookReceiver: Verify HMAC; check prober state
Note over WebhookReceiver: prober state = OPERATIONAL → transition to OPERATIONAL
WebhookReceiver->>DB: UPDATE status_state SET ticket_pending=false, public_note=null, state=OPERATIONAL
WebhookReceiver->>Incidents: UPDATE status_incidents SET resolved_at=now(), public_note=final_note
WebhookReceiver->>AuditLog: INSERT audit row (source=freescout, new_state=OPERATIONAL)
WebhookReceiver->>CFWorker: POST /internal/cache/invalidate (surfaces + incidents)
WebhookReceiver-->>FreeScout: HTTP 200
Browser->>CFWorker: GET /api/status/public/surfaces
CFWorker-->>Browser: Surface shows OPERATIONAL, ticket_pending=false
Browser->>CFWorker: GET /api/status/public/incidents
CFWorker-->>Browser: Resolved incident appears in 30-day historical log
The status system is a new schema, not a modification of an existing one. No existing table is altered.
Migration plan:
1. Add status_state, status_audit_log, and status_incidents tables via a numbered migration in backend_v2/db/migrations/.
2. Seed status_state with one row per surface in config/status-surfaces.yaml, all with state=UNKNOWN and state_source=prober.
3. Rollback: DROP TABLE status_incidents; DROP TABLE status_audit_log; DROP TABLE status_state; — no data loss on rollback (the tables are new and operator-readable; they contain no user PII).
YAML registry changes are additive (new surfaces) or removals. Removing a surface requires:
1. Remove from YAML.
2. Delete the corresponding status_state row.
3. Optionally archive status_incidents rows for that surface_id.
4. Update FreeScout component_tag dropdown to remove the option.
| Phase | Description | Gate |
|---|---|---|
| Dark | Deploy schema, YAML config, probe worker — no public surface. Internal status accessible via existing console dashboard. | Dev review |
| Flag | Enable FLAG_STATUS_PUBLIC_ENDPOINT in staging. CF Worker deployed behind staging route. | Internal dogfood: ops team uses status page for 1 week |
| Beta | status.raxx.app DNS live but not linked from marketing or app. Share URL with 2-3 friendly users for feedback. FreeScout integration active. | No P1 bugs for 3 days |
| GA | Link from raxx.app footer, app.raxx.app footer, and error pages. Announce in changelog. | All #581 ACs pass |
- public_note field is operator-authored text and must not contain customer-identifying information. The operator runbook (#605) documents this.
- /raxx/freescout/webhook-secret and in the Heroku config var FREESCOUT_WEBHOOK_SECRET. It is rotatable without redeploy (Heroku config var update + dyno restart).
- X-Status-Worker-Key header authenticates CF Worker calls to /internal/status/snapshot. Stored in Infisical and CF Worker secrets. Rotatable: update in both stores; CF Worker picks up the new secret on next deploy (or immediately via CF API secret update, which takes effect on the next request).
- status_audit_log rows are retained indefinitely (no PII, small volume). status_incidents rows older than 30 days are deleted by a nightly job. Audit log is DPA-ready: no user PII is stored; the actor field contains either a system identifier (prober, freescout, 3p_poller) or an operator ID (not a customer ID).
- status_audit_log can be used to demonstrate that users were notified of a service disruption within the required window. No separate breach notification mechanism is required for the status system itself.
- FLAG_STATUS_PUBLIC_ENDPOINT can be disabled to serve a static "maintenance" response from the CF Worker. The FreeScout webhook receiver on raxx-api-prod continues to write state even when the public endpoint is dark.

These require decisions before the corresponding sub-cards can be claimed:
Database hosting for status_state. ~~The design assumes the Postgres instance already backing raxx-api-prod (Heroku Postgres). If the status system gets its own DB (for blast-radius isolation), the Worker's internal data source changes. Decision needed before #603.~~ RESOLVED (#646, 2026-04-30): Status state migrated to Cloudflare D1 (raxx-status-db) for full blast-radius isolation. The Worker reads/writes D1 directly. Postgres is not used for status state.
3P poller scope at launch. #608 evaluates whether partner status APIs are practical to poll. If the evaluation concludes "not worth it at launch," 3P upstream surfaces start as OPERATIONAL and only change via manual override or FreeScout ticket. The state machine supports this — no code change required for a deferred 3P poller.
FreeScout deployment status. The memory file project_console_ticketing_integration.md notes FreeScout was not yet deployed as of 2026-04-25. The status page (#603, #604) can ship with ticket_pending always false and be wired to FreeScout when it is live. This should be confirmed as the intended phasing before #605 and #607 are scheduled.
incident_severity field on FreeScout. ADR-0030 adds a third custom field (incident_severity) that the original #605 card does not mention. The #605 runbook should be updated to include this field. card-groomer should update the #605 acceptance criteria.
Market Calendar source for the widget endpoint. The /widgets/market-time endpoint references a "Market Calendar Service." If this service does not exist in backend_v2/, the endpoint must use a hardcoded NYSE schedule with early-close dates. The decision should be confirmed before #603 implements the endpoint.