Raxx · internal docs


status.raxx.app — Data Model, State Machine, and FreeScout Webhook Contract

Status: Accepted (updated 2026-04-30 UTC; D1 migration per #646)
Date: 2026-04-28 UTC
Owner: software-architect
Parent epic: #581
Design card: #601
ADRs: 0028 · 0029 · 0030
Downstream sub-cards: #602 (infra/DNS) · #603 (public API backend) · #604 (React app) · #605 (FreeScout fields) · #607 (webhook receiver) · #608 (3P poller eval) · #646 (D1 migration)

D1 UPDATE (2026-04-30, #646): The runtime state tables (status_state, status_incidents, status_audit_log) have migrated from the shared Heroku Postgres instance to Cloudflare D1 (raxx-status-db), served by a CF Worker (raxx-status-worker) at status.raxx.app/api/*. All references to "Postgres" in this document refer to the superseded architecture. The current schema is SQLite-compatible and identical in structure. Writers (3P poller, FreeScout webhook) POST to the Worker's internal HTTP API instead of writing to a local DB. See docs/ops/runbooks/status-d1.md for operational details.


1. Context

status.raxx.app is a customer-facing page showing the live health of every service Raxx operates or depends on. It is the communication channel between the ops team and users during incidents. It must be reachable when things are broken, must never expose operator-internal data, and must reflect ticket updates within 5 minutes.

This document covers the foundational layer: surface registry, state machine, data model, FreeScout webhook contract, and public API shape. All downstream sub-cards build on these definitions.


2. Invariants

All platform invariants apply. Status-specific constraints:

  1. No credentials in responses. The public endpoint returns zero credential-shaped values. No API keys, no internal hostnames, no stack traces, no Heroku app names that are not already public.
  2. No broker names in public copy. Downstream surfaces (broker connectivity, market data, trade execution) are labeled generically. The word "Alpaca," "SnapTrade," or any broker name never appears in a public API response or page copy.
  3. No forward-looking copy. All status descriptions describe current or historical state. No "expected to recover by," no "likely to," no ETAs unless they are firm operator-entered maintenance window end times.
  4. Audit trail for every state change. Every transition to a non-OPERATIONAL state and every manual override writes to status_audit_log with actor, timestamp, previous state, new state, and source (prober | freescout | 3p_poller | manual | schedule).
  5. Privacy boundary is hard. FreeScout ticket URLs and ticket bodies are never exposed publicly. The public surface shows only: ticket_pending: true/false, public_note (operator-written summary, max 280 chars), and resolved_at on historical entries.
  6. Status page must not share hosting with the monitored service. Per ADR-0028, the public endpoint lives on a Cloudflare Worker. A Heroku dyno restart that causes a monitored surface to go DOWN must not also take the status endpoint offline.

3. Surface Registry

3a. Schema (config/status-surfaces.yaml)

# config/status-surfaces.yaml
# One entry per status-page surface.
# id must be a stable slug — never change after first deploy.
# Changing id requires a migration of status_state rows and FreeScout dropdown options.

surfaces:
  - id: raxx-app
    display_name: "raxx.app (marketing site)"
    category: owned            # owned | upstream_3p | downstream_3p
    probe_url: "https://raxx.app/health"
    partner_status_url: null
    public_description: "Raxx marketing and public website."
    partner_name: null

  - id: getraxx-com
    display_name: "getraxx.com"
    category: owned
    probe_url: "https://getraxx.com/"
    partner_status_url: null
    public_description: "Raxx marketing domain redirect."
    partner_name: null

  - id: app-raxx-app
    display_name: "Raxx App"
    category: owned
    probe_url: "https://app.raxx.app/health"
    partner_status_url: null
    public_description: "The main Raxx trading application."
    partner_name: null

  - id: demo-raxx-app
    display_name: "Demo"
    category: owned
    probe_url: "https://demo.raxx.app/health"
    partner_status_url: null
    public_description: "Raxx demo environment."
    partner_name: null

  - id: docs-raxx-app
    display_name: "Documentation"
    category: owned
    probe_url: "https://docs.raxx.app/"
    partner_status_url: null
    public_description: "Raxx public documentation."
    partner_name: null

  - id: internal-docs-raxx-app
    display_name: "Internal Docs"
    category: owned
    probe_url: "https://internal-docs.raxx.app/"
    partner_status_url: null
    public_description: "Internal team documentation."
    partner_name: null

  - id: console-raxx-app
    display_name: "Operator Console"
    category: owned
    probe_url: "https://console.raxx.app/health"
    partner_status_url: null
    public_description: "Internal operator console."
    partner_name: null

  - id: tickets-raxx-app
    display_name: "Support Tickets"
    category: owned
    probe_url: "https://tickets.raxx.app/"
    partner_status_url: null
    public_description: "Customer support ticketing system."
    partner_name: null

  - id: status-raxx-app
    display_name: "Status Page"
    category: owned
    probe_url: "https://status.raxx.app/"
    partner_status_url: null
    public_description: "This status page."
    partner_name: null

  # Email flows
  - id: email-outbound-transactional
    display_name: "Outbound Email (Transactional)"
    category: owned
    probe_url: null            # no HTTP probe; state driven by FreeScout ticket or 3P poller
    partner_status_url: null
    public_description: "Account confirmations, notifications, and password-reset emails from no-reply@raxx.app."
    partner_name: null

  - id: email-inbound-support
    display_name: "Support Inbox"
    category: owned
    probe_url: null
    partner_status_url: null
    public_description: "Inbound email to support@raxx.app and ticket creation."
    partner_name: null

  # Internal services
  - id: api-service
    display_name: "API Service"
    category: owned
    probe_url: "https://api.raxx.app/health"
    partner_status_url: null
    public_description: "Core Raxx API (data, backtesting, account management)."
    partner_name: null

  - id: vault-raxx-app
    display_name: "Secrets Vault"
    category: owned
    probe_url: "https://vault.raxx.app/api/v1/health"
    partner_status_url: null
    public_description: "Internal credential and secrets management. Degradation does not directly affect trading."
    partner_name: null

  - id: ci-runners
    display_name: "CI / Deployment Pipeline"
    category: owned
    probe_url: null
    partner_status_url: null
    public_description: "GitHub Actions CI and deployment workflows."
    partner_name: null

  # Upstream 3P (we depend on them for our own operation)
  - id: cloudflare
    display_name: "Cloudflare"
    category: upstream_3p
    probe_url: null
    partner_status_url: "https://www.cloudflarestatus.com/api/v2/summary.json"
    public_description: "CDN, DDoS protection, and access management. Cloudflare incidents may affect all Raxx-owned sites."
    partner_name: "Cloudflare"

  - id: heroku-platform
    display_name: "Hosting Platform"
    category: upstream_3p
    probe_url: null
    partner_status_url: "https://status.heroku.com/api/v4/current-status"
    public_description: "Cloud hosting for Raxx application services. Platform incidents may affect app availability."
    partner_name: "Heroku"

  - id: email-delivery
    display_name: "Email Delivery"
    category: upstream_3p
    probe_url: null
    partner_status_url: "https://status.postmarkapp.com/api/v2/summary.json"
    public_description: "Email delivery infrastructure. Incidents may delay transactional emails."
    partner_name: "Postmark"

  - id: billing-payment
    display_name: "Billing & Payments"
    category: upstream_3p
    probe_url: null
    partner_status_url: "https://status.stripe.com/api/v2/summary.json"
    public_description: "Payment processing. Billing incidents may affect subscription management."
    partner_name: "Stripe"

  - id: workspace
    display_name: "Team Workspace"
    category: upstream_3p
    probe_url: null
    partner_status_url: "https://www.google.com/appsstatus/rss/en"
    public_description: "Collaboration tools used by the Raxx team. Workspace incidents do not affect trading."
    partner_name: "Google Workspace"

  - id: secrets-management
    display_name: "Secrets Management"
    category: upstream_3p
    probe_url: null
    partner_status_url: null   # Infisical Cloud status URL — evaluate in #608
    public_description: "Managed secrets infrastructure. Incidents may affect credential rotation, not active trading."
    partner_name: "Infisical"

  # Downstream 3P (user-facing; we route to them; generic names per invariant)
  - id: broker-connectivity
    display_name: "Broker Connectivity"
    category: downstream_3p
    probe_url: null
    partner_status_url: null   # no single partner URL — per BYOB hybrid; evaluate per #608
    public_description: "Connection to your brokerage for order routing and account data. Your trades are unaffected during Raxx-side degradations; broker-side incidents are noted here."
    partner_name: null         # intentionally null — generic copy only

  - id: market-data-feed
    display_name: "Market Data Feed"
    category: downstream_3p
    probe_url: null
    partner_status_url: null
    public_description: "Real-time and historical market data. Feed disruptions may delay chart updates and backtest data."
    partner_name: null

  - id: trade-execution
    display_name: "Trade Execution Path"
    category: downstream_3p
    probe_url: null
    partner_status_url: null
    public_description: "Order submission and execution. Disruptions are noted here with status from your connected broker."
    partner_name: null
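
A loader would likely validate each registry entry before status_state is seeded. The sketch below checks a few rules implied by the schema comments and by invariant 2. The function name validate_surface and the slug pattern are hypothetical, and entries are shown as plain dicts so the example does not depend on a YAML parser.

```python
import re

CATEGORIES = {"owned", "upstream_3p", "downstream_3p"}
# Assumed slug shape: lowercase letters, digits, and single hyphens.
# Every id in the registry above fits this pattern.
SLUG_RE = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")


def validate_surface(entry: dict) -> list[str]:
    """Return a list of problems; an empty list means the entry looks valid."""
    problems = []
    sid = entry.get("id")
    if not isinstance(sid, str) or not SLUG_RE.match(sid):
        problems.append(f"id {sid!r} is not a stable slug")
    if entry.get("category") not in CATEGORIES:
        problems.append(f"category {entry.get('category')!r} is not recognized")
    # Invariant 2: downstream surfaces carry no partner name,
    # so none can leak into public copy.
    if entry.get("category") == "downstream_3p" and entry.get("partner_name") is not None:
        problems.append("downstream_3p surfaces must set partner_name to null")
    if not entry.get("display_name") or not entry.get("public_description"):
        problems.append("display_name and public_description are required")
    return problems
```

Running such a check in CI on every change to config/status-surfaces.yaml would catch a broker name added to a downstream entry before it reaches the public endpoint.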

3b. Runtime state table (status_state)

Postgres table. One row per surface_id. Written by: probe worker, FreeScout webhook receiver, 3P poller, manual override endpoint.

CREATE TABLE status_state (
    surface_id          TEXT PRIMARY KEY,           -- FK to YAML registry id
    state               TEXT NOT NULL               -- OPERATIONAL|DEGRADED|PARTIAL|DOWN|MAINTENANCE|UNKNOWN
                            CHECK (state IN ('OPERATIONAL','DEGRADED','PARTIAL','DOWN','MAINTENANCE','UNKNOWN')),
    state_since         TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    state_source        TEXT NOT NULL               -- prober|freescout|3p_poller|manual|schedule
                            CHECK (state_source IN ('prober','freescout','3p_poller','manual','schedule')),
    ticket_pending      BOOLEAN NOT NULL DEFAULT FALSE,
    public_note         TEXT,                       -- operator-written, max 280 chars; NULL if no active note
    last_probe_ok       BOOLEAN,                    -- NULL for surfaces with no probe_url
    last_probe_at       TIMESTAMPTZ,
    last_probe_latency_ms INTEGER,
    maintenance_until   TIMESTAMPTZ,                -- non-null only when state=MAINTENANCE
    updated_at          TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE TABLE status_audit_log (
    id                  BIGSERIAL PRIMARY KEY,
    surface_id          TEXT NOT NULL,
    actor               TEXT NOT NULL,              -- 'prober'|'freescout'|'3p_poller'|operator_id
    previous_state      TEXT,
    new_state           TEXT NOT NULL,
    source              TEXT NOT NULL,
    note                TEXT,                       -- human-readable reason (optional)
    created_at          TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX ON status_audit_log (surface_id, created_at DESC);

CREATE TABLE status_incidents (
    id                  BIGSERIAL PRIMARY KEY,
    surface_id          TEXT NOT NULL,
    opened_at           TIMESTAMPTZ NOT NULL,
    resolved_at         TIMESTAMPTZ,               -- NULL = still open
    public_note         TEXT,                      -- final operator-written note at close
    freescout_ticket_id BIGINT UNIQUE,             -- internal reference only, never exposed publicly; UNIQUE is required by the ON CONFLICT upsert in 5e
    severity            TEXT CHECK (severity IN ('degraded','partial','down','maintenance')),
    created_at          TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX ON status_incidents (surface_id, resolved_at DESC);
-- Incidents older than 30 days are moved to cold storage or deleted by a nightly job.
-- The 30-day window is a product requirement from #581.

4. State Machine

4a. States

| State | Public label | Badge color | Description |
|---|---|---|---|
| OPERATIONAL | Operational | Green | Fully functional |
| DEGRADED | Degraded performance | Amber | Functional but impaired |
| PARTIAL | Partial outage | Orange | Subset of functionality unavailable |
| DOWN | Service disruption | Red | Non-functional |
| MAINTENANCE | Scheduled maintenance | Blue/slate | Planned downtime |
| UNKNOWN | Status unknown | Slate | No data yet or probe gap |

4b. Transition diagram

stateDiagram-v2
    [*] --> UNKNOWN : surface registered

    UNKNOWN --> OPERATIONAL : prober(N=3 ok)
    UNKNOWN --> DEGRADED : freescout(ticket open, severity=degraded)
    UNKNOWN --> DOWN : freescout(ticket open, severity=down)

    OPERATIONAL --> DEGRADED : prober(1 failure) | freescout(ticket open, severity=degraded) | 3p_poller(partner degraded)
    OPERATIONAL --> PARTIAL : freescout(ticket open, severity=partial)
    OPERATIONAL --> DOWN : prober(N=3 failures) | freescout(ticket open, severity=down) | 3p_poller(partner down)
    OPERATIONAL --> MAINTENANCE : schedule(window start) | manual
    OPERATIONAL --> UNKNOWN : prober(gap > 90s)

    DEGRADED --> OPERATIONAL : prober(N=3 ok) AND no open ticket
    DEGRADED --> DOWN : prober(N=3 failures) | freescout(severity escalated to down)
    DEGRADED --> MAINTENANCE : schedule(window start) | manual
    DEGRADED --> UNKNOWN : prober(gap > 90s, no open ticket)

    PARTIAL --> OPERATIONAL : freescout(ticket close) AND prober(ok or no probe)
    PARTIAL --> DOWN : freescout(severity escalated)
    PARTIAL --> MAINTENANCE : manual

    DOWN --> DEGRADED : prober(N=3 ok) AND open ticket exists
    DOWN --> OPERATIONAL : prober(N=3 ok) AND no open ticket
    DOWN --> MAINTENANCE : schedule | manual

    MAINTENANCE --> OPERATIONAL : schedule(window end) — prober confirms within 60s
    MAINTENANCE --> DOWN : prober(failure at window end)

    note right of OPERATIONAL
      Conflict rule: manual override > MAINTENANCE > DOWN > PARTIAL > DEGRADED > UNKNOWN > OPERATIONAL
    end note

4c. Prober hysteresis

The prober is deliberately asymmetric: degradation is flagged on the first failure, while escalation to DOWN and recovery to OPERATIONAL each require N=3 consecutive results to prevent flapping:
  - A surface transitions from OPERATIONAL to DEGRADED after a single probe failure (fast detection).
  - A surface transitions from DEGRADED to DOWN after 3 consecutive probe failures.
  - A surface transitions from DOWN or DEGRADED to OPERATIONAL after 3 consecutive probe successes, provided no ticket is open (see 4b).

Probes run every 60 seconds. Maximum detection latency from the onset of a failure to DEGRADED: 60s (one probe interval). Maximum recovery latency from the last failed probe to OPERATIONAL: 3 minutes (three successful probes at 60s intervals).
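
The hysteresis rules can be sketched as a small streak counter. This is an illustration under assumptions, not the prober's actual code: the ProbeTracker class and its field names are hypothetical, and the sketch leaves PARTIAL and MAINTENANCE untouched because the diagram in 4b gives the prober no transitions out of them.

```python
from dataclasses import dataclass

FAILURES_TO_DOWN = 3      # consecutive failures before DOWN
SUCCESSES_TO_RECOVER = 3  # consecutive successes before recovery


@dataclass
class ProbeTracker:
    state: str = "UNKNOWN"
    ok_streak: int = 0
    fail_streak: int = 0

    def record(self, probe_ok: bool, open_ticket: bool = False) -> str:
        """Feed one probe result and return the resulting state."""
        if probe_ok:
            self.ok_streak += 1
            self.fail_streak = 0
            recoverable = self.state in ("UNKNOWN", "DEGRADED", "DOWN")
            if self.ok_streak >= SUCCESSES_TO_RECOVER and recoverable:
                # While a ticket is open, recovery stops at DEGRADED.
                self.state = "DEGRADED" if open_ticket else "OPERATIONAL"
        else:
            self.fail_streak += 1
            self.ok_streak = 0
            if self.fail_streak >= FAILURES_TO_DOWN and self.state in ("OPERATIONAL", "DEGRADED"):
                self.state = "DOWN"
            elif self.state == "OPERATIONAL":
                self.state = "DEGRADED"  # one failure is enough for fast detection
        return self.state
```

Feeding three successes, then three failures, walks a fresh tracker through UNKNOWN, OPERATIONAL, DEGRADED, and DOWN in that order.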


5. FreeScout Webhook Contract

5a. Overview

FreeScout fires a webhook on ticket create, update, and close. The webhook receiver (POST /api/webhooks/freescout) on raxx-api-prod processes the event, updates status_state, writes to status_incidents, and then signals the CF Worker cache invalidation endpoint.

FreeScout webhook version: v1 (standard JSON POST). FreeScout supports HMAC-SHA256 signature verification; use this (not plaintext shared secret). The signature is in the X-FreeScout-Signature header as sha256=<hex>.
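
A minimal sketch of the signature check, assuming the header format stated above (sha256=<hex> over the raw request body). The function name verify_signature is hypothetical, and the exact bytes FreeScout signs should be confirmed against a real delivery before relying on this.

```python
import hashlib
import hmac


def verify_signature(secret: bytes, raw_body: bytes, header_value: str) -> bool:
    """Check a `sha256=<hex>` signature header against the raw request body."""
    prefix = "sha256="
    if not header_value or not header_value.startswith(prefix):
        return False
    expected = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    received = header_value[len(prefix):].lower()
    # compare_digest runs in constant time, so a mismatch does not reveal
    # how many leading characters were correct.
    return hmac.compare_digest(expected.encode("utf-8"), received.encode("utf-8"))
```

The check must run on the raw bytes as received, before any JSON parsing, because re-serializing the body can change whitespace and key order.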

5b. Custom fields

Per #605, two custom fields are required on FreeScout tickets (and a third added by ADR-0030):

| FreeScout field slug | Type | Purpose |
|---|---|---|
| component_tag | Select (dropdown) | Maps ticket to a surface_id from the registry |
| public_status | Text (max 280 chars) | Operator-written public note mirrored to status page |
| incident_severity | Select | degraded / partial / down; drives state transition target |

The field slugs above are the slugs the webhook receiver reads. They must match exactly. The #605 runbook documents how to create these fields and verify their slugs via FreeScout API.

5c. Webhook payload (inbound to receiver)

FreeScout POSTs a JSON body. The receiver only reads the fields below; all others are ignored.

{
  "event": "ticket.created",       // ticket.created | ticket.updated | ticket.status_changed
  "ticket": {
    "id": 4821,                    // FreeScout numeric ticket ID — stored in status_incidents, never exposed publicly
    "status": "active",            // active | pending | closed
    "customFields": [
      {
        "id": 12,                  // FreeScout field ID — not used; slug is canonical
        "name": "Component Tag",
        "slug": "component_tag",
        "value": "app-raxx-app"    // must match a surface_id in the YAML registry
      },
      {
        "id": 13,
        "name": "Public Status",
        "slug": "public_status",
        "value": "Login flow is intermittently failing. We are investigating."
      },
      {
        "id": 14,
        "name": "Incident Severity",
        "slug": "incident_severity",
        "value": "degraded"        // degraded | partial | down
      }
    ]
  }
}
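
Reading the payload by slug rather than by field ID might look like the sketch below. The helper name extract_ticket_fields and the returned dict shape are hypothetical. Treating an unrecognized severity the same as an unset one is an assumption; the document only specifies the default for an unset value.

```python
MAX_PUBLIC_NOTE_CHARS = 280
VALID_SEVERITIES = {"degraded", "partial", "down"}


def extract_ticket_fields(payload: dict) -> dict:
    """Pull out only the fields the receiver reads, keyed by custom-field slug."""
    ticket = payload.get("ticket", {})
    by_slug = {f.get("slug"): f.get("value") for f in ticket.get("customFields", [])}
    severity = by_slug.get("incident_severity")
    note = by_slug.get("public_status")
    return {
        "ticket_id": ticket.get("id"),
        "ticket_status": ticket.get("status"),
        "surface_id": by_slug.get("component_tag"),
        # Assumption: unset or unrecognized severity falls back to "degraded".
        "severity": severity if severity in VALID_SEVERITIES else "degraded",
        # Truncate to the 280-character limit on public notes.
        "public_note": note[:MAX_PUBLIC_NOTE_CHARS] if isinstance(note, str) else None,
    }
```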

5d. Receiver processing logic

function handle_freescout_webhook(payload):
  1. Verify HMAC-SHA256 signature → HTTP 401 if invalid
  2. Extract surface_id = customFields["component_tag"].value
  3. If surface_id not in registry → log warning, return HTTP 200 (graceful no-op)
  4. Extract public_note = customFields["public_status"].value (truncate to 280 chars)
  5. Extract severity = customFields["incident_severity"].value (default "degraded")
  6. Extract ticket_status = payload.ticket.status

  if ticket_status in ("active", "pending"):
    7a. Upsert status_state:
          ticket_pending = true
          public_note = public_note
          state = resolve_ticket_state(current_prober_state, severity)
          state_source = "freescout"
          updated_at = now()
    7b. Upsert status_incidents:
          opened_at = now() if new; leave if existing
          severity = severity
          public_note = public_note
    7c. Write status_audit_log row
    7d. Trigger CF cache invalidation for the surface

  if ticket_status == "closed":
    8a. Determine new state:
          if last_probe_ok is true or (probe_url is null): new_state = "OPERATIONAL"
          else: new_state = "DEGRADED"  (probe still failing — ticket closed but issue persists)
    8b. Update status_state:
          ticket_pending = false
          public_note = null
          state = new_state
          state_source = "freescout"
          updated_at = now()
    8c. Close status_incidents row:
          resolved_at = now()
          public_note = public_note (the final note at close)
    8d. Write status_audit_log row
    8e. Trigger CF cache invalidation

function resolve_ticket_state(prober_state, severity):
  if prober_state == "DOWN": return "DOWN"          // probe wins; don't downgrade
  if severity == "down":    return "DOWN"
  if severity == "partial": return "PARTIAL"
  return "DEGRADED"                                  // default for severity=degraded or unset
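
The two pure decisions in the pseudocode above (the state while a ticket is open, and the state once it closes) translate directly into testable functions. The name state_after_close is hypothetical; resolve_ticket_state mirrors the pseudocode line for line.

```python
from typing import Optional


def resolve_ticket_state(prober_state: str, severity: Optional[str]) -> str:
    """State to publish while a ticket is active or pending."""
    if prober_state == "DOWN":
        return "DOWN"  # the probe wins; a ticket never downgrades DOWN
    if severity == "down":
        return "DOWN"
    if severity == "partial":
        return "PARTIAL"
    return "DEGRADED"  # default for severity "degraded" or unset


def state_after_close(last_probe_ok: Optional[bool], probe_url: Optional[str]) -> str:
    """State to publish when the ticket closes (step 8a)."""
    if last_probe_ok is True or probe_url is None:
        return "OPERATIONAL"
    return "DEGRADED"  # ticket closed, but the probe is still failing
```

Keeping these free of database access makes the transition rules easy to unit-test separately from the upsert and cache-invalidation steps.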

5e. Idempotency and replay

The FreeScout webhook fires on every ticket update, including field edits. The receiver must be idempotent:
  - status_state rows are upserted on surface_id. Multiple webhook calls for the same ticket and same state are no-ops at the DB level (no duplicate rows).
  - status_incidents rows use freescout_ticket_id as a unique key: INSERT ... ON CONFLICT (freescout_ticket_id) DO UPDATE prevents duplicate incident rows.
  - Cache invalidation is idempotent (CF purge by URL is safe to call multiple times).

If FreeScout retries a webhook (e.g., after a timeout), the second call produces identical state. No deduplication queue is needed.
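
The incident upsert can be demonstrated with Python's built-in sqlite3 module, which is a reasonable stand-in given that the D1 update note describes the current schema as SQLite-compatible. The table below is a trimmed copy of status_incidents, and the helper upsert_incident is hypothetical. Note that ON CONFLICT only works when freescout_ticket_id carries a unique constraint.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE status_incidents (
        id                  INTEGER PRIMARY KEY,
        surface_id          TEXT NOT NULL,
        opened_at           TEXT NOT NULL,
        resolved_at         TEXT,
        public_note         TEXT,
        freescout_ticket_id INTEGER UNIQUE,  -- the key the upsert conflicts on
        severity            TEXT
    )
    """
)


def upsert_incident(ticket_id: int, surface_id: str, severity: str, note: str, now: str) -> None:
    """Insert a new incident, or refresh severity and note on an existing one."""
    conn.execute(
        """
        INSERT INTO status_incidents
            (surface_id, opened_at, public_note, freescout_ticket_id, severity)
        VALUES (?, ?, ?, ?, ?)
        ON CONFLICT (freescout_ticket_id) DO UPDATE SET
            severity = excluded.severity,
            public_note = excluded.public_note
        """,
        (surface_id, now, note, ticket_id, severity),
    )
    conn.commit()
```

Because opened_at is absent from the DO UPDATE clause, a replayed or follow-up webhook leaves the original open time intact.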

5f. Delivery failure and retry

FreeScout retries failed webhook deliveries (non-2xx response) up to 3 times with exponential backoff. The receiver must return HTTP 200 for all successfully processed events, including graceful no-ops (unknown surface_id). HTTP 5xx causes a retry.

A separate "webhook retry queue" sub-card is not needed at this scale. If FreeScout's built-in retry is insufficient (e.g., if the receiver itself is down for an extended period), the operator can manually re-trigger by updating any field on the affected ticket.


6. Public API Contract

6a. Endpoints

All endpoints are served by the Cloudflare Worker (per ADR-0028). The Worker fetches fresh data from GET /internal/status/snapshot on raxx-api-prod and caches the response.

GET /api/status/public/surfaces
Returns the current state of all surfaces. Cache: 60s.

{
  "generated_at": "2026-04-28T14:30:00Z",   // ISO 8601 UTC
  "overall_status": "DEGRADED",              // worst non-OPERATIONAL state across owned surfaces
  "surfaces": [
    {
      "id": "app-raxx-app",
      "display_name": "Raxx App",
      "category": "owned",                   // owned | upstream_3p | downstream_3p
      "state": "DEGRADED",                   // OPERATIONAL|DEGRADED|PARTIAL|DOWN|MAINTENANCE|UNKNOWN
      "state_since": "2026-04-28T13:55:00Z",
      "ticket_pending": true,
      "public_note": "Login flow is intermittently failing. We are investigating.",
      "public_description": "The main Raxx trading application.",
      "last_checked_at": "2026-04-28T14:29:45Z",  // null for surfaces with no probe_url
      "maintenance_until": null              // ISO 8601 UTC or null
    },
    {
      "id": "cloudflare",
      "display_name": "Cloudflare",
      "category": "upstream_3p",
      "state": "OPERATIONAL",
      "state_since": "2026-04-01T00:00:00Z",
      "ticket_pending": false,
      "public_note": null,
      "public_description": "CDN, DDoS protection, and access management. Cloudflare incidents may affect all Raxx-owned sites.",
      "last_checked_at": null,
      "maintenance_until": null
    },
    {
      "id": "broker-connectivity",
      "display_name": "Broker Connectivity",
      "category": "downstream_3p",
      "state": "OPERATIONAL",
      "state_since": "2026-04-01T00:00:00Z",
      "ticket_pending": false,
      "public_note": null,
      "public_description": "Connection to your brokerage for order routing and account data.",
      "last_checked_at": null,
      "maintenance_until": null
    }
  ]
}
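
One way to derive overall_status is to rank states and take the worst across owned surfaces. The ranking below is an assumption: it reuses the conflict-rule order from 4b, since this document does not define a separate ordering for the roll-up. Where MAINTENANCE and UNKNOWN should sit relative to DOWN in particular is worth confirming before implementation.

```python
# Assumed ranking, borrowed from the 4b conflict rule (worst first).
SEVERITY_ORDER = ["MAINTENANCE", "DOWN", "PARTIAL", "DEGRADED", "UNKNOWN", "OPERATIONAL"]
RANK = {state: position for position, state in enumerate(SEVERITY_ORDER)}


def overall_status(surfaces: list[dict]) -> str:
    """Worst state across owned surfaces; third-party surfaces are ignored."""
    owned_states = [s["state"] for s in surfaces if s["category"] == "owned"]
    if not owned_states:
        return "UNKNOWN"  # assumption: no owned surfaces means nothing to report
    return min(owned_states, key=lambda state: RANK[state])
```

With the example response above, an upstream_3p surface in DOWN would not change the roll-up; only the owned DEGRADED surface counts.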

Fields that are never present in the public response: probe_url, partner_status_url, last_probe_latency_ms, freescout_ticket_id, partner_name for downstream_3p (generic copy only), any internal hostname or error string from the probe.
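
An allowlist is a safer way to enforce this than a blocklist, because a field added to the internal snapshot later stays private until someone opts it in. The sketch below is hypothetical (the names PUBLIC_SURFACE_FIELDS and to_public_surface are not from the codebase); the field list is copied from the example response above.

```python
PUBLIC_SURFACE_FIELDS = (
    "id", "display_name", "category", "state", "state_since", "ticket_pending",
    "public_note", "public_description", "last_checked_at", "maintenance_until",
)


def to_public_surface(internal: dict) -> dict:
    """Copy only allowlisted keys; anything else in the internal row is dropped."""
    return {key: internal.get(key) for key in PUBLIC_SURFACE_FIELDS}
```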

GET /api/status/public/incidents
Returns resolved incidents from the last 30 days. Cache: 300s.

{
  "generated_at": "2026-04-28T14:30:00Z",
  "incidents": [
    {
      "id": "inc_00421",                     // opaque identifier — not the FreeScout ticket ID
      "surface_id": "app-raxx-app",
      "surface_display_name": "Raxx App",
      "severity": "degraded",
      "opened_at": "2026-04-25T10:15:00Z",
      "resolved_at": "2026-04-25T11:42:00Z",
      "public_note": "Login flow intermittently failed. Root cause identified and resolved."
    }
  ]
}

Active (unresolved) incidents are represented by ticket_pending: true on the surface in /surfaces, not in the /incidents list. The /incidents list is historical only.

GET /api/status/public/widgets/market-time
Returns current market time data for the MarketTimeWidget. Cache: 30s.

{
  "generated_at": "2026-04-28T14:30:00Z",
  "utc_now": "2026-04-28T14:30:00Z",
  "et_now": "2026-04-28T10:30:00-04:00",
  "market_state": "open",                    // open | closed | pre-market | after-hours | holiday
  "next_event": {
    "event": "market_close",
    "at": "2026-04-28T20:00:00Z"
  },
  "holiday": null                            // or {"name": "Independence Day", "date": "2026-07-04"}
}
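
The calendar source for this endpoint is still an open question (section 12, item 5), so the following is only a sketch of the market_state classification. It assumes the regular US equity session of 09:30 to 16:00 Eastern, and it assumes pre-market and after-hours windows of 04:00 to 09:30 and 16:00 to 20:00; those extended-hours boundaries are illustrative, and early-close days are not handled. The caller is expected to pass an Eastern-time datetime and a set of holiday dates from whichever calendar source is chosen.

```python
from datetime import date, datetime, time


def market_state(et_now: datetime, holidays: frozenset[date] = frozenset()) -> str:
    """Classify an Eastern-time wall-clock datetime into a market_state value."""
    if et_now.date() in holidays:
        return "holiday"
    if et_now.weekday() >= 5:  # Saturday or Sunday
        return "closed"
    wall = et_now.time()
    if time(9, 30) <= wall < time(16, 0):
        return "open"
    if time(4, 0) <= wall < time(9, 30):   # assumed pre-market window
        return "pre-market"
    if time(16, 0) <= wall < time(20, 0):  # assumed after-hours window
        return "after-hours"
    return "closed"
```

For the et_now value in the example response (10:30 on a Tuesday), this returns "open", matching the sample payload.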

6b. Operator-only fields (not on public endpoint)

The internal /api/status/ endpoints (operator-only, CF Access gated) include additional fields not present in the public response: probe_url, last_probe_latency_ms, last_probe_status_code, last_probe_error, freescout_ticket_id, sentry_errors_24h, build_status, last_deploy, state_audit_log.

6c. CORS

The Worker emits:

Access-Control-Allow-Origin: https://status.raxx.app
Access-Control-Allow-Methods: GET
Access-Control-Max-Age: 86400

7. Polling Cadence and Caching

| Layer | Interval | Notes |
|---|---|---|
| Probe worker (owned surfaces with probe_url) | 60s | Runs in Raptor (raxx-api-prod background thread or cron); writes status_state |
| 3P poller (upstream partners per #608) | 300s | Separate background job; reads partner status APIs; writes status_state |
| CF Worker cache (surfaces) | 60s max-age | Invalidated on state change by webhook receiver |
| CF Worker cache (incidents) | 300s max-age | Invalidated on incident close |
| CF Worker cache (market-time) | 30s max-age | Time-sensitive; short TTL |
| Status page React app polling | 120s | Client-side interval; reads the Worker endpoint |

The status page frontend polls the CF Worker every 120 seconds. Combined with the 60s Worker cache, the maximum latency for a probe-detected state change to reach a browser, measured from the write to status_state, is 60s + 120s = 180s (3 minutes). FreeScout-driven changes trigger cache invalidation, which cuts the cache component to the Worker's propagation time (typically <5s); the browser then picks up the change on its next poll, at most 120s later. Both paths are well within the 5-minute SLA.


8. Sequence Diagrams

8a. Operator updates ticket public-status → status page reflects within 5 minutes

sequenceDiagram
    actor Operator
    participant FreeScout
    participant WebhookReceiver as Raptor<br/>POST /api/webhooks/freescout
    participant DB as Postgres<br/>status_state
    participant AuditLog as Postgres<br/>status_audit_log
    participant CFWorker as CF Worker<br/>/api/status/public
    participant Browser as Customer Browser<br/>status.raxx.app

    Operator->>FreeScout: Updates public_status field on ticket
    FreeScout->>WebhookReceiver: POST webhook (ticket.updated)
    WebhookReceiver->>WebhookReceiver: Verify HMAC signature
    WebhookReceiver->>DB: Upsert status_state (ticket_pending=true, public_note=...)
    WebhookReceiver->>AuditLog: INSERT audit row (actor=freescout, new_state=DEGRADED)
    WebhookReceiver->>CFWorker: POST /internal/cache/invalidate?surface=app-raxx-app
    WebhookReceiver-->>FreeScout: HTTP 200
    Browser->>CFWorker: GET /api/status/public/surfaces (polling interval)
    CFWorker->>WebhookReceiver: GET /internal/status/snapshot (cache miss)
    CFWorker-->>Browser: JSON with ticket_pending=true, public_note="..."
    Note over Browser: Status page shows "Pending issue" badge<br/>within one polling cycle (≤120s after cache invalidation)

8b. 3P partner outage → status page reflects "tracking incident"

sequenceDiagram
    participant PartnerAPI as Partner Status API<br/>(e.g., status.heroku.com)
    participant Poller3P as Raptor<br/>3P Poller (5-min cron)
    participant DB as Postgres<br/>status_state
    participant AuditLog as Postgres<br/>status_audit_log
    participant CFWorker as CF Worker
    participant Browser as Customer Browser

    loop Every 5 minutes
        Poller3P->>PartnerAPI: GET partner status JSON
        PartnerAPI-->>Poller3P: {status: "major_outage"}
    end
    Poller3P->>DB: UPDATE status_state SET state=DOWN, state_source=3p_poller WHERE surface_id=heroku-platform
    Poller3P->>AuditLog: INSERT audit row (source=3p_poller, new_state=DOWN)
    Poller3P->>CFWorker: POST /internal/cache/invalidate?surface=heroku-platform
    Browser->>CFWorker: GET /api/status/public/surfaces
    CFWorker-->>Browser: JSON with heroku-platform.state=DOWN,<br/>public_description="We're tracking Heroku's incident — our service resumes when they recover."
    Note over Browser: 3P tile shows "Service disruption"<br/>with partner-incident copy

8c. Ticket closes → surface clears + historical incident written

sequenceDiagram
    actor Operator
    participant FreeScout
    participant WebhookReceiver as Raptor<br/>POST /api/webhooks/freescout
    participant DB as Postgres<br/>status_state
    participant Incidents as Postgres<br/>status_incidents
    participant AuditLog as Postgres<br/>status_audit_log
    participant CFWorker as CF Worker
    participant Browser as Customer Browser

    Operator->>FreeScout: Closes ticket (marks resolved)
    FreeScout->>WebhookReceiver: POST webhook (ticket.status_changed, status=closed)
    WebhookReceiver->>WebhookReceiver: Verify HMAC; check prober state
    Note over WebhookReceiver: prober state = OPERATIONAL → transition to OPERATIONAL
    WebhookReceiver->>DB: UPDATE status_state SET ticket_pending=false, public_note=null, state=OPERATIONAL
    WebhookReceiver->>Incidents: UPDATE status_incidents SET resolved_at=now(), public_note=final_note
    WebhookReceiver->>AuditLog: INSERT audit row (source=freescout, new_state=OPERATIONAL)
    WebhookReceiver->>CFWorker: POST /internal/cache/invalidate (surfaces + incidents)
    WebhookReceiver-->>FreeScout: HTTP 200
    Browser->>CFWorker: GET /api/status/public/surfaces
    CFWorker-->>Browser: Surface shows OPERATIONAL, ticket_pending=false
    Browser->>CFWorker: GET /api/status/public/incidents
    CFWorker-->>Browser: Resolved incident appears in 30-day historical log

9. Migrations

The status system is a new schema, not a modification of an existing one. No existing table is altered.

Migration plan:
  1. Add status_state, status_audit_log, and status_incidents tables via a numbered migration in backend_v2/db/migrations/.
  2. Seed status_state with one row per surface in config/status-surfaces.yaml, all with state=UNKNOWN and state_source=prober.
  3. Rollback: DROP TABLE status_incidents; DROP TABLE status_audit_log; DROP TABLE status_state; with no data loss on rollback (the tables are new and operator-readable; they contain no user PII).
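
The seeding step can be made safe to re-run by skipping surfaces that already have a row. The sketch below uses sqlite3 as a stand-in for the real database and takes surface ids as a plain list; loading them from the YAML registry is out of scope here, and the function name seed_status_state is hypothetical.

```python
import sqlite3


def seed_status_state(conn: sqlite3.Connection, surface_ids: list[str]) -> None:
    """Insert one UNKNOWN row per surface; rows that already exist are left untouched."""
    conn.executemany(
        "INSERT INTO status_state (surface_id, state, state_source) "
        "VALUES (?, 'UNKNOWN', 'prober') "
        "ON CONFLICT (surface_id) DO NOTHING",
        [(surface_id,) for surface_id in surface_ids],
    )
    conn.commit()
```

Re-running the seed after a new surface is added to the registry then creates only the missing row and does not reset live state on existing surfaces.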

YAML registry changes are additive (new surfaces) or removals. Removing a surface requires:
  1. Remove from YAML.
  2. Delete the corresponding status_state row.
  3. Optionally archive status_incidents rows for that surface_id.
  4. Update FreeScout component_tag dropdown to remove the option.


10. Rollout Plan

| Phase | Description | Gate |
|---|---|---|
| Dark | Deploy schema, YAML config, probe worker; no public surface. Internal status accessible via existing console dashboard. | Dev review |
| Flag | Enable FLAG_STATUS_PUBLIC_ENDPOINT in staging. CF Worker deployed behind staging route. | Internal dogfood: ops team uses status page for 1 week |
| Beta | status.raxx.app DNS live but not linked from marketing or app. Share URL with 2-3 friendly users for feedback. FreeScout integration active. | No P1 bugs for 3 days |
| GA | Link from raxx.app footer, app.raxx.app footer, and error pages. Announce in changelog. | All #581 ACs pass |

11. Security Considerations


12. Open Questions

These require decisions before the corresponding sub-cards can be claimed:

  1. Database hosting for status_state. ~~The design assumes the Postgres instance already backing raxx-api-prod (Heroku Postgres). If the status system gets its own DB (for blast-radius isolation), the Worker's internal data source changes. Decision needed before #603.~~ RESOLVED (#646, 2026-04-30): Status state migrated to Cloudflare D1 (raxx-status-db) for full blast-radius isolation. The Worker reads/writes D1 directly. Postgres is not used for status state.

  2. 3P poller scope at launch. #608 evaluates whether partner status APIs are practical to poll. If the evaluation concludes "not worth it at launch," 3P upstream surfaces start as OPERATIONAL and only change via manual override or FreeScout ticket. The state machine supports this — no code change required for a deferred 3P poller.

  3. FreeScout deployment status. The memory file project_console_ticketing_integration.md notes FreeScout was not yet deployed as of 2026-04-25. The status page (#603, #604) can ship with ticket_pending always false and be wired to FreeScout when it is live. This should be confirmed as the intended phasing before #605 and #607 are scheduled.

  4. incident_severity field on FreeScout. ADR-0030 adds a third custom field (incident_severity) that the original #605 card does not mention. The #605 runbook should be updated to include this field. card-groomer should update the #605 acceptance criteria.

  5. Market Calendar source for the widget endpoint. The /widgets/market-time endpoint references a "Market Calendar Service." If this service does not exist in backend_v2/, the endpoint must use a hardcoded NYSE schedule with early-close dates. The decision should be confirmed before #603 implements the endpoint.