Surface state machine review — 2026-05-13

Public / Private / Protect gap analysis

Date: 2026-05-13T00:00:00Z
Author: security-agent
Status: Design memo — operator decision required. NOT implementation.
Surfaces in scope: getraxx.com, api.raxx.app, app.raxx.app, console.raxx.app, demo.raxx.app, vault.raxx.app, tickets.raxx.app, raxx.app, status.raxx.app, raxx-mockups, raxx-app-previews, docs.raxx.app, internal-docs.raxx.app
Related docs:
- docs/security/web-surface-posture.md
- docs/security/auth-posture.md
- docs/security/waf-threat-model-2026-05-11.md
- docs/ops/runbooks/deploy-freeze.md
Related issues/PRs:
- #635 Deploy freeze — existing global deploy gate
- #1364 Pre-launch security digest notification — per-event notification policy for CRIT/HIGH
- #1709 Deploy freeze per-surface — referenced in deploy-freeze runbook as future scope

Section 1 — Summary

Operator's proposal restated

Each surface tile on the console dashboard gets a three-state toggle:

Public — available to the public with no CF Access gate, normal serving
Private — behind CF Access (auth wall in front); equivalent to the current pre-launch gating posture
Protect — surface taken down and replaced with a branded error / "Status Down" page; an entry is automatically added to status.raxx.app

The goal is a single control plane so the operator does not need to "run around" under pressure during an attack or incident.

Security verdict

The three-state model is a real improvement over no control plane at all. It is not sufficient on its own, but it does not need to be complete at launch.

The operator's instinct is correct: the absence of a unified control plane is itself a security risk because it adds minutes of cognitive overhead and manual steps during the first sixty seconds of an incident. Three states are sufficient to survive a v1 launch if the implementation is done correctly — specifically if Protect is implemented with integrity (CF-layer redirect, not just an app-layer change that can be bypassed via the Heroku origin URL).

Where the model falls short is the middle ground between "mostly up" and "completely down." Several realistic attack scenarios — a partial-outage degradation, a geographically clustered DDoS, a single compromised endpoint — need a response that is neither "leave it up" nor "take it down." Those gaps matter but most of them are post-launch hardening, not v1 blockers.

The biggest risk in the operator's model is not the number of states. It is the implementation assumption: if Protect only modifies app-layer behavior without also engaging CF WAF rules and disabling origin-direct access, an attacker who already knows the .herokuapp.com URL is completely unaffected. A state toggle that can be bypassed is theater, not defense.

Section 2 — Coverage gaps

The following states are missing from the three-state model. Each is evaluated for inclusion in v1 versus post-launch hardening.

Gap A1 — Read-Only / Degraded

What it is: Surface stays up but all write operations are disabled. Reads (GET requests) succeed; mutations (POST, PUT, DELETE) are rejected at the application layer with a structured error message.

When you need it:
- Partial DB failure: reads are fine, writes are failing with errors. Customers should still be able to view their portfolios but not submit orders.
- Exploit mitigation while a patch is being prepared. Example: a vulnerability in the order-submission endpoint is suspected. Disable writes on api.raxx.app for the affected endpoint group while the fix lands, rather than taking the entire API down.
- Scheduled maintenance window where you want customers to read but not transact.

Why the three-state model misses it: There is no state between "fully up" (Public or Private) and "completely down" (Protect) that allows graceful write-disabling. Reaching for Protect when only writes need to stop is excessive and creates unnecessary customer disruption.

Recommendation: Include in v1. This is implementable as a feature flag per endpoint group (FLAG_WRITES_DISABLED=true returning 503 Service Unavailable on mutations) rather than a surface-level state. It does not require CF API calls. Composable with the three-state model. See Section 7 — this composes with feature flags, not the surface state machine.
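A minimal sketch of the write gate described above, assuming flags arrive as a string-to-string mapping (for example, environment variables). The function name write_gate, the response body shape, and the inclusion of PATCH as a mutating method are illustrative assumptions, not part of the existing codebase.

```python
from typing import Mapping, Optional, Tuple

# POST, PUT and DELETE are named in this memo; PATCH is an added assumption.
MUTATING_METHODS = frozenset({"POST", "PUT", "PATCH", "DELETE"})


def write_gate(method: str, flags: Mapping[str, str]) -> Optional[Tuple[int, dict]]:
    """Return (status, body) when the request must be rejected, else None.

    Hypothetical helper: a route handler or middleware would call this
    before dispatching and short-circuit on a non-None result.
    """
    writes_disabled = flags.get("FLAG_WRITES_DISABLED", "false").strip().lower() == "true"
    if writes_disabled and method.upper() in MUTATING_METHODS:
        body = {
            "error": "writes_disabled",
            "detail": "Writes are temporarily disabled; reads remain available.",
        }
        return 503, body
    return None
```

Reads pass through untouched, so the surface stays in its current state (Public or Private) while mutations are refused. A per-endpoint-group variant would take the flag name as a parameter instead of hard-coding it.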

Gap A2 — Rate-Limit Emergency Mode

What it is: Surface stays up but CF WAF rate-limit rules are tightened from normal thresholds to emergency thresholds (e.g., 1 req/5 sec per IP instead of 100 req/min). Legitimate low-rate users continue to function. Automated high-rate traffic is blocked.

When you need it:
- L7 DDoS where traffic is distributed across many IPs but still bursty enough to be rate-limitable. Per the WAF threat model (docs/security/waf-threat-model-2026-05-11.md Section T5), application-layer DDoS on compute-intensive endpoints (backtest, order submission) is the primary L7 threat.
- Credential stuffing waves on auth endpoints where the attack is below normal Block thresholds but clearly elevated.
- Sudden traffic surge from a linked social media post that is overwhelming the origin (not malicious, but needs throttling to protect capacity).

Why the three-state model misses it: Public serves normally, Protect takes the surface down. There is no middle state that tightens rate limits without full takedown.

Recommendation: Post-launch. This requires CF WAF API integration (toggling between rate-limit rule sets), which adds implementation complexity. The existing static WAF rules (once implemented per the WAF threat model) provide a baseline. Emergency rate-limit tightening can be done manually via the CF dashboard in the interim.

Gap A3 — Geo-Block Escalated

What it is: Surface stays up for most of the world but blocks traffic from specific countries or regions. More targeted than Protect (surface stays live for legitimate traffic) and more surgical than a global takedown.

When you need it:
- Attack pattern is clearly geographically clustered. Example: 90% of suspicious login-begin requests originate from one country.
- Pre-launch Quebec geo-block that is already planned for signup paths (per [[project_quebec_geoblock_decision]]; Path A chosen for 2026-05-23 launch). This is currently a specific path-level block, but a surface-level geo-escalation would allow extension to the entire surface if the attack scope widens.
- Operator is aware of a specific region conducting a targeted campaign.

Why the three-state model misses it: No state corresponds to "up everywhere except here." The CF WAF already supports country-level blocking per docs/security/waf-threat-model-2026-05-11.md S3 Custom Rule 7, but this is a static configuration, not a surface-state-machine-driven toggle.

Recommendation: Post-launch. Geo-block rules are best managed as CF WAF custom rules per surface, not as surface-state machine entries. The surface state machine should surface a "geo-restriction active" badge on the tile rather than owning the geo-block configuration. The CF WAF integration work is the prerequisite.

Gap A4 — Allowlist-Only

What it is: Surface is up but only accepts requests from an explicit IP allowlist (operator home, agent runners, known partners). All other traffic receives a 403 or redirect to a holding page.

When you need it:
- Full lockdown that is less disruptive than Protect (the surface is still functionally running for privileged users) but blocks the entire public internet.
- Useful for console.raxx.app during a suspected compromise: lock it down to known operator IPs while investigation proceeds without fully taking the console down (which would impair the investigation itself; investigation must never disable the control being used to investigate).
- Pre-production testing where you need the surface accessible to specific parties only.

Why the three-state model misses it: Private (CF Access) gates by identity; Allowlist-only gates by IP. These are different controls. A compromised CF Access identity would still be blocked by an IP allowlist but not by Private state. For console.raxx.app this distinction matters — operator account compromise is an insider-threat scenario where CF Access identity cannot be trusted.

Recommendation: Post-launch for most surfaces. For console.raxx.app and vault.raxx.app specifically, this is a useful incident response tool worth targeting for the first post-launch security hardening sprint. Implementation is a CF WAF custom rule per surface toggled via CF API.

Gap A5 — Maintenance

What it is: Surface is intentionally down with planned-maintenance messaging that is distinct from the attack/incident messaging used in Protect. The copy, customer-facing explanation, and estimated return time are different.

When you need it:
- Scheduled database migrations or infrastructure changes with a known window.
- Velvet secret rotation that requires a brief service interruption.
- Heroku dyno restart during a config change that causes a brief downtime.
- Any case where "we're upgrading the platform" is the honest message rather than "we're responding to an incident."

Why the three-state model misses it: Protect's "Status Down" copy implies an unplanned incident. Using Protect for a planned maintenance window either requires custom copy per invocation (which adds complexity) or sends inaccurate signals to customers (which erodes trust). The two scenarios are semantically different and should be distinguishable in the status.raxx.app entry as well.

Recommendation: Include in v1 as a fourth state. The implementation is nearly identical to Protect (takedown page + status entry) with three differences: (1) a different copy template ("Scheduled Maintenance" vs "Unexpected Downtime"), (2) a status.raxx.app entry that includes an estimated_return_at field, and (3) the option to schedule the state in advance rather than apply it immediately (a future capability; Section 5 limits v1 to immediate toggles). This is low implementation complexity given that Protect's infrastructure would already exist.
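The difference between the two takedown states can be sketched as a single status-entry builder. This is an illustration under assumptions: the StatusEntry type and build_status_entry function are hypothetical names, and recording started_at for Maintenance as well as Protect is an assumption (the memo only specifies it for Protect).

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class StatusEntry:
    """Hypothetical shape of a status.raxx.app entry."""
    surface: str
    title: str
    started_at: datetime
    estimated_return_at: Optional[datetime]


def build_status_entry(
    surface: str,
    state: str,
    now: datetime,
    estimated_return_at: Optional[datetime] = None,
) -> StatusEntry:
    """Build the entry for a takedown state; other states create no entry."""
    if state == "Maintenance":
        # Maintenance must carry an ETA; refuse to build the entry without one.
        if estimated_return_at is None:
            raise ValueError("Maintenance requires estimated_return_at (UTC)")
        return StatusEntry(surface, "Scheduled Maintenance", now, estimated_return_at)
    if state == "Protect":
        # Protect never advertises an ETA: the incident may still be active.
        return StatusEntry(surface, "Unexpected Downtime", now, None)
    raise ValueError(f"state {state!r} does not create a status entry")
```

Sharing one builder keeps the two states on the same code path, which is the basis for the low-complexity claim: only the copy and the ETA handling differ.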

Gap A6 — CF Challenge Mode

What it is: Surface stays up but every request is served a Cloudflare JS Challenge or Managed Challenge before being passed to the origin. Legitimate users with real browsers pass within a few seconds. Automated tools and most bots fail.

When you need it:
- Suspected bot scraping that is not volumetric enough to trigger Block rules.
- Pre-launch probing by a competitor or automated scanner.
- A DDoS wave that is sophisticated enough to pass rate limits but shows bot fingerprints (CF Bot Score between 10 and 30).
- As a lower-cost alternative to taking a surface fully down when the attack is primarily automated.

Why the three-state model misses it: This state exists between Public (no challenge) and Protect (full takedown). It allows maintaining service for legitimate users while filtering automated traffic. It is particularly relevant for api.raxx.app where Protect would cut off all customer access but Challenge would cut off only bots.

Recommendation: Post-launch. Requires CF API integration to toggle challenge mode per zone or per-surface WAF rule. Per [[feedback_cf_access_does_not_bypass_bot_fight_mode]], Bot Fight Mode and CF Access are separate layers and require careful composition. Include in the post-launch WAF hardening sprint.

Gap A7 — Service-Token-Only

What it is: Surface accepts only requests carrying a valid CF Access service token (i.e., machine-to-machine calls from agent runners, CI, Velvet). Human browser traffic is rejected.

When you need it:
- You need agent workflows (secret rotation, health checks, deploy monitoring) to continue operating while human access is suspended during an investigation.
- vault.raxx.app during a suspected human credential compromise: agent service tokens continue to function, human operator access is suspended until credentials are rotated and verified.
- Distinguishes between "humans are locked out" and "everything is locked out", a distinction that matters when the investigation requires automated tooling to keep running.

Why the three-state model misses it: Private (CF Access) gates all traffic with the same allowlist — it does not distinguish service tokens from human sessions. Protect takes everything down including agent calls.

Recommendation: Post-launch. For vault.raxx.app specifically this is a high-value incident response tool. CF Access service token policies with decision=non_identity (per [[feedback_cf_access_service_token_needs_non_identity]]) handle this correctly. Include in the post-launch security hardening sprint alongside Allowlist-Only.

Gap A8 — Read-Replica Only (DB writes at data layer)

What it is: For DB-backed surfaces, writes are rejected at the database layer regardless of what the application layer accepts. Differs from A1 (Read-Only) in that the rejection happens at the DB driver level (read-only replica connection) rather than at the route handler.

When you need it:
- DB master is failing writes but read replica is healthy. Surface can stay up in a degraded read-only mode without any application-layer change.
- Forced read-only mode during a backup-and-restore window.

Why the three-state model misses it: Surface state can't help here — this is a data-layer concern, not an edge-layer concern.

Recommendation: Skip for the surface state machine entirely. This is an infra-layer concern (sre-agent) and an application-layer concern (fail-safe handling when DB writes fail). Documented here as out of scope for surface state.

Section 3 — Recommended state set

States (5 for v1, 3 additional post-launch)

v1 states (recommended for 2026-05-23 launch or close following sprint):

State	One-line description
Public	Available to the public; no CF Access gate; normal WAF rules apply
Private	CF Access auth wall in front; only allowlisted identities reach the surface
Maintenance	Intentionally down with planned-maintenance copy; estimated return time shown on status page
Protect	Emergency takedown; branded "we're investigating" page; status.raxx.app entry created; Slack alert fires
Offline	Soft-archived or pre-launch — surface is registered but not actively serving public traffic

Offline captures the current state of surfaces like support.raxx.app (suppressed per #1701) and raxx-app-previews that are registered in the surface registry but not serving public traffic. It is distinct from Protect (emergency) and Maintenance (planned temporary down). Without it, the console dashboard has no way to represent a known-inactive surface without marking it as an incident.

Post-launch states (hardening sprint, not launch blockers):

State	One-line description
Challenge	CF Managed Challenge on every request; catches bots, allows legitimate browsers
Allowlist-Only	Restricted to a configured IP CIDR list; all other traffic blocked at CF layer
Service-Token-Only	Accepts only CF Access service-token-authenticated machine requests; human sessions rejected

Allowed state transitions

Offline ──────────────────────────► Public
Offline ──────────────────────────► Private
Private ◄────────────────────────► Public
Public ───────────────────────────► Maintenance
Private ──────────────────────────► Maintenance
Public ───────────────────────────► Protect
Private ──────────────────────────► Protect
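Maintenance ──────────────────────► Public   (window complete; see Section 5 RBAC table)
Maintenance ──────────────────────► Private  (window complete; see Section 5 RBAC table)
Maintenance ──────────────────────► Protect  (escalation during window)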
Protect ──────────────────────────► Private  (return to controlled access after incident)
Private ──────────────────────────► Public   (after incident is resolved and access is confirmed clean)

[Post-launch only]
Public ───────────────────────────► Challenge
Challenge ────────────────────────► Public
Challenge ────────────────────────► Protect  (escalation)
Public ───────────────────────────► Allowlist-Only
Allowlist-Only ───────────────────► Private
Allowlist-Only ───────────────────► Protect
Public ───────────────────────────► Service-Token-Only
Service-Token-Only ───────────────► Private
Service-Token-Only ───────────────► Protect

Transitions that require explicit operator justification (two-step confirmation with reason field):
- Any transition → Protect
- Any transition → Maintenance
- Protect → Private (incident declared-contained)

Transitions that require a TOTP elevation (per existing @require_totp_elevation middleware):
- Any transition → Protect
- Protect → any state (removal from emergency mode)
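The v1 transition rules above can be sketched as a lookup table plus one check function. This is a sketch under assumptions: V1_TRANSITIONS, TransitionCheck, and check_transition are hypothetical names, and the Maintenance → Public / Private exits are taken from the Section 5 RBAC table, which authorizes them.

```python
from dataclasses import dataclass

# Allowed v1 transitions, keyed by current state.
# Maintenance exits to Public/Private follow the Section 5 RBAC table.
V1_TRANSITIONS = {
    "Offline": {"Public", "Private"},
    "Public": {"Private", "Maintenance", "Protect"},
    "Private": {"Public", "Maintenance", "Protect"},
    "Maintenance": {"Public", "Private", "Protect"},
    "Protect": {"Private"},
}


@dataclass(frozen=True)
class TransitionCheck:
    allowed: bool
    needs_reason: bool  # two-step confirmation with a reason field
    needs_totp: bool    # TOTP elevation required


def check_transition(current: str, target: str) -> TransitionCheck:
    """Evaluate one requested transition against the v1 rules."""
    allowed = target in V1_TRANSITIONS.get(current, set())
    if not allowed:
        return TransitionCheck(False, False, False)
    needs_reason = target in {"Protect", "Maintenance"} or (
        current == "Protect" and target == "Private"
    )
    # Entering or leaving Protect always requires elevation.
    needs_totp = target == "Protect" or current == "Protect"
    return TransitionCheck(True, needs_reason, needs_totp)
```

Keeping the rules in data rather than branching logic means the post-launch states can be added by extending the table, without touching the check itself.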

Default state per surface type

Surface type	Default state	Rationale
Public customer-facing (raxx.app, app.raxx.app, getraxx.com, docs.raxx.app, status.raxx.app, demo.raxx.app)	Public (at/after launch) or Private (pre-launch)	Customer funnel surfaces; Private until launch, Public after
Operator admin (console.raxx.app)	Private	Always behind CF Access; no scenario where it should be Public
Credential/secret surfaces (vault.raxx.app)	Private	Class 3 surface per `docs/security/auth-posture.md`; CF Access + strong MFA always
Third-party tool operator surfaces (tickets.raxx.app)	Private	Class 3; FreeScout auth is not audited at our standard
Internal / static (raxx-mockups, raxx-app-previews, internal-docs.raxx.app)	Private or Offline	Class 4; CF Access is sole auth
Unbuilt / suppressed surfaces	Offline	Not yet routable; surface exists in registry but is not live

Section 4 — Per-state behavior matrix

For each state, the following controls describe what is engaged or disengaged at each layer.

Behavior	Public	Private	Maintenance	Protect	Offline	Challenge*	Allowlist-Only*	Service-Token-Only*
CF Access gate	Off	On	Off (takedown page served at CF layer)	Off (takedown page served at CF layer)	N/A	Off	Off	On (service tokens only, `decision=non_identity`)
WAF rules applied	Normal	Normal	Normal (until CF serves takedown)	Tightest preset: Bot Fight Mode ON, rate limits to minimum	N/A	Normal + JS Challenge on all requests	Normal + IP allowlist block	Normal
Origin reachable via Heroku URL	Only if `FLAG_ENFORCE_CF_ORIGIN=false` (existing gap)	Same gap as Public	CF-layer redirect intercepts before the request reaches Heroku	CF-layer redirect intercepts before the request reaches Heroku	N/A	Same gap as Public	CF-layer IP block intercepts before the request reaches Heroku	Same gap as Public
DB writes accepted (DB-backed surfaces)	Yes	Yes	No — app-layer gate returns 503 with maintenance copy	N/A (origin not reachable via CF)	N/A	Yes	Yes	Yes
status.raxx.app entry created	No	No	Yes — "Scheduled Maintenance" with `estimated_return_at`	Yes — "Unexpected Downtime" with `started_at`, no ETA	No	No	No	No
Slack alert fires	No	No	Yes — "Maintenance window started on [surface]"	Yes — "INCIDENT: [surface] moved to Protect by [actor]"	No	No	No	No
Audit row written	Yes (on every transition)	Yes	Yes	Yes	Yes	Yes	Yes	Yes
Human browser traffic served	Yes	Yes (if CF Access passes)	Maintenance page only	Protect/down page only	No	Yes (after challenge)	Only from allowlisted IPs	No
Agent/CI service-token traffic	Yes	Yes (service token must be in CF Access allowlist)	Maintenance page only	No	No	Yes (service tokens bypass JS challenge via non_identity decision)	Only from allowlisted IPs	Yes
Deploy pipeline allowed	Yes	Yes	Yes (deploys can land during maintenance window)	No — auto-engage per-surface deploy freeze	N/A	Yes	Yes	Yes

*Post-launch states

Critical implementation note on the Heroku origin row: Every state except Protect and Maintenance shows the existing Heroku origin bypass gap. This is the open finding documented in docs/security/web-surface-posture.md and docs/security/waf-threat-model-2026-05-11.md (HIGH-WAF-1). A surface-state toggle that places api.raxx.app in Protect via a CF-layer redirect is effective only against traffic transiting Cloudflare. On its own it does not block an attacker who already knows the .herokuapp.com URL; that path is closed only when the origin app also runs with FLAG_ENFORCE_CF_ORIGIN=true. To have security value, the Protect and Maintenance states must therefore be implemented as CF Workers or CF Bulk Redirects on top of origin enforcement, not as application-layer toggles.

FLAG_ENFORCE_CF_ORIGIN status as of 2026-05-17T00:00:00Z (prerequisite MET):

App	`FLAG_ENFORCE_CF_ORIGIN`	Status
`raxx-api-prod`	`true`	Live
`raxx-console-prod`	`true`	Flipped 2026-05-17 (see #1741)
`raxx-console-staging`	`true`	Flipped 2026-05-17 (see #1741)

The prerequisite is now met across all three production-relevant Heroku apps. The surface state machine has the foundation it needs to be non-theater for Protect and Maintenance states. Remaining Heroku apps (e.g. Velvet, Queue) that carry their own hostname exposure should be audited on the same cadence.

Section 5 — Guardrails

RBAC roles authorized to flip

State transition	Minimum role required	TOTP elevation required
Any surface: Public ↔ Private	`ops`	No
Any surface: → Maintenance	`ops`	No
Any surface: Maintenance → Public or Private	`ops`	No
Any surface: → Protect	`superadmin`	Yes
Any surface: Protect → Private	`superadmin`	Yes
Protect → Public (re-opening to public after incident)	`superadmin`	Yes

A new fine-grained RBAC role console-surface-state-toggle should be introduced per [[project_rbac_model]] naming convention (<app>-<resource>-<level>). The role grants ops-level state toggles (Public, Private, Maintenance) without full ops role. This allows delegating surface control to a future on-call operator who does not have full ops access. Protect transitions always require superadmin regardless of this role. For v1 (single operator), console-surface-state-toggle is assigned to ops and superadmin groups only.
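A minimal sketch of the authorization rule in the table above, assuming roles are passed as a set of role-name strings. The function name authorize_transition is hypothetical, and treating superadmin as implicitly holding ops-level toggle rights is an assumption based on the v1 group assignment described above.

```python
from typing import AbstractSet

# Roles that may perform ops-level toggles (Public, Private, Maintenance).
OPS_LEVEL_ROLES = frozenset({"ops", "superadmin", "console-surface-state-toggle"})


def authorize_transition(
    current: str,
    target: str,
    roles: AbstractSet[str],
    totp_elevated: bool,
) -> bool:
    """Return True if the actor may perform current -> target.

    This only checks role and elevation; whether the transition is
    allowed at all is a separate check against the state machine.
    """
    if "Protect" in (current, target):
        # Entering or leaving Protect: superadmin plus TOTP, no exceptions.
        return "superadmin" in roles and totp_elevated
    return bool(OPS_LEVEL_ROLES & set(roles))
```

The Protect branch is evaluated first so that the fine-grained toggle role can never reach an emergency transition, regardless of how it is assigned later.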

Two-person review requirement

For v1 (single operator): not applicable — there is no second person. The TOTP elevation requirement on Protect transitions serves as a friction gate that prevents accidental engagement.

Post-v1 if a second operator is added: require acknowledgement from a second active console session before Protect can be applied to api.raxx.app, app.raxx.app, or console.raxx.app. These three surfaces affect all customers.

Automatic vs. manual transitions

Manual only for all states at v1. The operator's goal ("don't have to run around") is served by a fast manual toggle, not by automation that might trigger incorrectly.

Post-launch hardening candidates for automatic transitions:
- Auto-engage Protect if WAF block rate on a surface exceeds N events/minute for Y consecutive minutes (operator-configurable threshold). Requires CF Logpush integration.
- Auto-engage Protect if the surface's liveness probe fails 5 consecutive checks (current status poller already runs these probes; this would be a new action on consecutive failures, not a new data source).
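Both candidates reduce to the same test: has the most recent run of bad samples reached a threshold? A sketch of that test follows, assuming samples are supplied oldest-first as booleans. The function name should_auto_protect is hypothetical, and nothing here is proposed for v1.

```python
from typing import Iterable


def should_auto_protect(samples_ok: Iterable[bool], threshold: int = 5) -> bool:
    """Return True if the trailing run of failed samples is >= threshold.

    samples_ok is ordered oldest-first; True means the probe passed (or the
    WAF block rate was under its limit for that minute).
    """
    streak = 0
    for ok in samples_ok:
        # Any healthy sample resets the streak, so only consecutive
        # failures at the end of the window count.
        streak = 0 if ok else streak + 1
    return streak >= threshold
```

Counting only the trailing streak means a surface that has already recovered does not trigger a late takedown, which is the failure mode that makes automatic transitions risky.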

Auto-cascade is explicitly not recommended for v1. See Section 6.

Time-bound states and auto-revert

Maintenance state: must include a required estimated_return_at timestamp (UTC). If the surface is still in Maintenance state when estimated_return_at + 30 minutes passes without a manual extension or return-to-normal, the console dashboard tile emits a high-priority warning: "Maintenance window overrun." No automatic state change — human action required to exit Maintenance.

Protect state: no automatic revert. The incident may still be active. Protect persists until explicitly cleared by a superadmin with TOTP elevation. If Protect has been active for more than 6 hours without a renewal acknowledgement, the console tile emits a warning: "Protect state active 6h — confirm still needed."

Private state: no expiry.

Challenge, Allowlist-Only, Service-Token-Only (post-launch): all should support an optional expires_at timestamp, after which the state auto-reverts to the previous state with an audit row recording the auto-revert attributed to system. This prevents forgotten emergency configs from persisting indefinitely.
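The Maintenance and Protect warning rules above can be sketched as one pure function evaluated on each dashboard refresh. The function name tile_warning, the returned warning codes, and the last_ack_at parameter (the time of the most recent renewal acknowledgement) are illustrative assumptions. The function only reports; it never changes state.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

MAINTENANCE_GRACE = timedelta(minutes=30)
PROTECT_REVIEW_AFTER = timedelta(hours=6)


def tile_warning(
    state: str,
    now: datetime,
    entered_at: datetime,
    estimated_return_at: Optional[datetime] = None,
    last_ack_at: Optional[datetime] = None,
) -> Optional[str]:
    """Return a warning code for the surface tile, or None."""
    if state == "Maintenance" and estimated_return_at is not None:
        if now > estimated_return_at + MAINTENANCE_GRACE:
            return "maintenance_overrun"
    if state == "Protect":
        # The 6h clock restarts from the latest acknowledgement, if any.
        reference = last_ack_at if last_ack_at is not None else entered_at
        if now - reference > PROTECT_REVIEW_AFTER:
            return "protect_review_due"
    return None
```

All timestamps are expected to be timezone-aware UTC values; mixing naive and aware datetimes raises a TypeError in Python, which is a useful guard here.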

Maintenance scheduling support

For v1: immediate toggle only. No scheduled future transitions.

Post-launch: add a scheduled_at field to the state-change record. A background worker checks for pending scheduled transitions and applies them, writing an audit row attributed to system with the scheduled actor ID in payload_redacted.

Section 6 — Cascading dependencies

Surface dependency graph

The following surfaces have direct functional dependencies on other surfaces. A state change on an upstream surface degrades the dependent surface's functionality even if the dependent surface remains in its current state.

api.raxx.app (Raptor)
  ├── app.raxx.app depends on api.raxx.app (all customer-facing functionality)
  ├── console.raxx.app depends on api.raxx.app (health checks, some data reads)
  └── demo.raxx.app depends on api.raxx.app (demo backtest and strategy flows)

queue service (raxx-queue-prod.herokuapp.com)
  ├── api.raxx.app depends on queue (auth, RBAC, audit)
  ├── app.raxx.app depends on queue (via api.raxx.app)
  └── console.raxx.app depends on queue (customer data reads)

vault.raxx.app (Infisical)
  ├── Velvet rotation service depends on vault (secret reads and writes)
  ├── CI/CD deploy pipeline depends on vault (secret injection at deploy time)
  └── Raptor, Console, Queue depend on vault at startup/config-load time
      Per [[feedback_aws_workloads_use_ssm_not_vault]]: workload runtime secrets
      live in SSM; vault dependency is deploy-time and startup-time only.

tickets.raxx.app (FreeScout)
  └── No functional dependencies from other surfaces. Isolated third-party tool.

status.raxx.app
  └── No functional dependencies. Must remain up even when all other surfaces are down.
      Structurally guaranteed by CF Pages hosting.

console.raxx.app
  └── No inbound functional dependencies from customer surfaces.
      The deploy freeze check (per docs/ops/runbooks/deploy-freeze.md) depends on
      console being reachable; break-glass path (DEPLOY_FREEZE_OVERRIDE GH secret) exists.

getraxx.com, docs.raxx.app, raxx.app, raxx-mockups, raxx-app-previews, internal-docs.raxx.app
  └── Static CF Pages sites. No functional dependencies on other surfaces.

Cascade behavior recommendation: warn-only for v1

Do not implement auto-cascade for v1.

If api.raxx.app is placed in Protect, auto-cascading Protect to app.raxx.app is technically correct (the app is non-functional without the API) but adds complexity, creates a larger blast radius for accidental Protect engagement, and removes the operator's ability to leave the app surface accessible while the API is in maintenance (so customers can read an in-app message rather than encounter a hard Cloudflare error page).

The correct v1 behavior: when api.raxx.app transitions to Protect or Maintenance, the console dashboard displays dependency warnings on all dependent surface tiles — "api.raxx.app is in Protect — this surface may be degraded." The operator decides whether to also Protect dependent surfaces.
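A sketch of the warn-only behavior, assuming the dependency graph is held as a static mapping from upstream surface to the surfaces listed as depending on it. DEPENDENTS and dependency_warnings are hypothetical names, and the message wording is illustrative. Vault is omitted from the mapping because its dependents (Velvet, CI/CD) are not dashboard surfaces and it has its own explicit warning.

```python
from typing import Dict, List

# Upstream surface -> surfaces listed as depending on it in the graph above.
DEPENDENTS: Dict[str, List[str]] = {
    "api.raxx.app": ["app.raxx.app", "console.raxx.app", "demo.raxx.app"],
    "queue": ["api.raxx.app", "app.raxx.app", "console.raxx.app"],
}

# Only takedown states degrade dependents.
WARN_STATES = frozenset({"Protect", "Maintenance"})


def dependency_warnings(surface: str, new_state: str) -> Dict[str, str]:
    """Map each dependent surface tile to the warning it should display.

    Returns an empty dict when the new state is not a takedown state or
    the surface has no listed dependents. No state is changed here.
    """
    if new_state not in WARN_STATES:
        return {}
    message = f"{surface} is in {new_state}; this surface may be degraded."
    return {dependent: message for dependent in DEPENDENTS.get(surface, [])}
```

For example, dependency_warnings("api.raxx.app", "Protect") yields one warning each for app.raxx.app, console.raxx.app, and demo.raxx.app, and leaves the decision to Protect them with the operator.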

One exception to warn-only: when any surface transitions to Protect, the per-surface deploy freeze should be automatically engaged for that surface's deploy workflow. This prevents a deploy from landing on a surface in emergency mode. This extends #635 to the per-surface scope referenced in the deploy-freeze runbook as #1709.

vault.raxx.app special case: If vault goes to Protect, the Velvet secret-rotation cron will fail to reach vault. The operator must be warned explicitly on the vault tile: "vault.raxx.app is in Protect — Velvet rotation cron will fail. CF Access service tokens for machine identities will not pass the Protect redirect." Consider Service-Token-Only state (post-launch) for vault emergencies rather than full Protect, to preserve machine-identity access during investigation.

Dependency impact matrix

If this surface → Protect	Degraded surfaces	Auto-cascade recommended?
api.raxx.app	app.raxx.app (non-functional), demo.raxx.app (non-functional), console.raxx.app (degraded)	Warn-only
queue service	api.raxx.app (auth fails), app.raxx.app (non-functional), console.raxx.app (customer data unavailable)	Warn-only
vault.raxx.app	Velvet rotation cron fails; CI deploy secret injection fails at next deploy	Warn-only; surface the vault-specific warning explicitly
console.raxx.app	None (isolated admin plane)	N/A
tickets.raxx.app	None	N/A
status.raxx.app	None	Protect must be disabled for status.raxx.app — see below
getraxx.com, docs.raxx.app, raxx.app	None	N/A

status.raxx.app must be excluded from the Protect toggle entirely. The status page must remain reachable even during an attack on all other surfaces. It is the customer communication channel during incidents. Either exclude it from the state toggle UI or disable the Protect transition for that surface with an explicit UI label: "Status page cannot be taken offline — it is the customer communication channel during incidents."

Queue identity service interaction with surface state

Context (locked 2026-05-09): Queue is the authoritative identity and auth service for api.raxx.app. All authentication and RBAC evaluation for customer requests to Raptor flows through Queue. api.raxx.app calls Queue for every authenticated request: session validation, role resolution, and audit emission.

The gap this creates: The surface state machine as originally designed treats the CF Worker / Bulk Redirect layer as the primary Protect mechanism for api.raxx.app. That CF-layer protection is effective for stopping inbound traffic. However, if the incident is Queue-specific (a compromised Queue instance, a Queue-side auth bypass, or a Queue availability failure), a surface state change on api.raxx.app at the CF layer does not address the Queue-layer exposure.

Two scenarios and the recommended posture for each:

Scenario A — api.raxx.app is the affected surface, Queue is healthy: Surface state toggle on api.raxx.app → Protect is sufficient. CF Worker intercepts traffic before Raptor. Queue is not involved because Raptor never receives the request. No coordinated Queue state change required. This is the common case.

Scenario B — Queue is the affected service (auth bypass, compromise, or availability failure): Placing api.raxx.app in Protect at the CF layer is still the correct first action — it stops customer traffic from reaching Raptor while investigation proceeds. However, this alone is not sufficient isolation if the Queue instance itself is compromised or behaving incorrectly, because:

- Queue is an internal Heroku service reachable by Raptor regardless of CF-layer surface state.
- A compromised Queue could still emit forged auth tokens or corrupt audit rows even while api.raxx.app is in Protect (if Raptor itself is still reachable from Queue's side).

Recommended posture for Scenario B:

When Queue is the suspected origin of an incident, the operator should:

- Transition api.raxx.app → Protect (blocks all customer traffic at CF layer immediately).
- Scale Queue dyno count to 0 via Heroku CLI (heroku ps:scale web=0 --app raxx-queue-prod) to stop the Queue service itself. This is an operational action outside the surface state machine, not a surface state toggle (Queue does not have its own CF-layer surface entry; it is an internal Heroku service, not a publicly routed surface).
- Investigate with Queue logs and audit trail before scaling back.

What "Queue remains live during Protect" means in practice:

For Scenario A (the common case), Queue identity should remain live when api.raxx.app is in Protect. There is no reason to stop Queue during a non-Queue incident, and doing so would impair the operator's ability to use the console (which calls Queue for customer data reads). The surface state machine's Protect action on api.raxx.app is not a signal to shut down Queue.

Console runbook addition (recommended): The operator runbook for api.raxx.app → Protect should include a decision fork: "Is Queue suspected as the origin of this incident? If yes, follow Queue-specific runbook (scale to 0 + investigate). If no, Protect on api.raxx.app is sufficient." This decision fork does not need to be automated — it is a human judgment call that happens in the first minutes of incident response.

Post-launch hardening: A dedicated Queue surface entry in the console surface dashboard (currently invisible because Queue is an internal Heroku service with no CF-layer presence) would allow Queue state to be tracked alongside other surfaces. This is post-launch scope; for v1, the Queue operational posture relies on the Heroku CLI action documented in the runbook.

Section 7 — What this doesn't cover

The surface-state machine is a blunt instrument. Several real attack scenarios require controls that are adjacent to but not part of the state machine.

User-level controls (account takeover)

If a single customer account is compromised, the correct response is account-level session revocation, not a surface-level state change. The surface state machine has no knowledge of individual user sessions. User-level kill switches (revoke all sessions for a specific customer, disable a specific customer's account) belong in a customer-state model that the Queue service owns. Recommend a separate user-state machine as a post-launch hardening item. This is not designed yet.

Deploy pipeline (supply chain attacks)

A compromised dependency (a malicious npm or Python package injected into the supply chain) is not addressed by surface state. The surface is already running the malicious artifact; putting it in Protect does not remove the artifact. The correct response is a global deploy freeze (#635) plus manual investigation. Surface state cannot remove a malicious artifact that is already deployed, and the freeze can only prevent further deploys from landing. The per-surface deploy freeze (#1709) applies the same gate to individual surfaces.

Endpoint-level kill switches (feature flags)

The operator's proposed model is surface-wide. Endpoint-level controls belong to the feature flag system, not the surface state machine. The two systems compose rather than compete:

Surface state machine controls CF-layer and origin-reachability for an entire surface.
Feature flags control individual endpoints or behaviors within a surface that is up.

The read-only / write-disabled scenario (Gap A1) is best implemented as a feature flag (FLAG_WRITES_DISABLED, FLAG_ORDER_SUBMISSION_DISABLED) rather than a surface state. Surface state answers "is the surface accessible?"; feature flags answer "what can you do once you're in?"

This composition means a surface can be Public (accessible) with specific feature flags disabled (certain endpoints returning 503). This is more surgical than surface state changes and does not require CF API calls.
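
The composition can be sketched as a two-step evaluation: surface state decides whether a request reaches the surface at all, and feature flags decide what it may do once inside. This is a minimal illustration, not the planned implementation. The function name, the outcome labels, and the lowercase state strings are assumptions made for the example; only FLAG_WRITES_DISABLED and the 503 behaviour come from this memo.

```python
def resolve_request(surface_state, authenticated, is_write, flags):
    """Return an outcome label for a request (labels are illustrative)."""
    # Step 1: surface state answers "is the surface accessible?"
    if surface_state == "protect":
        return "takedown_page"       # branded page served at the CF layer
    if surface_state == "private" and not authenticated:
        return "cf_access_login"     # CF Access auth wall
    # Step 2: feature flags answer "what can you do once you're in?"
    if is_write and flags.get("FLAG_WRITES_DISABLED", False):
        return "503"                 # write endpoints disabled by flag
    return "ok"


# A Public surface with writes disabled still serves reads.
flags = {"FLAG_WRITES_DISABLED": True}
read_outcome = resolve_request("public", False, False, flags)
write_outcome = resolve_request("public", False, True, flags)
```

In this sketch the flag check never runs for a surface in Protect, which matches the ordering described above: the flag only matters for a surface that is up.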

L4 DDoS

A TCP-flood (L4) DDoS is absorbed by Cloudflare's network-layer DDoS protection regardless of surface state. The surface state machine has no lever for L4 attacks. If Cloudflare's automatic DDoS protection is overwhelmed, the correct escalation is to Cloudflare support, not a surface state change.

Slow-loris / connection exhaustion

Connection exhaustion attacks target Heroku dynos by holding connections open. Surface state changes at the CF layer do not help: for proxied traffic, CF already terminates client connections before they reach the origin, so the remaining exposure is connections made directly to the Heroku origin. The correct defense is FLAG_ENFORCE_CF_ORIGIN=true on all origin apps (which blocks direct-to-Heroku connections) plus Heroku connection timeout settings. This is an sre-agent concern.

Insider threat (operator account compromise)

If the operator's CF Access or console account is compromised, the attacker can toggle the surface state machine themselves. The surface state machine is not the right control for this scenario. The right controls are: strong operator authentication (TOTP elevation already required for Protect), CF Access audit log monitoring (in scope for this security agent), and short session lifetimes (per docs/security/auth-posture.md Section 9). The surface state machine cannot protect against an attacker who controls the account making the state change.

Honest scope statement

The surface state machine is a response-speed tool. It compresses the time between "operator recognizes something is wrong" and "the impacted surface is in a controlled state." It does not prevent an attack; it limits the damage window after an attack is detected. Its security value depends entirely on two things: (1) the operator detecting the attack quickly, and (2) the Protect and Maintenance states being implemented at the CF layer, not just the app layer. FLAG_ENFORCE_CF_ORIGIN=true is now live on all production-relevant Heroku apps as of 2026-05-17T00:00:00Z (see Section 4 status table). The foundational prerequisite is met; the remaining gap is delivering the Protect and Maintenance states via CF Worker / Bulk Redirect at the CF layer rather than at the app layer.

Section 8 — Recommendations for v1 launch

Strict minimum for launch day (2026-05-23 UTC)

The Public/Private toggle is the minimum viable control plane and is already largely in place for pre-launch gating. The addition is a console-driven toggle with an audit row per state change, turning what is currently a manual CF dashboard operation into a one-click console action with an audit trail.

Public and Private states — toggle CF Access on/off per surface via CF Access API. Audit row on every toggle. No status.raxx.app integration required for these two states.
Audit row on every state change — action=surface.state.changed, target_type=surface, target_id=surface_id, payload_redacted containing old state, new state, actor ID, and reason. Written synchronously before the CF API call.
Deploy freeze auto-engagement on Protect — when any surface transitions to Protect, automatically engage the per-surface deploy freeze per #1709. This prevents deploys from landing on surfaces in emergency mode.
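
The ordering in the three items above (audit row first, then the CF API call, then the deploy freeze on Protect) can be sketched as follows. This is a hedged illustration only. The function name, the AuditRow dataclass, and the three injected callables (write_audit, call_cf, engage_freeze) are hypothetical stand-ins for whatever the console backend actually uses; the audit field names are taken from this memo. Engaging the freeze after the CF call is an assumption, since the memo does not fix that order.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class AuditRow:
    action: str
    target_type: str
    target_id: str
    payload_redacted: dict
    created_at: datetime


def change_surface_state(surface_id, old_state, new_state, actor_id, reason,
                         write_audit, call_cf, engage_freeze):
    """Apply a state change with the audit row written before any CF call."""
    row = AuditRow(
        action="surface.state.changed",
        target_type="surface",
        target_id=surface_id,
        payload_redacted={
            "old_state": old_state,
            "new_state": new_state,
            "actor_id": actor_id,
            "reason": reason,
        },
        created_at=datetime.now(timezone.utc),
    )
    # 1. Synchronous audit write. If this raises, nothing below runs,
    #    so no CF change can exist without an audit row.
    write_audit(row)
    # 2. CF-layer change (hypothetical wrapper around the CF API).
    call_cf(surface_id, new_state)
    # 3. Per-surface deploy freeze (#1709) only on transitions to Protect.
    if new_state == "protect":
        engage_freeze(surface_id)
    return row


# Usage with recording stubs in place of real integrations.
calls = []
row = change_surface_state(
    "api.raxx.app", "public", "protect", "admin-1", "suspected incident",
    write_audit=lambda r: calls.append("audit"),
    call_cf=lambda s, st: calls.append("cf"),
    engage_freeze=lambda s: calls.append("freeze"),
)
```

Passing the integrations in as callables keeps the ordering testable without touching Cloudflare or the database.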

Recommended for launch but not blocking

Protect state — CF Worker or CF Bulk Redirect serves the branded takedown page at the CF layer. status.raxx.app entry created automatically. Slack DM to operator. TOTP elevation required. The status.raxx.app exclusion from Protect must be enforced at the UI level.
Maintenance state — distinct copy from Protect. Required estimated_return_at in UTC. status.raxx.app "Scheduled Maintenance" entry. Near-zero additional engineering if Protect infrastructure is built first.
Dependency warning badges — when an upstream surface changes state, dependent tiles show a warning badge. No auto-cascade. Warn only.
Offline state — surfaces in the registry that are not actively serving traffic get this state. Prevents confusing "healthy" vs "incident" display for known-inactive surfaces like support.raxx.app.
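
The warn-only behaviour of the dependency badges can be sketched as a lookup that returns which tiles to badge and changes no state. The dependency map below is hypothetical and exists only for the example; the real dependency data is not defined in this section. The sketch also assumes that only direct dependents are badged.

```python
# Hypothetical map: surface -> set of upstream surfaces it depends on.
DEPENDS_ON = {
    "app.raxx.app": {"api.raxx.app"},
    "console.raxx.app": {"api.raxx.app"},
    "demo.raxx.app": set(),
}


def tiles_to_warn(depends_on, changed_surface):
    """Return the surfaces that directly depend on changed_surface.

    Warn only: the caller shows a badge on these tiles and never
    changes their state (no auto-cascade).
    """
    return sorted(
        surface
        for surface, upstreams in depends_on.items()
        if changed_surface in upstreams
    )


badged = tiles_to_warn(DEPENDS_ON, "api.raxx.app")
```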

Post-launch hardening

Challenge state (CF WAF rule-set toggle via CF API)
Allowlist-Only state (CF WAF custom rule toggle)
Service-Token-Only state (CF Access service-token policy toggle, particularly for vault.raxx.app)
Scheduled transitions (background worker + scheduled_at field)
Two-person review (when second operator is added to the platform)
Auto-Protect triggers (CF Logpush integration + threshold-based auto-engage)
User-state machine (separate system; Queue service owner)

Implementation complexity per state

State	CF API calls required	New infra	App/console changes	Complexity
Public	Yes (CF Access policy toggle)	No	Audit row, RBAC gate	Low
Private	Yes (CF Access policy toggle)	No	Audit row, RBAC gate	Low
Offline	Possible (DNS deregistration)	No	Registry update, audit row	Low
Maintenance	Yes (CF Worker deploy or Bulk Redirect)	CF Worker	Audit row, status.raxx.app API call, estimated_return_at field	Medium
Protect	Yes (CF Worker deploy or Bulk Redirect — same infra as Maintenance)	Shared with Maintenance	Audit row, status.raxx.app call, deploy freeze call, Slack DM	Medium
Challenge	Yes (CF WAF rule-set toggle)	WAF rule presets per surface	Audit row	High
Allowlist-Only	Yes (CF WAF custom rule toggle)	WAF custom rules per surface	Audit row	Medium-High
Service-Token-Only	Yes (CF Access policy modification)	No	Audit row	High

Shared infrastructure note: Maintenance and Protect share the same CF delivery mechanism (CF Worker or Bulk Redirect serving a static page). Building one builds the other. The implementation cost for Protect is near-zero if Maintenance infrastructure is already built. Recommend implementing both in the same sprint.
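
To illustrate why building one page builds the other, the sketch below renders both pages from a single function and varies only the copy and the estimated_return_at line. It is an assumption-laden illustration: the function name, the HTML shape, and the status page URL form are invented for the example, and a real CF Worker would serve an equivalent static page rather than run Python.

```python
from datetime import datetime, timezone
from html import escape

# Copy per state; "Status Down" and "Scheduled Maintenance" are the
# phrases used in this memo, the rest of the page is illustrative.
COPY = {"protect": "Status Down", "maintenance": "Scheduled Maintenance"}


def render_takedown_page(state, surface_name, started_at,
                         estimated_return_at=None):
    """Render the static page body for Protect or Maintenance."""
    if state not in COPY:
        raise ValueError(f"no takedown page for state {state!r}")
    if state == "maintenance" and estimated_return_at is None:
        raise ValueError("Maintenance requires estimated_return_at")
    started = started_at.astimezone(timezone.utc).isoformat()
    lines = [
        f"<h1>{escape(COPY[state])}</h1>",
        f"<p>{escape(surface_name)}</p>",
        f"<p>Started: {started}</p>",
    ]
    if state == "maintenance":
        eta = estimated_return_at.astimezone(timezone.utc).isoformat()
        lines.append(f"<p>Estimated return: {eta}</p>")
    lines.append('<p><a href="https://status.raxx.app">Status page</a></p>')
    return "\n".join(lines)


start = datetime(2026, 5, 23, 12, 0, tzinfo=timezone.utc)
eta = datetime(2026, 5, 23, 14, 0, tzinfo=timezone.utc)
protect_page = render_takedown_page("protect", "api.raxx.app", start)
maintenance_page = render_takedown_page("maintenance", "api.raxx.app",
                                        start, eta)
```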

Section 9 — Sub-card slate for PM

Cards are listed for PM to lift and groom. Not filed by this agent.

Core infrastructure (pre-launch priority)

surface-state: DB schema — add surface_states table (surface_id, state, changed_by_admin_id, reason, estimated_return_at, changed_at, FK to surface registry). Add surface_state_history for full audit trail.
surface-state: Toggle API endpoint — POST /api/surfaces/<surface_id>/state with body {state, reason, estimated_return_at?}. @require_role("ops") for Public/Private/Maintenance; @require_role("superadmin") + @require_totp_elevation for Protect. Audit row on every call before CF API call.
surface-state: Console tile UI — toggle on each dashboard tile with confirmation modal for destructive transitions (→ Protect, → Maintenance). Current state badge, changed_at, actor display. Dependency warning badges when upstream surface changes state. status.raxx.app Protect transition disabled on that tile with explanatory label.
surface-state: Audit emission — surface.state.changed event to audit_log with old_state, new_state, reason, actor_id, ip_address. Written before CF API call.
surface-state: status.raxx.app integration — API/webhook call on → Protect and → Maintenance. Entry shape: surface_id, state_type (incident vs maintenance), started_at (UTC), estimated_return_at (Maintenance only).
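
A possible shape for the schema card is sketched below, using an in-memory SQLite database purely so the example runs on its own. The column names for surface_states come from the card; the column types, the stand-in surfaces registry table, the columns of surface_state_history, and both CHECK constraints are assumptions, and the production database may differ.

```python
import sqlite3

SCHEMA = """
CREATE TABLE surfaces (            -- stand-in for the surface registry
    surface_id TEXT PRIMARY KEY
);
CREATE TABLE surface_states (      -- one current row per surface
    surface_id TEXT PRIMARY KEY REFERENCES surfaces(surface_id),
    state TEXT NOT NULL CHECK (state IN
        ('public', 'private', 'protect', 'maintenance', 'offline')),
    changed_by_admin_id TEXT NOT NULL,
    reason TEXT NOT NULL,
    estimated_return_at TEXT,      -- UTC ISO 8601; required for maintenance
    changed_at TEXT NOT NULL,
    CHECK (state != 'maintenance' OR estimated_return_at IS NOT NULL)
);
CREATE TABLE surface_state_history (   -- append-only trail
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    surface_id TEXT NOT NULL REFERENCES surfaces(surface_id),
    old_state TEXT,
    new_state TEXT NOT NULL,
    changed_by_admin_id TEXT NOT NULL,
    reason TEXT NOT NULL,
    changed_at TEXT NOT NULL
);
"""


def open_db():
    """Create an in-memory database with foreign keys enforced."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


conn = open_db()
conn.execute("INSERT INTO surfaces VALUES ('api.raxx.app')")
conn.execute(
    "INSERT INTO surface_states VALUES (?, ?, ?, ?, ?, ?)",
    ("api.raxx.app", "private", "admin-1", "pre-launch gating",
     None, "2026-05-13T00:00:00Z"),
)
```

The table-level CHECK mirrors the requirement that Maintenance carries an estimated_return_at; enforcing it in the database is a design choice for the sketch, not something this memo mandates.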

Per-state implementations

surface-state: CF Access toggle (Public ↔ Private) — CF Access API call to enable/disable the CF Access application per surface. Uses CF_ACCESS_MGMT token per docs/security/auth-posture.md §8.
surface-state: Protect CF Worker — CF Worker (or Bulk Redirect) intercepting all requests to the surface's hostname and serving the branded takedown HTML template. Deployable via CF API without Terraform apply. Per-surface template with surface name, started_at, status page link.
surface-state: Maintenance CF Worker — same infrastructure as Protect Worker, different HTML template with "Scheduled Maintenance" copy and estimated_return_at display.
surface-state: Per-surface deploy freeze on Protect — extend #635 to auto-engage per #1709 when state → Protect. Auto-release when state exits Protect.
surface-state: Slack DM on Protect — POST to Kristerpher's DM channel on → Protect with: surface name, actor, reason, status page link. Per-event, not digest (Protect qualifies as a security CRIT/HIGH event per #1364).
surface-state: Maintenance overrun warning — background check: if surface is in Maintenance and now > estimated_return_at + 30 min, emit a high-priority dashboard warning on the tile. No auto-revert.
surface-state: Protect 6-hour acknowledgement — if surface has been in Protect for 6+ hours without a renewal, emit a warning on the tile: "Protect state active 6h — confirm still needed."
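
The two time-based warnings in the last two cards reduce to simple threshold checks. The sketch below is illustrative: the function names and the last_confirmed_at parameter are hypothetical, and treating "6+ hours" as inclusive is an assumption. Both functions only report a condition; neither reverts or changes state.

```python
from datetime import datetime, timedelta, timezone

MAINTENANCE_GRACE = timedelta(minutes=30)
PROTECT_ACK_INTERVAL = timedelta(hours=6)


def maintenance_overrun(estimated_return_at, now):
    """True once now is past estimated_return_at plus the 30 min grace."""
    return now > estimated_return_at + MAINTENANCE_GRACE


def protect_needs_ack(last_confirmed_at, now):
    """True once Protect has gone 6 hours or more without a renewal."""
    return now - last_confirmed_at >= PROTECT_ACK_INTERVAL


eta = datetime(2026, 5, 23, 14, 0, tzinfo=timezone.utc)
within_grace = maintenance_overrun(eta, eta + timedelta(minutes=30))
overrun = maintenance_overrun(eta, eta + timedelta(minutes=31))
```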

RBAC

surface-state: RBAC role console-surface-state-toggle — new fine-grained RBAC role. Grants Public/Private/Maintenance/Offline toggles. Does not grant Protect (superadmin + TOTP). Assign to ops and superadmin groups at creation.
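
The authorization rule in this card can be sketched as a single predicate. The role name and the state-to-role split come from the card; the function name, the lowercase state strings, and the representation of a caller's effective roles as a set are assumptions for the example.

```python
TOGGLE_ROLE = "console-surface-state-toggle"
ROLE_GRANTED_STATES = {"public", "private", "maintenance", "offline"}


def may_set_state(target_state, roles, totp_elevated):
    """Return True if a caller with these roles may set target_state."""
    if target_state == "protect":
        # Protect is never granted by the toggle role alone.
        return "superadmin" in roles and totp_elevated
    if target_state in ROLE_GRANTED_STATES:
        return TOGGLE_ROLE in roles
    return False  # unknown states are denied by default


ops_roles = {"ops", TOGGLE_ROLE}
ops_can_maintain = may_set_state("maintenance", ops_roles, False)
ops_can_protect = may_set_state("protect", ops_roles, True)
```

Denying unknown states by default means post-launch states (Challenge, Allowlist-Only, Service-Token-Only) stay unavailable in this sketch until they are explicitly added.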

Post-launch sub-cards

surface-state: Challenge state — CF WAF rule preset toggle per surface via CF WAF API.
surface-state: Allowlist-Only state — CF WAF custom rule toggle. Requires operator home CIDR and agent runner CIDR config.
surface-state: Service-Token-Only state — CF Access policy modification to restrict to decision=non_identity. Highest value for vault.raxx.app incident response.
surface-state: Scheduled transitions — scheduled_at field + background worker.
surface-state: Auto-Protect triggers — CF Logpush integration + threshold configuration. Operator approval gate before enabling.
surface-state: User-state machine — separate card set; Queue service owner; user-level session revocation and account disable, distinct from surface-level controls.