RCA — Raptor prod release command failed: in-process smoke blocked by CF origin guard
Incident ID: 2026-06-21-release-smoke-cf-origin-guard Date: 2026-06-21 Severity: SEV-3 Duration: ~6 minutes (19:29 UTC dispatch → 19:35 UTC confirmed, release failure noticed immediately post-deploy) Blast radius: Heroku marked raxx-api-prod v154 release as failed. No user impact — app is healthy, migration applied, dyno running. Author: sre-agent
Summary
The Raptor prod deploy for the 2026-06-21 six-PR wave succeeded in all meaningful respects: migration 0046 applied cleanly, the dyno booted, and /health returned 200. However, the Heroku release phase exited non-zero because 14 in-process smoke tests across two files (test_beta_walk_route_smoke_3682.py and test_beta_walkthrough_route_smoke_3455.py) failed with 403 direct_origin_blocked instead of expected 404/401 responses. The cause: FLAG_ENFORCE_CF_ORIGIN=1 is set on raxx-api-prod config-vars. The in-process test _make_inprocess_client() helper calls create_app(), which inherits FLAG_ENFORCE_CF_ORIGIN=1 from the dyno environment — causing the CF origin guard to fire on every test-client request before route logic executes. The fix is to override FLAG_ENFORCE_CF_ORIGIN=0 in the _TEST_APP_CONFIG dict used by both smoke files.
Timeline (all times UTC)
- 19:29 — SRE dispatched
deploy-heroku.yml --field environment=production(run 27915058139) - 19:29 — Freeze check: success; PII scan: success
- 19:32 — Smoke gate passed; deploy job started
- 19:35 — Heroku push completed; release phase ran:
alembic upgrade 0044 → 0045 → 0046applied cleanly - 19:35 — Release phase continued to pytest smoke suite; 14 tests failed with 403; release phase exited non-zero
- 19:35 — Heroku marked v154 as "release command failed"; dyno remained on previously-healthy code briefly, then v154 deployed anyway (Heroku deploys the slug even when the release command fails — the release command failure is logged but does not prevent the dyno from starting)
- 19:35 — SRE polled
gh run view 27915058139 --json conclusion:success(GH Actions workflow succeeded because the health check passed; the Heroku release command failure is not surfaced to GH Actions) - 19:36 —
heroku releases --app raxx-api-prodshowed v154 with "release command failed" annotation - 19:36 —
heroku releases:output v154pulled: confirmed Alembic 0046 applied, then 14 pytest failures, all 403direct_origin_blocked - 19:37 — Root cause confirmed:
FLAG_ENFORCE_CF_ORIGIN=1leaks into in-processcreate_app()call; guard fires before route logic - 19:40 —
/health200, DB at 0046 head confirmed. No rollback warranted.
Impact
- Users affected: 0 (beta program not yet open; FLAG_BETA_NDA_ACK and FLAG_BETA_WALKTHROUGH are both OFF)
- User-visible symptoms: none (app healthy, all features flag-gated)
- Data integrity: ok (migration applied; no writes attempted by smoke tests)
- Revenue / billing: ok
What went well
- Migration 0046 applied completely before the smoke tests ran — Alembic failure would have blocked smoke; the reverse ordering meant schema was safe regardless of smoke outcome
- Health check passed immediately; dyno came up on the new code
- The release log was available via
heroku releases:outputwithout needing Heroku CLI login (API key from vault) - GH Actions run conclusion was verified via
--json conclusion(notwatch --exit-status) per SOP
What didn't go well
- The in-process smoke tests did not account for
FLAG_ENFORCE_CF_ORIGINbeing enabled on the prod dyno environment — a middleware flag that was documented as "default off, so CI smoke tests are never broken" but only stays off if the env var is absent - Two smoke files shipped with this gap (
test_beta_walk_route_smoke_3682.pyandtest_beta_walkthrough_route_smoke_3455.py), both added within the last 9 days - The Heroku "release command failed" annotation is not surfaced by the GH Actions deploy workflow — it only appears in
heroku releases, making it invisible unless explicitly checked - No automated check verifies that the Heroku release command exited 0 after a successful GH Actions deploy
Root cause analysis
- Contributing factor 1: In-process smoke
_TEST_APP_CONFIGmissingFLAG_ENFORCE_CF_ORIGINoverride — Both smoke files callcreate_app(config)whereconfigis{**_TEST_APP_CONFIG, "DATABASE": ...}._TEST_APP_CONFIGdoes not setFLAG_ENFORCE_CF_ORIGIN. The app factory inheritsFLAG_ENFORCE_CF_ORIGIN=1from the dyno environment (a prod config-var). The CF origin guard is registered as abefore_requesthook; every test-client request from127.0.0.1lacksCF-Connecting-IPand gets 403 before route logic runs. - Contributing factor 2: CF origin guard docstring claim not enforced —
cloudflare_origin_guard.pystates "default off, so local dev, CI smoke tests, and fresh deploys are never broken by this guard." This is only true ifFLAG_ENFORCE_CF_ORIGINis absent. The docstring did not mention that the guard inherits the dyno environment, and the smoke test authors relied on the default-off guarantee without explicitly pinning it. - Contributing factor 3: No post-deploy Heroku release-command exit code visibility in GH Actions — The
deploy-heroku.ymlworkflow checks/healthand declares success if 200. It does not query the Heroku release's exit code. A release command failure is silent from GH Actions perspective, making it a detection gap for future post-deploy hygiene regressions.
Detection
- What alerted us: SRE checked
heroku releasesas part of migration-head verification (step 5 of the deploy SOP) - How long between cause and detection: ~1 minute (release failure happened at 19:35; SRE queried
heroku releasesat 19:36) - How to detect faster next time: add release exit-code check to the GH Actions deploy workflow's notify job (action item 2)
Resolution
- Migration: 0046 applied. DB at correct head.
- App: healthy,
/health200, dyno up. - Smoke defect: documented below as action items — fix required before next deploy to prevent recurrence.
- No rollback taken. No operator action required for the current state.
Action items
| # | Action | Owner | Due | Issue |
|---|---|---|---|---|
| 1 | Fix _TEST_APP_CONFIG in test_beta_walk_route_smoke_3682.py and test_beta_walkthrough_route_smoke_3455.py to add "FLAG_ENFORCE_CF_ORIGIN": "0" (or equivalent env override) so in-process smoke is immune to dyno env-var leakage |
feature-developer | 2026-06-23 | file |
| 2 | Add release-command exit-code check to deploy-heroku.yml: after the health check passes, query heroku releases --app <app> --num 1 --json and fail the workflow if the latest release has status=failed |
sre-agent | 2026-06-28 | file |
| 3 | Update cloudflare_origin_guard.py docstring to clarify that "default off" means "absent from the environment" — in-process test clients must explicitly pin FLAG_ENFORCE_CF_ORIGIN=0 in their test config to be immune to dyno env leakage |
feature-developer | 2026-06-23 | on same PR as action item 1 |
References
- Runbook:
docs/ops/runbooks/raptor.md - Related incidents:
docs/incidents/2026-06-10-beta-walk-500.md,docs/incidents/2026-06-12-beta-walkthrough-500.md - Heroku release log:
heroku releases:output v154 --app raxx-api-prod - GH Actions run: 27915058139 (conclusion: success — health check passed; release command failure not surfaced)