RCA — internal-docs.raxx.app/flags/* soft-404 (main page served at every path)
Incident ID: 2026-05-18-internal-docs-flags-soft-404
Date: 2026-05-18
Severity: SEV-3
Duration: Ongoing from the 23:27 UTC deploy until the fix is merged (the two conflicting deploys landed ~2 min apart, at 23:25 and 23:27 UTC)
Blast radius: Internal-only. All /flags/* paths on internal-docs.raxx.app served the 88,101-byte main landing page instead of flag docs. No customer-facing impact; CF Access gating was intact.
Author: sre-agent
Summary
Two GitHub Actions workflows (deploy-flag-docs.yml and deploy-internal-docs.yml) both targeted the same Cloudflare Pages project (raxx-internal-docs). Each wrangler pages deploy creates a new immutable deployment that replaces the previous one as the production deployment. The deploy-internal-docs.yml run at 23:27 UTC overwrote the deploy-flag-docs.yml run from 23:25 UTC, erasing the /flags/* subpath content. CF Pages then served its SPA fallback (the main index.html) for every unmatched /flags/* path — a soft-404 that returned HTTP 200 with 88 KB of the wrong page.
The fix merges both builds into a single deploy: deploy-internal-docs.yml now builds flag docs into a staging directory and merges the output under dist/internal-docs/flags/ before the single wrangler pages deploy. deploy-flag-docs.yml retains its build + validate jobs as a PR-time gate but no longer deploys.
Timeline (all times UTC)
- 23:25: deploy-flag-docs.yml run #26066319545 completes successfully, sha 4c49439f. raxx-internal-docs production deployment updated; /flags/* content live.
- 23:27: deploy-internal-docs.yml run #26066403264 completes. Triggered by path filter on scripts/_pony_render.py (same sha 4c49439f). wrangler pages deploy dist/internal-docs creates a new production deployment containing only the main docs, with no /flags/ directory. CF Pages marks this as the new production deployment, replacing the flag-docs deploy from 23:25.
- 23:27+: All /flags/* paths begin returning the main index.html SPA (soft-404). curl /flags/search-index.json returns HTML, not JSON.
- 2026-05-18 (day-of): Incident reported. Confirmed via curl: every /flags/* path returns 88,101 bytes of main landing page.
- 2026-05-18: sre-agent diagnosed root cause, authored fix (PR: merge both builds into one deploy), filed this RCA.
Impact
- Users affected: 0 (internal-docs.raxx.app is CF Access gated — email allowlist)
- User-visible symptoms: none (internal-only tool)
- Data integrity: ok — no data was lost; flag docs were still buildable from YAML
- Revenue / billing: ok
What went well
- The CF Access gate meant no customer-facing exposure of the broken state.
- Both workflow runs completed successfully by their own metrics — the failure mode was architectural (two deployers, one target), not a code bug.
- The soft-404 was detectable immediately via a curl size check (/flags/search-index.json returning HTML).
What didn't go well
- The system allowed two separate workflow files to both call wrangler pages deploy against the same CF Pages project with no coordination. CF Pages is last-writer-wins, and it does so silently.
- No post-deploy smoke check verified /flags/*; only the root was smoke-tested.
- The deploy-flag-docs.yml header comment said it "publishes to raxx-internal-docs" but gave no indication that this conflicted with deploy-internal-docs.yml.
- The path filters were independent, so any push touching a path matched by both filters could trigger both workflows on the same sha, racing each other.
Root cause analysis
- Contributing factor 1: Two workflows, one CF Pages target. deploy-flag-docs.yml and deploy-internal-docs.yml both called wrangler pages deploy --project-name=raxx-internal-docs. CF Pages replaces the production deployment on each call; the later call wins. The system provided no guard against two parallel deployers on the same project.
- Contributing factor 2: Same sha triggered both workflows. Commit 4c49439f modified scripts/_pony_render.py, which is in both workflows' path filters (directly in deploy-internal-docs.yml; indirectly since _pony_render.py is shared by both scripts). Both workflows fired at the same sha 2 minutes apart, guaranteeing the race.
- Contributing factor 3: Smoke check did not cover /flags/. The post-deploy smoke in deploy-internal-docs.yml verified https://internal-docs.raxx.app/ (the root) but not /flags/_index.html. A broken /flags/ would have passed smoke and gone undetected until manual inspection.
- Contributing factor 4: No "two deployers on one project" detection. There is no CI lint or convention that prevents two workflow files from targeting the same CF Pages project name. The pattern is invisible until it races.
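Contributing factors 1 and 2 can be illustrated with a toy model. This is only a sketch of the behavior described in this RCA, not Cloudflare's implementation: the PagesProject class and the file contents are hypothetical, and the fallback rule simply encodes the observed "unmatched path returns the root index.html with HTTP 200" behavior.

```python
from dataclasses import dataclass, field


@dataclass
class PagesProject:
    """Toy model: production always points at the newest deployment."""
    deployments: list[dict[str, str]] = field(default_factory=list)

    def deploy(self, files: dict[str, str]) -> None:
        # Each deploy is a complete, immutable snapshot; nothing carries
        # over from the previous deployment.
        self.deployments.append(dict(files))

    def get(self, path: str) -> tuple[int, str]:
        production = self.deployments[-1]
        if path in production:
            return 200, production[path]
        # Assumed SPA-style fallback: unmatched paths get the root
        # index.html with status 200, which is the soft-404.
        return 200, production["/index.html"]


project = PagesProject()
# 23:25, flag-docs workflow (illustrative file set that includes /flags/).
project.deploy({"/index.html": "main", "/flags/search-index.json": "{}"})
# 23:27, internal-docs workflow: snapshot with no /flags/ directory.
project.deploy({"/index.html": "main"})

status, body = project.get("/flags/search-index.json")
print(status, body)  # 200 main
```

Neither deploy call fails in this model, which matches the observation that both workflow runs were green while the served content was wrong.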
Detection
- What alerted us: Manual curl inspection comparing expected vs actual response content and size
- How long between cause and detection: Unknown (first confirmed report 2026-05-18; race window was 23:25–23:27 UTC the prior night)
- How to detect faster next time: The merged smoke check in this PR now verifies /flags/_index.html explicitly; if it returns the main-page marker, the deploy fails. Additionally, add the /flags/ check to any future synthetic monitor that probes internal-docs.raxx.app.
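The marker comparison behind such a check might look like the sketch below. The two marker strings are the ones named in the Validation entry of this RCA; the function name, the return values, and the order of the checks are assumptions for illustration, and the actual workflow step may differ. Fetching the page is left out because the site sits behind CF Access.

```python
MAIN_MARKER = "Raxx · internal docs"   # marker expected on the root page
FLAGS_MARKER = "Feature flag index"    # marker expected on /flags/_index.html


def check_flags_page(body: str) -> str:
    """Classify the body fetched from /flags/_index.html.

    Hypothetical helper: returns "ok", "soft-404", or "unknown".
    """
    if FLAGS_MARKER in body:
        return "ok"
    if MAIN_MARKER in body:
        # The main landing page was served at a /flags/ path.
        return "soft-404"
    return "unknown"


print(check_flags_page("<title>Raxx · internal docs</title>"))  # soft-404
```

A smoke step would treat anything other than "ok" as a failed deploy. A status-code check alone would not catch this incident, since the wrong page came back with HTTP 200.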
Resolution
- What was changed:
  1. deploy-internal-docs.yml: added scripts/build-flag-docs.py and backend_v2/api/feature_flags.yaml to the push.paths filter. Added pyyaml to pip install. After build-internal-docs.py, runs build-flag-docs.py --out dist/_flag_staging, merges flags/ content into dist/internal-docs/flags/, and deploys once as a unified site. Smoke check now also verifies /flags/_index.html.
  2. deploy-flag-docs.yml: removed the deploy job entirely. Retained build (push-time validation) and ci-build-check (PR-time gate). Added "VALIDATION ONLY" header comment explaining the architectural change and pointing to this RCA.
- Validation: post-merge CI run of deploy-internal-docs.yml smoke check must pass both root (Raxx · internal docs) and /flags/ (Feature flag index) markers.
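The merge step in change 1 could be sketched as follows. This is not the code in the workflow: merge_flag_docs is a hypothetical helper, and it assumes the flag build writes a flags/ subdirectory containing _index.html under the staging directory.

```python
import shutil
from pathlib import Path


def merge_flag_docs(staging: Path, site: Path) -> Path:
    """Copy staging/flags/ into site/flags/ and return the destination."""
    src = staging / "flags"
    dst = site / "flags"
    if not src.is_dir():
        raise FileNotFoundError(f"flag docs build produced no {src}")
    # dirs_exist_ok lets the copy merge into an existing flags/ directory.
    shutil.copytree(src, dst, dirs_exist_ok=True)
    # Fail before deploying if the page the smoke check probes is missing.
    if not (dst / "_index.html").is_file():
        raise RuntimeError(f"{dst} is missing _index.html after merge")
    return dst
```

With the paths from this RCA, the call would be merge_flag_docs(Path("dist/_flag_staging"), Path("dist/internal-docs")), followed by the single wrangler pages deploy of dist/internal-docs. Failing at merge time keeps an incomplete site from ever becoming the production deployment.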
Action items
| # | Action | Owner | Due | Issue |
|---|---|---|---|---|
| 1 | Add a CI lint check that detects two workflow files targeting the same CF Pages --project-name and fails the PR | devops | 2026-06-01 | (filed below) |
| 2 | Add /flags/_index.html to synthetic monitoring probe for internal-docs.raxx.app | sre-agent | 2026-05-25 | (filed below) |
| 3 | Review all other workflow pairs for "two deployers / one Pages project" pattern | sre-agent | 2026-05-25 | (filed below) |
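
One possible shape for the lint in action item 1 is sketched below. It is an illustration, not the planned implementation: find_shared_projects is a hypothetical name, and the regex assumes the project name appears unquoted on the same line as the pages deploy command, so quoted values or multi-line commands would need a real YAML parse.

```python
import re
from collections import defaultdict

# Matches "pages deploy ... --project-name=NAME" or "--project-name NAME"
# on a single line (with or without a leading "wrangler").
PROJECT_FLAG = re.compile(r"pages\s+deploy\b[^\n]*?--project-name[=\s]+([\w-]+)")


def find_shared_projects(workflows: dict[str, str]) -> dict[str, list[str]]:
    """Return Pages project names deployed by more than one workflow file.

    `workflows` maps a workflow filename to its text content.
    """
    deployers = defaultdict(set)
    for filename, text in workflows.items():
        for project in PROJECT_FLAG.findall(text):
            deployers[project].add(filename)
    return {p: sorted(files) for p, files in deployers.items() if len(files) > 1}
```

A CI job could load every file under .github/workflows/ into the dictionary and fail the PR whenever the result is non-empty. Applied to the pre-fix workflows, it would be expected to report raxx-internal-docs with both deploy-flag-docs.yml and deploy-internal-docs.yml.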
References
- Runbook: docs/ops/runbooks/cloudflare-pages.md (to be created if it does not exist)
- Related PR: deploy unification PR (this fix)
- CF Pages docs: Deployments are immutable; the latest deployment to a branch becomes the production URL: https://developers.cloudflare.com/pages/configuration/deployments/