RCA — raxx-queue-prod zero-dyno state: vcpkg shallow clone blocks all CI deploys
Incident ID: 2026-05-27-queue-deploy-vcpkg-shallow-clone
Date: 2026-05-27
Severity: SEV-2
Duration: ~9 days (first failure 2026-05-18 23:27 UTC → fix PR #2830 filed 2026-05-27)
Blast radius: raxx-queue-prod never received a container deploy; Queue prod is not serving. Staging was unaffected (last deployed correctly, web.1 up).
Author: sre-agent
Summary
Every deploy-queue.yml CI run triggered by a push to main since 2026-05-18 has failed at the build-test job. The build-test job shallow-clones vcpkg with --depth 1, but queue/vcpkg.json pins builtin-baseline to a historical SHA (3508985146f1b1d248c67ead13f8f54be5b4f5da). A shallow clone does not contain the objects for historical commits, so vcpkg aborts with "failed to unpack tree object" for every port (jwt-cpp, spdlog, gtest, nlohmann-json, sentry-native). Because build-test gates both build-container and deploy-prod, no container has ever been pushed to raxx-queue-prod (Repo Size: 0 B). The fix is to remove --depth 1 from the CI test job clone, matching what the production Dockerfile already does.
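The failure mode can be reproduced offline with a throwaway repository. The sketch below is an illustration only: it uses a local two-commit stand-in repo rather than vcpkg itself, and every path and variable name in it is hypothetical. It shows that a `--depth 1` clone lacks the older "baseline" commit while a full clone has it.

```shell
#!/usr/bin/env bash
set -euo pipefail

work="$(mktemp -d)"
trap 'rm -rf "$work"' EXIT

# Build a tiny stand-in repo with two commits; the first plays the role
# of a pinned historical baseline, the second is the current tip.
git init -q "$work/origin"
git -C "$work/origin" -c user.name=demo -c user.email=demo@example.com \
  commit -q --allow-empty -m "baseline"
baseline="$(git -C "$work/origin" rev-parse HEAD)"
git -C "$work/origin" -c user.name=demo -c user.email=demo@example.com \
  commit -q --allow-empty -m "tip"

# file:// makes git use its normal transport, so --depth is honored
# even though the source repository is on the local disk.
git clone -q --depth 1 "file://$work/origin" "$work/shallow"
git clone -q "file://$work/origin" "$work/full"

# True when the given commit object exists in the given clone.
has_commit() { git -C "$1" cat-file -e "$2^{commit}" 2>/dev/null; }

shallow_is_shallow="$(git -C "$work/shallow" rev-parse --is-shallow-repository)"
full_is_shallow="$(git -C "$work/full" rev-parse --is-shallow-repository)"

if has_commit "$work/shallow" "$baseline"; then shallow_has=yes; else shallow_has=no; fi
if has_commit "$work/full" "$baseline"; then full_has=yes; else full_has=no; fi

echo "shallow clone: is_shallow=$shallow_is_shallow baseline_present=$shallow_has"
echo "full clone:    is_shallow=$full_is_shallow baseline_present=$full_has"
```

The shallow clone should report the baseline commit as missing, which mirrors why a pinned builtin-baseline cannot be resolved from a `--depth 1` vcpkg checkout.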
Timeline (all times UTC)
- 2026-05-18 23:27 — First failure on run 26066403263 (PR #2453 merged). build-test fails at vcpkg install inside docker container.
- 2026-05-15 21:50, 05:29, 05:27, 05:19 — Four earlier failures (runs 25943119277, 25902046558, 25901974505, 25901729569) — same root cause, same shallow-clone error.
- 2026-05-19 06:31 — Operator sets config vars on raxx-queue-prod (v6), unaware CI has never deployed a container (Repo Size still 0 B).
- 2026-05-27 ~09:00 UTC — Operator reports #2719: raxx-queue-prod has no dynos. SRE-agent investigates.
- 2026-05-27 09:15 UTC — Root cause identified in GH Actions log: "vcpkg was cloned as a shallow repository. Try again with a full vcpkg clone."
- 2026-05-27 09:24 UTC — Fix committed, worktree branch pushed, PR #2830 filed.
Impact
- Users affected: 0 (pre-launch; Queue prod is not yet customer-facing)
- User-visible symptoms: None (Queue prod was not yet in the traffic path)
- Data integrity: ok
- Revenue / billing: Queue billing flag is off (`FLAG_QUEUE_BILLING=0`); no revenue impact
What went well
- Staging Queue was healthy throughout (web.1 up, /health passing), providing a reference point.
- The CI failure was clearly logged — vcpkg emitted an explicit "shallow repository" error with a fix link.
- The production Dockerfile already documents the full-clone requirement in a comment.
What didn't go well
- The CI test job and the production Dockerfile diverged silently. The Dockerfile had the correct pattern; the CI test job used `--depth 1` (appropriate for checking out code, not for vcpkg).
- No alert fired when all deploy-queue runs started failing. There is no monitor on "deploy pipeline has been red for N consecutive runs."
- The Heroku app sat at 0 dynos and nothing alarmed, not even when config vars were set on it, so the gap persisted undetected for 9 days.
Root cause analysis
- Contributing factor 1: Divergent CI/Dockerfile vcpkg clone strategy. The `build-test` job's inline bash script used `--depth 1` for the vcpkg clone (a common CI optimization for source repos). This is incorrect for vcpkg because `builtin-baseline` resolution requires traversing historical commits. The production Dockerfile already explains this and uses a full clone. The CI job was added or modified without cross-referencing the Dockerfile comment.
- Contributing factor 2: No deploy-pipeline failure monitor. Five consecutive CI failures across 9 days went undetected. No runbook check, no Sentry alert, no Heroku dyno count monitor triggered on the zero-dyno state.
- Contributing factor 3: Heroku app creation preceded first successful deploy. The app was created and config vars were set (v1–v6) before any container was ever pushed. `heroku ps` showed "No dynos", which is also what a freshly created app shows; there was no obvious signal distinguishing "never deployed" from "intentionally scaled to zero."
Detection
- What alerted us: Operator reported #2719 ("raxx-queue-prod has no Procfile — zero dynos").
- How long between cause and detection: ~9 days (first CI failure 2026-05-18, detection 2026-05-27).
- How to detect faster next time: Monitor CI run success rate on deploy-queue.yml; alert when >2 consecutive failures. Monitor `heroku ps -a raxx-queue-prod` for zero-dyno state on a service expected to be up.
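The ">2 consecutive failures" rule above could be sketched as two small shell functions. This is a minimal sketch, not an existing monitor: the function names are hypothetical, and the input is assumed to be one run conclusion per line, newest first.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Count how many of the most recent runs failed in a row.
# Input on stdin: one conclusion per line, newest first
# (for example "failure" or "success"). Counting stops at the
# first line that is not exactly "failure", including empty lines.
count_leading_failures() {
  local n=0 conclusion
  while IFS= read -r conclusion; do
    [ "$conclusion" = "failure" ] || break
    n=$((n + 1))
  done
  echo "$n"
}

# Succeed (exit 0) when the streak exceeds the threshold.
should_alert() {  # $1 = consecutive failures, $2 = threshold
  [ "$1" -gt "$2" ]
}
```

One possible feed, assuming the GitHub CLI is installed and authenticated and that it lists runs newest first, is `gh run list --workflow deploy-queue.yml --branch main --limit 10 --json conclusion --jq '.[].conclusion' | count_leading_failures`, followed by `should_alert "$streak" 2`. A run that is still in progress has an empty conclusion, so it ends the streak in this sketch; whether that is the desired behavior is a design choice.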
Resolution
- What was changed: Removed `--depth 1` from the `git clone` command in the `build-test` job's inline bash (`.github/workflows/deploy-queue.yml` line 197). PR #2830.
- Validation: After PR merges to main, push trigger will run the full vcpkg install. Successful build-test unblocks build-container and deploy-staging. Prod deploy then requires `workflow_dispatch target=prod confirm=deploy-prod-now` per ADR-0020.
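To keep this class of failure from recurring silently, the CI job could fail fast before `vcpkg install` if the clone cannot satisfy the pinned baseline. The following is a hedged sketch rather than part of PR #2830: `read_baseline` and `check_vcpkg_clone` are hypothetical helpers, and the manifest is parsed with `sed` on the assumption that `builtin-baseline` sits on a single line as a 40-character hex SHA.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print the builtin-baseline SHA from a vcpkg.json manifest.
# Assumes the key and its 40-hex-character value share one line.
read_baseline() {
  sed -n 's/.*"builtin-baseline"[[:space:]]*:[[:space:]]*"\([0-9a-f]\{40\}\)".*/\1/p' "$1" | head -n 1
}

# Succeed only if the clone is a full clone and contains the baseline commit.
check_vcpkg_clone() {  # $1 = vcpkg clone directory, $2 = manifest path
  local dir="$1" baseline
  baseline="$(read_baseline "$2")"
  if [ -z "$baseline" ]; then
    echo "no builtin-baseline found in $2" >&2
    return 2
  fi
  if [ "$(git -C "$dir" rev-parse --is-shallow-repository)" = "true" ]; then
    echo "vcpkg clone at $dir is shallow; re-clone without --depth" >&2
    return 1
  fi
  git -C "$dir" cat-file -e "${baseline}^{commit}" 2>/dev/null || {
    echo "baseline $baseline not present in $dir" >&2
    return 1
  }
}
```

In the build-test job this might run as `check_vcpkg_clone ./vcpkg queue/vcpkg.json` right after the clone, where `./vcpkg` is an assumed checkout directory. The explicit shallow check gives a clearer message than waiting for vcpkg to fail on every port.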
Action items
| # | Action | Owner | Due | Issue |
|---|---|---|---|---|
| 1 | Merge PR #2830, trigger workflow_dispatch prod deploy | operator | 2026-05-28 | #2719 |
| 2 | Write live Stripe keys to vault at /MooseQuest/Queue/prod/ | operator | 2026-05-28 | #2443 |
| 3 | Add deploy-queue CI failure monitor (alert when >2 consecutive failures on deploy-queue.yml) | sre-agent | 2026-06-03 | new |
| 4 | Add Heroku zero-dyno monitor for raxx-queue-prod (alert if web dynos == 0 for >10 min) | sre-agent | 2026-06-03 | new |
| 5 | Add comment in deploy-queue.yml build-test job cross-referencing Dockerfile vcpkg clone note | sre-agent | 2026-06-03 | new |
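Action item 4 (zero-dyno monitor) could start from something like the sketch below. It is an assumption-laden illustration: the function names are hypothetical, and it assumes the plain-text `heroku ps` output lists each running web dyno on its own line beginning with `web.N: up`. How long the app has been at zero is left to the caller to track.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Count web dynos reported as "up" in `heroku ps` text read from stdin.
# Assumes lines of the form "web.1: up ..." (output format is an assumption).
count_web_up() {
  # grep -c prints 0 but exits non-zero when nothing matches; keep going.
  grep -Ec '^web\.[0-9]+: up( |$)' || true
}

# Succeed (exit 0) when the app has had zero web dynos for longer than the limit.
should_page() {  # $1 = up web dynos, $2 = seconds spent at zero, $3 = limit in seconds
  [ "$1" -eq 0 ] && [ "$2" -gt "$3" ]
}
```

A scheduled job might run `heroku ps -a raxx-queue-prod | count_web_up`, record when the count first hit zero, and call `should_page "$up" "$seconds_at_zero" 600` to apply the 10-minute window from the action item.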
References
- PR: #2830 (fix)
- Issues: #2719 (zero-dyno report), #2443 (Stripe keys operator-action)
- GH Actions run: 26066403263 (last failure, showing the vcpkg error)
- Dockerfile comment: `queue/Dockerfile` lines 34-36
- vcpkg versioning doc: https://learn.microsoft.com/vcpkg/users/versioning-troubleshooting