RCA — raxx-queue-prod zero-dyno state: vcpkg shallow clone blocks all CI deploys

Incident ID: 2026-05-27-queue-deploy-vcpkg-shallow-clone Date: 2026-05-27 Severity: SEV-2 Duration: ~9 days (first failure 2026-05-18 23:27 UTC → fix PR #2830 filed 2026-05-27) Blast radius: raxx-queue-prod never received a container deploy; Queue prod is not serving. Staging was unaffected (last deployed correctly, web.1 up). Author: sre-agent

Summary

Every deploy-queue.yml CI run triggered by a push to main since 2026-05-18 has failed at the build-test job. The build-test job shallow-clones vcpkg with --depth 1, but queue/vcpkg.json pins builtin-baseline to a historical SHA (3508985146f1b1d248c67ead13f8f54be5b4f5da). A shallow clone cannot resolve historical tree objects, causing vcpkg to abort with "failed to unpack tree object" for every port (jwt-cpp, spdlog, gtest, nlohmann-json, sentry-native). Because build-test gates both build-container and deploy-prod, no container has ever been pushed to raxx-queue-prod — Repo Size: 0 B. Fix is to remove --depth 1 from the CI test job clone, matching what the production Dockerfile already does.

Timeline (all times UTC)

2026-05-18 23:27 — First failure on run 26066403263 (PR #2453 merged). build-test fails at vcpkg install inside docker container.
2026-05-15 21:50, 05:29, 05:27, 05:19 — Four earlier failures (runs 25943119277, 25902046558, 25901974505, 25901729569) — same root cause, same shallow-clone error.
2026-05-19 06:31 — Operator sets config vars on raxx-queue-prod (v6), unaware CI has never deployed a container (Repo Size still 0 B).
2026-05-27 ~09:00 UTC — Operator reports #2719: raxx-queue-prod has no dynos. SRE-agent investigates.
2026-05-27 09:15 UTC — Root cause identified in GH Actions log: "vcpkg was cloned as a shallow repository. Try again with a full vcpkg clone."
2026-05-27 09:24 UTC — Fix committed, worktree branch pushed, PR #2830 filed.

Impact

Users affected: 0 (pre-launch; Queue prod is not yet customer-facing)
User-visible symptoms: None (Queue prod was not yet in the traffic path)
Data integrity: ok
Revenue / billing: Queue billing flag is off (FLAG_QUEUE_BILLING=0); no revenue impact

What went well

Staging Queue was healthy throughout (web.1 up, /health passing), providing a reference point.
The CI failure was clearly logged — vcpkg emitted an explicit "shallow repository" error with a fix link.
The production Dockerfile already documents the full-clone requirement in a comment.

What didn't go well

The CI test job and the production Dockerfile diverged silently. The Dockerfile had the correct pattern; the CI test job used --depth 1 (appropriate for checking out code, not for vcpkg).
No alert fired when all deploy-queue runs started failing. There is no monitor on "deploy pipeline has been red for N consecutive runs."
The Heroku app showed 0 dynos with no config-set-triggered alarms, allowing the gap to persist for 9 days undetected.

Root cause analysis

Contributing factor 1: Divergent CI/Dockerfile vcpkg clone strategy — The build-test job's inline bash script used --depth 1 for the vcpkg clone (a common CI optimization for source repos). This is incorrect for vcpkg because builtin-baseline resolution requires traversing historical commits. The production Dockerfile already explains this and uses a full clone. The CI job was added or modified without cross-referencing the Dockerfile comment.
Contributing factor 2: No deploy-pipeline failure monitor — Five consecutive CI failures across 9 days went undetected. No runbook check, no Sentry alert, no Heroku dyno count monitor triggered on the zero-dyno state.
Contributing factor 3: Heroku app creation preceded first successful deploy — The app was created and config vars were set (v1–v6) before any container was ever pushed. heroku ps showed "No dynos" which is also what a freshly-created app shows — there was no obvious signal distinguishing "never deployed" from "intentionally scaled to zero."

Detection

What alerted us: Operator reported #2719 ("raxx-queue-prod has no Procfile — zero dynos").
How long between cause and detection: ~9 days (first CI failure 2026-05-18, detection 2026-05-27).
How to detect faster next time: Monitor CI run success rate on deploy-queue.yml; alert when >2 consecutive failures. Monitor heroku ps -a raxx-queue-prod for zero-dyno state on a service expected to be up.

Resolution

What was changed: Removed --depth 1 from the git clone command in the build-test job's inline bash (.github/workflows/deploy-queue.yml line 197). PR #2830.
Validation: After PR merges to main, push trigger will run the full vcpkg install. Successful build-test unblocks build-container and deploy-staging. Prod deploy then requires workflow_dispatch target=prod confirm=deploy-prod-now per ADR-0020.

Action items

#	Action	Owner	Due	Issue
1	Merge PR #2830, trigger workflow_dispatch prod deploy	operator	2026-05-28	#2719
2	Write live Stripe keys to vault at /MooseQuest/Queue/prod/	operator	2026-05-28	#2443
3	Add deploy-queue CI failure monitor (alert when >2 consecutive failures on deploy-queue.yml)	sre-agent	2026-06-03	new
4	Add Heroku zero-dyno monitor for raxx-queue-prod (alert if web dynos == 0 for >10 min)	sre-agent	2026-06-03	new
5	Add comment in deploy-queue.yml build-test job cross-referencing Dockerfile vcpkg clone note	sre-agent	2026-06-03	new

References

PR: #2830 (fix)
Issues: #2719 (zero-dyno report), #2443 (Stripe keys operator-action)
GH Actions run: 26066403263 (last failure, showing the vcpkg error)
Dockerfile comment: queue/Dockerfile line 34-36
vcpkg versioning doc: https://learn.microsoft.com/vcpkg/users/versioning-troubleshooting