RCA — Queue billing schema absent from prod: sqitch.plan format + release-phase pipeline mismatch
Incident ID: 2026-06-17-queue-billing-sqitch-plan-format Date: 2026-06-17 Severity: SEV-2 Duration: ~3.5h total (13:20 UTC first signal → 15:07 UTC first 2xx webhook verified) Blast radius: raxx-queue-prod had no billing tables; every Stripe webhook with a valid signature resulted in a DB 500 after signature verification passed; Queue billing was non-functional at go-live. Author: sre-agent
Summary
Two independent bugs prevented billing schema from being applied to the prod database on every Container Registry deploy since Queue launched. First, queue/migrations/sqitch/sqitch.plan had timestamps wrapped in brackets ([2026-05-11T00:00:00Z]) — sqitch 1.3.1 parses [...] after a change name as a dependency list, not a timestamp. This caused "Invalid name" syntax error and sqitch exited 0 with no changes deployed. Second, the heroku.yml release phase command is not honored when deploying via heroku container:release — the Container Registry and heroku.yml pipelines are mutually exclusive; the release command was silently skipped on every deploy. PR #3620 (merged 2026-06-17) correctly copied migrations into the Docker image, which was a necessary prerequisite, but the sqitch.plan format bug and release-phase pipeline gap meant migrations still did not apply. Resolution: sqitch.plan timestamps corrected (brackets removed), deploy workflow updated to run heroku run sqitch deploy explicitly after container:release, and migrations applied manually via heroku run on 2026-06-17 to restore prod.
Timeline (all times UTC)
- 13:20 — PR #3620 (
fix(queue): copy sqitch migrations into runtime image) CI Docker smoke started on commitdbaab1e. - 13:50 — Docker smoke passed green (30 min, vcpkg full build). Sprint readiness gate failed — known flaky gate, pre-existing vault 403. All real gates green.
- 13:50 — Admin-merge of #3620 (squash). Merge commit
2f875326. - 13:51 —
deploy-queue.yml workflow_dispatchtriggered for prod (target=prod, confirm=deploy-prod-now, ref=main). - 14:18 — build-test completed (27 min, vcpkg cache miss on workflow_dispatch).
- 14:18 — build-container started (Docker image rebuild with correct migrations COPY).
- 14:43 — build-container completed. deploy-prod started.
- 14:44 — v15 released to raxx-queue-prod (image
8d5a539a3ef3). Dyno restarted and came up clean./health200. - 14:50 —
\dt billing_*query: "Did not find any tables named billing_*". Migrations did not apply. Investigation begins. - 14:52 —
heroku releases:output v15: "no release output available". Release phase output_stream_url is null. Heroku API confirms release status=succeeded but no release dyno ran. - 14:55 — Confirmed: Heroku Container Registry deploys (
heroku container:release) do not executeheroku.ymlrelease phase. The two are separate deploy pipelines. - 15:00 —
heroku run sqitch deploy --verify --chdir /app/migrations/sqitch db:pg:$DATABASE_URLfails: "Syntax error in sqitch.plan at line 5: Invalid name; names must not begin with punctuation..." - 15:03 — Root cause identified: timestamps in sqitch.plan are wrapped in
[...]. Sqitch treats[...]after a change name as a dependency list. Without a parsed timestamp, the line fails validation and sqitch exits non-zero. - 15:05 — Sed one-liner fixes sqitch.plan in-place in the running container. Sqitch re-run with
db:pg://URI (converted frompostgres://DATABASE_URL). All 6 migrations deployed and verified. - 15:06 —
\dt billing_*confirms 6 tables.\dt processed_stripe_eventsconfirms 7th table. - 15:07 — Test webhook with valid HMAC signature sent to
POST /api/v1/billing/webhook. HTTP 200{"received":true}. Event recorded inprocessed_stripe_events. - 15:08 — SEV-2 resolved. Fix-forward PR #3635 opened for sqitch.plan + workflow.
Impact
- Users affected: 0 (pre-launch; no real customers)
- User-visible symptoms: none (Stripe webhooks were failing server-side with no customer impact)
- Data integrity: ok (no data was written, no corruption; schema is now clean)
- Revenue / billing: Queue billing was non-functional from launch (v11 deploy 2026-06-16T23:20Z) through resolution (2026-06-17T15:07Z) — ~15h47m window. No real charges were processed; test events only.
What went well
- PR #3620 correctly identified and fixed the
.dockerignoreexclusion formigrations/— a necessary prerequisite. - The
heroku runpath was available as a manual migration channel without requiring a new deploy. - Sqitch's error message was precise enough to diagnose the plan format bug in ~3 minutes.
- The webhook HMAC verification code was correct — once tables existed, the first signed test event succeeded.
- No data corruption from the partial-schema state. The app's C++ code returned 500 only after signature verification passed, so no unsigned events were accepted.
What didn't go well
- The sqitch.plan file was hand-authored with an incorrect timestamp format (brackets). No local sqitch validation ran before merge.
- The
heroku.ymlrelease phase assumption was wrong for Container Registry deploys. No documentation in the repo mentioned this limitation until now. - The deploy workflow had no post-deploy schema verification step. A
\dt billing_*check after release would have caught this immediately. - Two separate bugs (plan format + pipeline mismatch) compounded — either one alone would have prevented migration application, so fixing only one would still fail.
- The
db:pg://vspostgres://URI difference for sqitch was not documented in the repo.
Root cause analysis
-
Contributing factor 1: sqitch.plan timestamp brackets —
queue/migrations/sqitch/sqitch.planwas hand-authored with timestamps wrapped in[TIMESTAMP]. In the sqitch plan format,[...]after a change name denotes a dependency list (requires/conflicts), not a timestamp. The timestamp must appear without brackets. Sqitch 1.3.1 parses the bracketed value as a dependency name containing:andT, which are not valid in dependency names... but the actual failure path is: with[...]parsed as dependencies,$params{yr}(year) is never captured; the$linestill contains aYYYY-MM-DDstring; the error check!$params{yr} && $line =~ $ts_refires. Sqitch reported "Invalid name" rather than a more specific timestamp error, delaying diagnosis. -
Contributing factor 2: heroku.yml release phase not honored by container:release — The
heroku.ymlrelease:command is only executed when deploying viagit push heroku main(the heroku.yml build pipeline). When deploying viaheroku container:push+heroku container:release, Heroku does not readheroku.ymlfor release commands. The deploy workflow usedheroku container:releasefrom day one. The release phase never executed on any deploy. This is a Heroku platform constraint that is not prominently documented. -
Contributing factor 3: no post-deploy schema verification — The deploy workflow had a
/healthcheck but no schema verification step. A\dt billing_*query after release would have surfaced the absence of tables immediately and triggered investigation before go-live. -
Contributing factor 4: sqitch.conf target uses unexpanded $DATABASE_URL —
sqitch.confhastarget = db:pg:$DATABASE_URL. Sqitch does not expand shell variables from config files. When the command line specifies the URI, it must also be expanded. Theheroku.ymlrelease commandsqitch deploy ... db:pg:$DATABASE_URLrelied on shell expansion inside the container, which worked in theory — but since the release phase never executed, this was never tested.
Detection
- What alerted us: manual
\dt billing_*query during go-live verification sequence. - How long between cause and detection: ~15h47m (v11 deployed 2026-06-16T23:20 UTC; schema absence confirmed 2026-06-17T14:50 UTC after v15 deploy).
- How to detect faster next time: add a post-deploy schema verification step to
deploy-queue.ymlthat queriesbilling_*tables and fails the workflow if any are absent.
Resolution
- Manual:
heroku run "sed -i '...' /app/migrations/sqitch/sqitch.plan && sqitch deploy --verify ..."on raxx-queue-prod applied all 6 migrations. - Durable fix: PR #3635 — removes brackets from sqitch.plan timestamps; adds
heroku run sqitch deploysteps to deploy-queue.yml for both staging and prod aftercontainer:release.
Action items
| # | Action | Owner | Due | Issue |
|---|---|---|---|---|
| 1 | Merge PR #3635 (sqitch.plan format + workflow sqitch step) | operator | 2026-06-18 | #3635 |
| 2 | Add post-deploy \dt billing_* verification step to deploy-queue.yml |
sre-agent (in #3635) | 2026-06-18 | #3635 |
| 3 | Add sqitch plan format validation to queue-docker-smoke.yml (parse sqitch.plan and verify all change names have timestamps) | sre-agent | 2026-06-20 | TBD |
| 4 | Document Heroku Container Registry + heroku.yml limitation in queue runbook | sre-agent | 2026-06-20 | TBD |
| 5 | Verify staging billing schema via same heroku run path (raxx-queue-staging) | sre-agent | 2026-06-18 | TBD |
References
- Fix PR: https://github.com/raxx-app/TradeMasterAPI/pull/3635
- Prerequisite PR: https://github.com/raxx-app/TradeMasterAPI/pull/3620
- Prior C++ build failure RCA:
docs/ops/incidents/2026-06-17-queue-ci-cpp-build-failures.md - Heroku Container Registry docs:
https://devcenter.heroku.com/articles/container-registry-and-runtime - Sqitch plan format:
https://sqitch.org/docs/manual/sqitchplan/