RCA — Queue staging deploy blocked by sqitch registry mismatch
Incident ID: 2026-06-23-queue-staging-sqitch-registry-mismatch Date: 2026-06-23 Severity: SEV-2 Duration: ~45m (detection via CI alert → PR merged) Blast radius: Queue staging deploy pipeline blocked; prod unaffected (prod deploy is manual-only) Author: sre-agent
Summary
The Queue auto-deploy to staging (run 28057641389) failed because billing_customer already existed on the staging DB but was not in Sqitch's registry. Sqitch tried to re-run 01-billing-schema and hit ERROR: relation "billing_customer" already exists. The root fix was to make all sqitch deploy scripts idempotent (CREATE TABLE IF NOT EXISTS, etc.) so re-execution is safe regardless of prior state. PR #3788 carries the fix.
Timeline (all times UTC)
- 22:03 —
deploy-queue.ymlrun 28057641389 starts;container:releasetoraxx-queue-stagingsucceeds. - 22:03 — "Run sqitch migrations" step starts via
heroku run. - 22:03 — Sqitch connects to staging DB, attempts
01-billing-schema, receivesERROR: relation "billing_customer" already existsat line 44 of the deploy script. - 22:03 — Sqitch exits 3;
heroku runpropagates exit code 2; CI job fails. - 22:03 — "Deploy summary" job posts failure comment; "Deploy to production" skipped correctly.
- 22:35 — sre-agent triages run 28057641389, confirms root cause.
- 22:45 — PR #3788 opened (
fix/queue-sqitch-01-idempotent): idempotent DDL in scripts 01-06. - 22:45 — PR #3638 rebased onto main (was
BLOCKED, needed rebase, no code conflict). - 22:50 — PR #3615 rebased onto main (was
CONFLICTING, 3 files, resolved).
Impact
- Users affected: 0 (Queue billing is not yet live for customers; staging only)
- User-visible symptoms: none
- Data integrity: ok (no data was modified; tables were already correct)
- Revenue / billing: ok
What went well
- Failure was caught at CI gate, not at prod runtime.
- The existing
deploy-queue.yml"Deploy to production" step correctly skips when staging fails — prod was never at risk. - Run log was complete and unambiguous; root cause identified in under 5 minutes.
- The
heroku run --exit-codeflag propagated the sqitch exit code correctly.
What didn't go well
- Sqitch deploy scripts used bare
CREATE TABLEwithoutIF NOT EXISTS— a re-entrant deploy (common operational scenario) was guaranteed to fail. - The staging DB accumulated schema state outside Sqitch's registry (likely from a prior
container:releasewhere sqitch itself succeeded but its registry row was lost in a crash, OR from a prior partial manual apply). No runbook entry existed for this failure mode. - Neither the Docker smoke job nor any pre-merge gate checks sqitch idempotency.
Root cause analysis
-
Contributing factor 1: non-idempotent deploy scripts —
deploy/01-billing-schema.sqlusedCREATE TABLE billing_customerwithoutIF NOT EXISTS. Sqitch's behavior on a registry mismatch is to re-run the deploy script verbatim. The42P07 duplicate_tableerror then exits the script non-zero, which Sqitch propagates as a deploy failure. This is the proximate cause. -
Contributing factor 2: staging registry drift — The staging DB had
billing_customerand its sibling tables, but Sqitch'ssqitch.changestable did not contain any of migrations 01-07. This means either: (a) a priorsqitch deploysucceeded for the DDL but the transaction that writes the registry row was not committed (a crash mid-deploy), or (b) the tables were applied via a directpsqlrun outside Sqitch. The system allowed this drift because there was no monitoring for registry-vs-schema consistency. -
Contributing factor 3: no documented remediation — The queue runbook (Failure mode A-H) had no entry for this failure pattern. Sre-agent had to diagnose from first principles rather than following a documented path.
Detection
- What alerted us:
deploy-queue-failure-monitor.ymlwould have fired after N consecutive failures; in this case, direct CI inspection on operator request detected it first. - Time between cause and detection: <1 minute (CI shows real-time log output).
- How to detect faster next time: Failure mode I now documented in runbook. The existing streak monitor will alert on repeated failures.
Resolution
- What was changed: Deploy scripts 01-06 updated with
IF NOT EXISTS/CREATE OR REPLACE/DROP TRIGGER IF EXISTSguards. PR #3788 merged to main; next push to main that touchesqueue/**will re-trigger staging deploy and Sqitch will succeed. - Validation: CI checks on PR #3788 pass (Queue Docker smoke, PR Gates, PII scan). After merge, the next staging deploy run is the authoritative confirmation — verify with
gh run view --json conclusion.
Action items
| # | Action | Owner | Due | Issue |
|---|---|---|---|---|
| 1 | Merge PR #3788 (idempotent sqitch deploy scripts) | operator | 2026-06-23 | #3788 |
| 2 | Merge PR #3638 (queue_customer_id nullable) after staging re-deploy confirms green | operator | 2026-06-24 | #3638 |
| 3 | Add sqitch registry-vs-schema consistency check to nightly BCP smoke | sre-agent | 2026-06-30 | (file follow-up) |
References
- Runbook:
docs/ops/runbooks/queue.md(Failure mode I) - Failing run:
https://github.com/raxx-app/TradeMasterAPI/actions/runs/28057641389 - Fix PR:
https://github.com/raxx-app/TradeMasterAPI/pull/3788 - Related PR:
https://github.com/raxx-app/TradeMasterAPI/pull/3638