RCA — Queue staging deploy blocked by sqitch registry mismatch

Incident ID: 2026-06-23-queue-staging-sqitch-registry-mismatch Date: 2026-06-23 Severity: SEV-2 Duration: ~45m (detection via CI alert → PR merged) Blast radius: Queue staging deploy pipeline blocked; prod unaffected (prod deploy is manual-only) Author: sre-agent

Summary

The Queue auto-deploy to staging (run 28057641389) failed because billing_customer already existed on the staging DB but was not in Sqitch's registry. Sqitch tried to re-run 01-billing-schema and hit ERROR: relation "billing_customer" already exists. The root fix was to make all sqitch deploy scripts idempotent (CREATE TABLE IF NOT EXISTS, etc.) so re-execution is safe regardless of prior state. PR #3788 carries the fix.

Timeline (all times UTC)

22:03 — deploy-queue.yml run 28057641389 starts; container:release to raxx-queue-staging succeeds.
22:03 — "Run sqitch migrations" step starts via heroku run.
22:03 — Sqitch connects to staging DB, attempts 01-billing-schema, receives ERROR: relation "billing_customer" already exists at line 44 of the deploy script.
22:03 — Sqitch exits 3; heroku run propagates exit code 2; CI job fails.
22:03 — "Deploy summary" job posts failure comment; "Deploy to production" skipped correctly.
22:35 — sre-agent triages run 28057641389, confirms root cause.
22:45 — PR #3788 opened (fix/queue-sqitch-01-idempotent): idempotent DDL in scripts 01-06.
22:45 — PR #3638 rebased onto main (was BLOCKED, needed rebase, no code conflict).
22:50 — PR #3615 rebased onto main (was CONFLICTING, 3 files, resolved).

Impact

Users affected: 0 (Queue billing is not yet live for customers; staging only)
User-visible symptoms: none
Data integrity: ok (no data was modified; tables were already correct)
Revenue / billing: ok

What went well

Failure was caught at CI gate, not at prod runtime.
The existing deploy-queue.yml "Deploy to production" step correctly skips when staging fails — prod was never at risk.
Run log was complete and unambiguous; root cause identified in under 5 minutes.
The heroku run --exit-code flag propagated the sqitch exit code correctly.

What didn't go well

Sqitch deploy scripts used bare CREATE TABLE without IF NOT EXISTS — a re-entrant deploy (common operational scenario) was guaranteed to fail.
The staging DB accumulated schema state outside Sqitch's registry (likely from a prior container:release where sqitch itself succeeded but its registry row was lost in a crash, OR from a prior partial manual apply). No runbook entry existed for this failure mode.
Neither the Docker smoke job nor any pre-merge gate checks sqitch idempotency.

Root cause analysis

Contributing factor 1: non-idempotent deploy scripts — deploy/01-billing-schema.sql used CREATE TABLE billing_customer without IF NOT EXISTS. Sqitch's behavior on a registry mismatch is to re-run the deploy script verbatim. The 42P07 duplicate_table error then exits the script non-zero, which Sqitch propagates as a deploy failure. This is the proximate cause.
Contributing factor 2: staging registry drift — The staging DB had billing_customer and its sibling tables, but Sqitch's sqitch.changes table did not contain any of migrations 01-07. This means either: (a) a prior sqitch deploy succeeded for the DDL but the transaction that writes the registry row was not committed (a crash mid-deploy), or (b) the tables were applied via a direct psql run outside Sqitch. The system allowed this drift because there was no monitoring for registry-vs-schema consistency.
Contributing factor 3: no documented remediation — The queue runbook (Failure mode A-H) had no entry for this failure pattern. Sre-agent had to diagnose from first principles rather than following a documented path.

Detection

What alerted us: deploy-queue-failure-monitor.yml would have fired after N consecutive failures; in this case, direct CI inspection on operator request detected it first.
Time between cause and detection: <1 minute (CI shows real-time log output).
How to detect faster next time: Failure mode I now documented in runbook. The existing streak monitor will alert on repeated failures.

Resolution

What was changed: Deploy scripts 01-06 updated with IF NOT EXISTS / CREATE OR REPLACE / DROP TRIGGER IF EXISTS guards. PR #3788 merged to main; next push to main that touches queue/** will re-trigger staging deploy and Sqitch will succeed.
Validation: CI checks on PR #3788 pass (Queue Docker smoke, PR Gates, PII scan). After merge, the next staging deploy run is the authoritative confirmation — verify with gh run view --json conclusion.

Action items

#	Action	Owner	Due	Issue
1	Merge PR #3788 (idempotent sqitch deploy scripts)	operator	2026-06-23	#3788
2	Merge PR #3638 (queue_customer_id nullable) after staging re-deploy confirms green	operator	2026-06-24	#3638
3	Add sqitch registry-vs-schema consistency check to nightly BCP smoke	sre-agent	2026-06-30	(file follow-up)

References

Runbook: docs/ops/runbooks/queue.md (Failure mode I)
Failing run: https://github.com/raxx-app/TradeMasterAPI/actions/runs/28057641389
Fix PR: https://github.com/raxx-app/TradeMasterAPI/pull/3788
Related PR: https://github.com/raxx-app/TradeMasterAPI/pull/3638