RCA — CI baseline degradation blocking merge queue (5 PRs, 5 failure classes)
- Incident ID: 2026-05-13-ci-baseline-degradation-merge-queue
- Date: 2026-05-13
- Severity: SEV-2
- Duration: ~27h total (first blocked PR opened 2026-05-13T16:46Z → fix PRs open 2026-05-13T20:15Z; merge pending CI)
- Blast radius: All 5 open PRs blocked (#1998, #1999, #2000, #2001, #2002); no user-facing production impact
- Author: sre-agent
Summary
Five CI failure classes accumulated on main over several weeks; all of them predate the five blocked PRs. With 9 days to v1 launch (2026-05-23 UTC), the blocked merge queue was classified SEV-2. Root causes: the gitleaks FP allowlist missed three subdirectory path classes and two pre-restructure test paths; the `backend-tests-postgres` job was never updated to create the `raptor_app` database role, a fix that `ci-pr.yml` already carried; the Velvet editable install fails under newer setuptools because of the `package_dir = velvet = .` indirection; console tests transitively import numpy through the Raptor app factory; and a test-local session expiry used modulo arithmetic that produces past timestamps after 16:00 UTC. Two fix PRs (#2004 gitleaks, #2006 CI fixes) address all five classes.
Timeline (all times UTC)
- 2026-05-13 ~16:46 — PR #1998 opened; CI runs show 5 failures: gitleaks, Postgres, console tests, OpenAPI drift, Velvet
- 2026-05-13 ~16:55 — PRs #1999, #2000, #2001, #2002 opened; all show same failure pattern
- 2026-05-13 ~19:33 — PR #2004 (gitleaks FP allowlist) opened and pushed
- 2026-05-13 ~19:33 — PR #2003 (pre-launch punch list) opened (incidental, not related to block)
- 2026-05-13 ~20:00 — SRE investigation begins; root cause diagnosis for all 5 failure classes completed
- 2026-05-13 ~20:15 — PR #2006 (CI baseline fixes) opened
- 2026-05-13 ~20:30 — Velvet install failure discovered on #2006 CI run (editable install PEP 517 egg-info assertion); fix committed (non-editable install)
- Pending — #2004 and #2006 merge; rebase #1998–#2002
Impact
- Users affected: 0 (pre-launch)
- User-visible symptoms: none (all PRs are internal infrastructure or security fixes)
- Data integrity: ok
- Revenue / billing: ok (pre-launch)
What went well
- All 5 root causes were correctly diagnosed before attempting any fix
- `ci-pr.yml` already had the `raptor_app` role fix; the same pattern was ported to `ci.yml`
- Gitleaks full-history validation locally (gitleaks 8.30.1) confirmed 0 leaks before pushing #2004
- OpenAPI drift check passed locally before pushing #2006
- The session expiry bug was time-of-day-dependent; diagnosing from code rather than intermittent CI failures avoided chasing a flaky-test red herring
What didn't go well
- `ci.yml` (`backend-tests-postgres` job) and `ci-pr.yml` (migration gate) were patched at different times; the `raptor_app` role fix from `ci-pr.yml` was never backported to `ci.yml`
- No automated check verifies that the two Postgres CI jobs (in `ci.yml` and `ci-pr.yml`) have equivalent pre-migration setup steps
- Gitleaks FP allowlist uses both path-glob and regex approaches; new path classes (`docs/ops/sre-reports/`, `docs/ops/triage/`) fell outside the existing depth-1 glob `^docs/ops/[^/]+\.md$` without anyone noticing
- Velvet editable install (`pip install -e velvet/`) was present in `ci.yml` since at least PR #1129; it worked until setuptools behaviour changed and the egg-info assertion tightened; no CI test verified the install path after a setuptools upgrade
- `test_deploy_audit_events_741.py` used `now.replace(hour=(now.hour + 8) % 24)`, a clock-wrapping calculation that only fails after 16:00 UTC; the bug survived code review because it looks superficially correct
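
To illustrate why the depth-1 pattern missed the new directories, the sketch below checks `^docs/ops/[^/]+\.md$` against a few sample paths. The file names and the `ANY_DEPTH` alternative are hypothetical and are not the entries added in #2004. Gitleaks evaluates patterns with Go's regex engine, but for patterns this simple Python's `re` module should behave the same way.

```python
import re

# Pattern quoted in this report (added by #1819): it only matches
# Markdown files sitting directly under docs/ops/.
DEPTH_1 = re.compile(r"^docs/ops/[^/]+\.md$")

# Hypothetical alternative that also accepts nested subdirectories.
# Shown for illustration only; it is not the pattern used in #2004.
ANY_DEPTH = re.compile(r"^docs/ops/(?:[^/]+/)*[^/]+\.md$")

# Sample paths; the file names are made up for this example.
paths = [
    "docs/ops/runbook.md",
    "docs/ops/sre-reports/example.md",
    "docs/ops/triage/example.md",
]


def allowlisted(pattern, path):
    """Return True when the allowlist pattern matches the path."""
    return pattern.search(path) is not None


for path in paths:
    print(path, allowlisted(DEPTH_1, path), allowlisted(ANY_DEPTH, path))
```

The depth-1 pattern accepts only the first path because `[^/]+` cannot cross a directory separator, so any file one level deeper falls outside the allowlist.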
Root cause analysis
- Contributing factor 1: Dual Postgres CI jobs with divergent pre-migration setup. `ci.yml` and `ci-pr.yml` both have ephemeral Postgres services. `ci-pr.yml` received a `raptor_app` role creation step when migration 0002 was written; `ci.yml` did not. No process required both files to be updated together when migration 0002 landed.
- Contributing factor 2: Gitleaks allowlist depth-1 glob. The #1819 fix added `^docs/ops/[^/]+\.md$` (depth-1 only). New ops subdirectories (`sre-reports/`, `triage/`) accumulated findings at depth-2 with no automated alert until they blocked PRs.
- Contributing factor 3: Velvet package_dir indirection with newer setuptools. The `velvet = .` mapping in `setup.cfg` works for `pip install .` (non-editable), but the setuptools PEP 517 editable backend asserts exactly one `.egg-info` directory and fails when the mapping causes `find_packages()` to produce 0 results. CI has no version pin on setuptools.
- Contributing factor 4: Console test job missing transitive numpy dependency. The console test job installs only `console/requirements.txt`, but some console tests construct a Raptor app fixture (`_make_raptor_app()`) which imports `backtest.py` → `numpy`. This import was always present; the test job was never updated after `backtest.py` gained the numpy dependency.
- Contributing factor 5: Clock-wrapping session expiry in test fixture. `now.replace(hour=(now.hour + 8) % 24)` silently inverts the session: after 16:00 UTC the result is a time earlier in the same day, making the session appear already expired. `write_audit()` compounded the failure: it called `request.headers.get()` outside a request context and silently swallowed the resulting `RuntimeError`, so the audit assertion failed on 0 rows.
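
The clock-wrapping behaviour in contributing factor 5 can be reproduced in isolation. The sketch below compares the expression quoted above with the `timedelta` form used in the fix; the function names and the two sample timestamps are chosen for illustration and do not come from the test file.

```python
from datetime import datetime, timedelta, timezone


def buggy_expiry(now):
    """Expression quoted in this report: wraps the hour but keeps the date."""
    return now.replace(hour=(now.hour + 8) % 24)


def fixed_expiry(now):
    """Expression used in the fix: real duration arithmetic, date rolls over."""
    return now + timedelta(hours=8)


# Illustrative timestamps, one before and one after 16:00 UTC.
morning = datetime(2026, 5, 13, 9, 30, tzinfo=timezone.utc)
evening = datetime(2026, 5, 13, 17, 30, tzinfo=timezone.utc)

for now in (morning, evening):
    # Prints whether each computed expiry lies in the future relative to `now`.
    print(now.isoformat(), buggy_expiry(now) > now, fixed_expiry(now) > now)
```

Before 16:00 UTC the two expressions agree, which is why the failure looked intermittent. From 16:00 UTC onward the modulo wraps the hour while `replace()` leaves the date untouched, so the computed expiry lands earlier on the same day.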
Detection
- What alerted us: Operator report — 5 blocked PRs with CI failures
- How long between cause and detection: weeks (gitleaks FPs accumulated since the #1819 merge ~2026-05-09; Postgres issue since migration 0002 landing; Velvet since setuptools upgrade; console numpy since `backtest.py` gained numpy)
- How to detect faster next time: see action items
Resolution
- PR #2004: Extends the `.gitleaks.toml` allowlist with 3 regex patterns and 6 path entries covering all 5 confirmed FP classes. Validated locally: `gitleaks 8.30.1 --no-git=false` full-history scan → 0 leaks.
- PR #2006: Five-fix bundle in `ci.yml`, `ci-console.yml`, `backend_v2/api/openapi.yaml`, `console/app/services/audit.py`, `console/tests/test_deploy_audit_events_741.py`.
  - `ci.yml`: `CREATE ROLE raptor_app NOLOGIN` step before Alembic; non-editable `pip install .` for Velvet
  - `ci-console.yml`: `pip install -r ../backend_v2/requirements.txt` in console test job
  - `openapi.yaml`: 13 missing routes documented (drift check: 0 issues)
  - `audit.py`: `has_request_context()` guards on `request.headers` and `_get_ip()`
  - `test_deploy_audit_events_741.py`: `expires_at=now + timedelta(hours=8)`
- Validation: OpenAPI drift check passes locally; gitleaks scan passes locally; no new regressions expected in console tests
Action items
| # | Action | Owner | Due | Issue |
|---|---|---|---|---|
| 1 | Add a lint-rule or comment-gate requiring both `ci.yml` and `ci-pr.yml` Postgres setup sections to be updated atomically when a new migration adds a GRANT | operator | 2026-05-20 | file after #2006 merges |
| 2 | Add gitleaks CI step that also runs on push to main (not just PRs) so FP accumulation is caught at merge time, not blocking the next PR | sre-agent | 2026-05-20 | file after #2006 merges |
| 3 | Pin setuptools version in the Velvet CI job (or add a `pip install setuptools>=X` guard) to catch setuptools regressions in a dedicated dependency-audit PR rather than breaking smoke tests | operator | 2026-05-20 | file after #2006 merges |
| 4 | Add `timedelta` as a required import to the console test fixture linting pass (advisory ruff rule) to prevent future clock-arithmetic bugs in test session helpers | operator | 2026-05-23 | file after #2006 merges |
| 5 | Merge #2004 (gitleaks), then #2006 (CI fixes), then rebase and merge blocked PRs in sequence: #2002, #2001, #1998, #1999, #2000 | operator | 2026-05-13 | — |
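
One possible shape for action item 1 is a small script that fails when a workflow lacks a required pre-migration statement. This is a minimal sketch and not an agreed design: the function name, the plain substring matching, and the in-memory example workflows (including the `psql` and `alembic` command lines) are assumptions made for illustration. Only the `CREATE ROLE raptor_app` statement comes from this report.

```python
# Statement named in this report; extend the list as migrations add requirements.
REQUIRED_SNIPPETS = ["CREATE ROLE raptor_app"]


def missing_snippets(workflows, snippets=REQUIRED_SNIPPETS):
    """Map each workflow name to the required snippets its text lacks.

    `workflows` is a dict of {name: file text}. Workflows that contain
    every snippet are left out of the result.
    """
    gaps = {}
    for name, text in workflows.items():
        absent = [s for s in snippets if s not in text]
        if absent:
            gaps[name] = absent
    return gaps


# In-memory stand-ins for the two workflow files (hypothetical content);
# a real check would read the files from the repository instead.
example = {
    "ci.yml": "run: alembic upgrade head\n",
    "ci-pr.yml": (
        "run: psql -c 'CREATE ROLE raptor_app NOLOGIN'\n"
        "run: alembic upgrade head\n"
    ),
}

print(missing_snippets(example))
```

Substring matching is deliberately crude. It would catch the divergence described in contributing factor 1, but it cannot tell whether the statement runs before the migration step, so a stricter version would need to parse the workflow structure.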
References
- Fix PRs: #2004 (gitleaks), #2006 (CI baseline)
- Blocked PRs: #1998, #1999, #2000, #2001, #2002
- Gitleaks prior fix: #1819
- raptor_app role in ci-pr.yml: lines 452-459
- Related incident: `docs/incidents/2026-05-12-ci-pipeline-triple-failure.md`
- Sentry: no Sentry involvement (CI infrastructure failure, not application error)