Raxx · internal docs

internal · gated

RCA — Stuck Flag Promotion: console_heroku_log_drain_alerting

Incident ID: 2026-05-11-stuck-flag-promotion-console-heroku-log-drain-alerting Date: 2026-05-11 Severity: SEV-3 Duration: ~97h total (2026-05-07 00:00 UTC marked → 2026-05-11 14:21 UTC resolved) Blast radius: Operator UX only — the promotion queue UI was stuck showing "deploying..." for 4 days. The flag itself was already live in prod. No user-facing impact. No data at risk. Author: sre-agent

Summary

The flag console_heroku_log_drain_alerting was marked for promotion on 2026-05-07 00:00 UTC and manually promoted (TOTP-gated) on 2026-05-09 ~04:28 UTC. The Heroku PATCH to set FLAG_CONSOLE_HEROKU_LOG_DRAIN_ALERTING=true on raxx-console-prod succeeded — the value was confirmed true in prod. However the background daemon thread (promote_async / _do_heroku_work) was interrupted before it could write state=deployed and deployed_at back to the DB. The promotion row stayed in state=deploying indefinitely, with no timeout or watchdog to detect or resolve the stuck state. The operator surfaced this on 2026-05-11 ~14:18 UTC. Remediation was a manual UPDATE via heroku run, setting state=deployed and deployed_at=2026-05-11 14:21:43 UTC, with an audit row written to console.flag.promotion.ops_complete.

Timeline (all times UTC)

Impact

What went well

What didn't go well

Root cause analysis

Detection

Resolution

Action items

# Action Owner Due Issue
1 Add timeout watchdog to run_maintenance: detect deploying rows older than 24h, surface via Slack CRIT feature-developer 2026-05-25 #1639
2 Add 12h (warning) and 24h (CRIT) Slack alerts for promotions stuck in deploying feature-developer 2026-05-25 #1639
3 Add operator "Mark completed" / "Mark failed" buttons on /console/flags/promotions for deploying-state rows (superadmin + TOTP) feature-developer 2026-05-25 #1639
4 Add rca_url column to console_flag_promotions (nullable text, next migration) feature-developer 2026-06-01 #1639
5 Add PENDING_COMMIT: checkpoint write in _do_heroku_work after successful Heroku PATCH to survive daemon-thread kill feature-developer 2026-05-25 #1639

References