RCA — Stuck Flag Promotion: console_heroku_log_drain_alerting
Incident ID: 2026-05-11-stuck-flag-promotion-console-heroku-log-drain-alerting Date: 2026-05-11 Severity: SEV-3 Duration: ~97h total (2026-05-07 00:00 UTC marked → 2026-05-11 14:21 UTC resolved) Blast radius: Operator UX only — the promotion queue UI was stuck showing "deploying..." for 4 days. The flag itself was already live in prod. No user-facing impact. No data at risk. Author: sre-agent
Summary
The flag console_heroku_log_drain_alerting was marked for promotion on 2026-05-07 00:00 UTC and manually promoted (TOTP-gated) on 2026-05-09 ~04:28 UTC. The Heroku PATCH to set FLAG_CONSOLE_HEROKU_LOG_DRAIN_ALERTING=true on raxx-console-prod succeeded — the value was confirmed true in prod. However the background daemon thread (promote_async / _do_heroku_work) was interrupted before it could write state=deployed and deployed_at back to the DB. The promotion row stayed in state=deploying indefinitely, with no timeout or watchdog to detect or resolve the stuck state. The operator surfaced this on 2026-05-11 ~14:18 UTC. Remediation was a manual UPDATE via heroku run, setting state=deployed and deployed_at=2026-05-11 14:21:43 UTC, with an audit row written to console.flag.promotion.ops_complete.
Timeline (all times UTC)
- 2026-05-07 00:00 — Promotion row created (state=pending). Flag:
console_heroku_log_drain_alerting, target:raxx-console-prod, value: on. - 2026-05-09 04:28 — Operator triggered the promote action (TOTP-gated). Row transitioned to
state=deploying. Background thread (promote-async-1) started. - 2026-05-09 04:28–04:30 (estimated) — Background thread called
_do_heroku_work:_heroku_set_config_varcompleted (PATCH accepted by Heroku).FLAG_CONSOLE_HEROKU_LOG_DRAIN_ALERTING=truewritten toraxx-console-prod. Release poll returned True (new Heroku release detected). - 2026-05-09 04:28–04:30 (estimated) — Dyno restart triggered by the config-var change. The
promote-async-1daemon thread was terminated mid-execution beforedb.session.commit()on thestate=deployedtransition landed. - 2026-05-09 04:30 — Row remains
state=deploying,deployed_at=NULL. No watchdog or expiry path covers thedeployingstate, so the row stays stuck. - 2026-05-11 14:18 — Operator observed "/console/flags" showing "Deploying..." for ~4 days. Incident opened.
- 2026-05-11 14:18 — SRE-agent acknowledged. Code path read.
heroku config:getconfirmedFLAG_CONSOLE_HEROKU_LOG_DRAIN_ALERTING=true. DB query confirmedstate=deploying,deployed_at=NULL,reasondoes not start withFAILED:. - 2026-05-11 14:21 — Manual remediation executed via
heroku run:state=deployed,deployed_at=2026-05-11 14:21:43 UTC. Audit row written:console.flag.promotion.ops_complete, id=6dc21e77-89e2-47ab-a63d-cb1fbd75f889. - 2026-05-11 14:22 — Resolved.
Impact
- Users affected: 0 (internal ops tooling only)
- User-visible symptoms: None
- Data integrity: ok — the flag value in prod was correct throughout
- Revenue / billing: ok
What went well
- The Heroku flag write itself succeeded, so the feature (
console_heroku_log_drain_alerting) was live in prod for the full 4-day window without issue. - The
reasoncolumn's absence of aFAILED:prefix was a reliable diagnostic signal that_mark_failed()was never called, pointing clearly to the daemon-thread-interrupted-mid-success failure mode. heroku config:getconfirmed the flag value before any remediation, making the roll-forward decision safe and unambiguous.- The DB schema's
CHECKconstraint correctly prevented storingstate=failed— it surfaced the v1 limitation thatdeployingis the catch-all stuck state.
What didn't go well
- The system has no watchdog or timeout for the
deployingstate. A promotion can stay stuck indefinitely. promote_asyncspawns a daemon thread and returns 202. The dyno restart triggered by the Heroku config-var write is the exact event that terminates daemon threads. The system was guaranteed to lose the thread on the next config-var PATCH unless thedb.session.commit()landed before the dyno recycled — a race of a few hundred milliseconds.- There is no alerting on promotions that remain
deployingfor more than a configurable threshold. - There is no operator-facing action button on the queue UI to manually mark a stuck promotion complete or failed.
- The
expiry_schedulercorrectly handlespendingandstagedrows but has no logic fordeployingrows — they are explicitly excluded from every expiry path.
Root cause analysis
- Contributing factor 1: Daemon thread vs. dyno restart race.
promote_asyncsetsstate=deploying, fires a daemon thread (promote-async-N), and returns 202. The Heroku PATCH inside that thread triggers an immediate dyno restart. Daemon threads are terminated when the process exits. The thread had approximately the restart lead time (typically 10–30 seconds for a graceful shutdown) to finish the release poll +db.session.commit(). In this case the thread lost the race. The system allowed a critical state transition to depend on a daemon thread completing within the same dyno lifecycle as the dyno-restart event it itself triggered. - Contributing factor 2: No
deployingtimeout or watchdog. The expiry scheduler (promotion_expiry_scheduler.py) processespendingandstagedstates but entirely skipsdeploying. Therun_maintenancemethod does not check fordeployingrows stuck beyond a threshold. The system alloweddeployingto be a permanent terminal state for failed promotions — per the_mark_failedcomment: "we leave the row in state='deploying' and store the error in reason." - Contributing factor 3: No operator escape hatch in the UI. The promotions queue UI shows "Deploying..." and a disabled button when
state=deploying. There is no manual override action for stuck promotions. The operator had to surface this to L2 on-call rather than self-service. - Contributing factor 4: No alerting on long-running
deployingstate. A promotion stuck indeployingfor >30 minutes should be a SEV-3 signal. No such alert exists.
Detection
- What alerted us: Operator visual inspection of
/console/flags/promotions— "Deploying..." for 4 days. - Time between cause (2026-05-09 ~04:29) and detection (2026-05-11 14:18): ~57h 49m
- How to detect faster next time: implement the timeout watchdog (see action items). Alert after 30 minutes in
deployingstate.
Resolution
- What was changed:
console_flag_promotionsrow id=1 updated fromstate=deploying, deployed_at=NULLtostate=deployed, deployed_at=2026-05-11 14:21:43+00:00. Executed viaheroku runonraxx-console-prod. An audit row (console.flag.promotion.ops_complete, id=6dc21e77-89e2-47ab-a63d-cb1fbd75f889) was written toaudit_log. - Validation:
heroku config:get FLAG_CONSOLE_HEROKU_LOG_DRAIN_ALERTING -a raxx-console-prodreturnstrue. DB re-read viaheroku runconfirmsstate=deployed,deployed_at=2026-05-11 14:21:43.011849+00:00.
Action items
| # | Action | Owner | Due | Issue |
|---|---|---|---|---|
| 1 | Add timeout watchdog to run_maintenance: detect deploying rows older than 24h, surface via Slack CRIT |
feature-developer | 2026-05-25 | #1639 |
| 2 | Add 12h (warning) and 24h (CRIT) Slack alerts for promotions stuck in deploying |
feature-developer | 2026-05-25 | #1639 |
| 3 | Add operator "Mark completed" / "Mark failed" buttons on /console/flags/promotions for deploying-state rows (superadmin + TOTP) |
feature-developer | 2026-05-25 | #1639 |
| 4 | Add rca_url column to console_flag_promotions (nullable text, next migration) |
feature-developer | 2026-06-01 | #1639 |
| 5 | Add PENDING_COMMIT: checkpoint write in _do_heroku_work after successful Heroku PATCH to survive daemon-thread kill |
feature-developer | 2026-05-25 | #1639 |
References
- Runbook:
docs/ops/runbooks/heroku.md— Failure mode F (updated this incident) - Flag definition:
backend_v2/api/feature_flags.yaml—console_heroku_log_drain_alerting - Service code:
console/app/services/promotions.py—promote_async,_do_heroku_work,_mark_failed - Scheduler:
console/app/services/promotion_expiry_scheduler.py - Issue #1639 — watchdog feature card
- Issue #1519 (bounded-poll spinner fix that introduced
promote_async) - Issue #806 (expiry scheduler — covers pending/staged but not deploying)