RCA — Stuck Flag Promotion: console_heroku_log_drain_alerting

Incident ID: 2026-05-11-stuck-flag-promotion-console-heroku-log-drain-alerting Date: 2026-05-11 Severity: SEV-3 Duration: ~97h total (2026-05-07 00:00 UTC marked → 2026-05-11 14:21 UTC resolved) Blast radius: Operator UX only — the promotion queue UI was stuck showing "deploying..." for 4 days. The flag itself was already live in prod. No user-facing impact. No data at risk. Author: sre-agent

Summary

The flag console_heroku_log_drain_alerting was marked for promotion on 2026-05-07 00:00 UTC and manually promoted (TOTP-gated) on 2026-05-09 ~04:28 UTC. The Heroku PATCH to set FLAG_CONSOLE_HEROKU_LOG_DRAIN_ALERTING=true on raxx-console-prod succeeded — the value was confirmed true in prod. However the background daemon thread (promote_async / _do_heroku_work) was interrupted before it could write state=deployed and deployed_at back to the DB. The promotion row stayed in state=deploying indefinitely, with no timeout or watchdog to detect or resolve the stuck state. The operator surfaced this on 2026-05-11 ~14:18 UTC. Remediation was a manual UPDATE via heroku run, setting state=deployed and deployed_at=2026-05-11 14:21:43 UTC, with an audit row written to console.flag.promotion.ops_complete.

Timeline (all times UTC)

2026-05-07 00:00 — Promotion row created (state=pending). Flag: console_heroku_log_drain_alerting, target: raxx-console-prod, value: on.
2026-05-09 04:28 — Operator triggered the promote action (TOTP-gated). Row transitioned to state=deploying. Background thread (promote-async-1) started.
2026-05-09 04:28–04:30 (estimated) — Background thread called _do_heroku_work: _heroku_set_config_var completed (PATCH accepted by Heroku). FLAG_CONSOLE_HEROKU_LOG_DRAIN_ALERTING=true written to raxx-console-prod. Release poll returned True (new Heroku release detected).
2026-05-09 04:28–04:30 (estimated) — Dyno restart triggered by the config-var change. The promote-async-1 daemon thread was terminated mid-execution before db.session.commit() on the state=deployed transition landed.
2026-05-09 04:30 — Row remains state=deploying, deployed_at=NULL. No watchdog or expiry path covers the deploying state, so the row stays stuck.
2026-05-11 14:18 — Operator observed "/console/flags" showing "Deploying..." for ~4 days. Incident opened.
2026-05-11 14:18 — SRE-agent acknowledged. Code path read. heroku config:get confirmed FLAG_CONSOLE_HEROKU_LOG_DRAIN_ALERTING=true. DB query confirmed state=deploying, deployed_at=NULL, reason does not start with FAILED:.
2026-05-11 14:21 — Manual remediation executed via heroku run: state=deployed, deployed_at=2026-05-11 14:21:43 UTC. Audit row written: console.flag.promotion.ops_complete, id=6dc21e77-89e2-47ab-a63d-cb1fbd75f889.
2026-05-11 14:22 — Resolved.

Impact

Users affected: 0 (internal ops tooling only)
User-visible symptoms: None
Data integrity: ok — the flag value in prod was correct throughout
Revenue / billing: ok

What went well

The Heroku flag write itself succeeded, so the feature (console_heroku_log_drain_alerting) was live in prod for the full 4-day window without issue.
The reason column's absence of a FAILED: prefix was a reliable diagnostic signal that _mark_failed() was never called, pointing clearly to the daemon-thread-interrupted-mid-success failure mode.
heroku config:get confirmed the flag value before any remediation, making the roll-forward decision safe and unambiguous.
The DB schema's CHECK constraint correctly prevented storing state=failed — it surfaced the v1 limitation that deploying is the catch-all stuck state.

What didn't go well

The system has no watchdog or timeout for the deploying state. A promotion can stay stuck indefinitely.
promote_async spawns a daemon thread and returns 202. The dyno restart triggered by the Heroku config-var write is the exact event that terminates daemon threads. The system was guaranteed to lose the thread on the next config-var PATCH unless the db.session.commit() landed before the dyno recycled — a race of a few hundred milliseconds.
There is no alerting on promotions that remain deploying for more than a configurable threshold.
There is no operator-facing action button on the queue UI to manually mark a stuck promotion complete or failed.
The expiry_scheduler correctly handles pending and staged rows but has no logic for deploying rows — they are explicitly excluded from every expiry path.

Root cause analysis

Contributing factor 1: Daemon thread vs. dyno restart race. promote_async sets state=deploying, fires a daemon thread (promote-async-N), and returns 202. The Heroku PATCH inside that thread triggers an immediate dyno restart. Daemon threads are terminated when the process exits. The thread had approximately the restart lead time (typically 10–30 seconds for a graceful shutdown) to finish the release poll + db.session.commit(). In this case the thread lost the race. The system allowed a critical state transition to depend on a daemon thread completing within the same dyno lifecycle as the dyno-restart event it itself triggered.
Contributing factor 2: No deploying timeout or watchdog. The expiry scheduler (promotion_expiry_scheduler.py) processes pending and staged states but entirely skips deploying. The run_maintenance method does not check for deploying rows stuck beyond a threshold. The system allowed deploying to be a permanent terminal state for failed promotions — per the _mark_failed comment: "we leave the row in state='deploying' and store the error in reason."
Contributing factor 3: No operator escape hatch in the UI. The promotions queue UI shows "Deploying..." and a disabled button when state=deploying. There is no manual override action for stuck promotions. The operator had to surface this to L2 on-call rather than self-service.
Contributing factor 4: No alerting on long-running deploying state. A promotion stuck in deploying for >30 minutes should be a SEV-3 signal. No such alert exists.

Detection

What alerted us: Operator visual inspection of /console/flags/promotions — "Deploying..." for 4 days.
Time between cause (2026-05-09 ~04:29) and detection (2026-05-11 14:18): ~57h 49m
How to detect faster next time: implement the timeout watchdog (see action items). Alert after 30 minutes in deploying state.

Resolution

What was changed: console_flag_promotions row id=1 updated from state=deploying, deployed_at=NULL to state=deployed, deployed_at=2026-05-11 14:21:43+00:00. Executed via heroku run on raxx-console-prod. An audit row (console.flag.promotion.ops_complete, id=6dc21e77-89e2-47ab-a63d-cb1fbd75f889) was written to audit_log.
Validation: heroku config:get FLAG_CONSOLE_HEROKU_LOG_DRAIN_ALERTING -a raxx-console-prod returns true. DB re-read via heroku run confirms state=deployed, deployed_at=2026-05-11 14:21:43.011849+00:00.

Action items

#	Action	Owner	Due	Issue
1	Add timeout watchdog to `run_maintenance`: detect `deploying` rows older than 24h, surface via Slack CRIT	feature-developer	2026-05-25	#1639
2	Add 12h (warning) and 24h (CRIT) Slack alerts for promotions stuck in `deploying`	feature-developer	2026-05-25	#1639
3	Add operator "Mark completed" / "Mark failed" buttons on `/console/flags/promotions` for `deploying`-state rows (superadmin + TOTP)	feature-developer	2026-05-25	#1639
4	Add `rca_url` column to `console_flag_promotions` (nullable text, next migration)	feature-developer	2026-06-01	#1639
5	Add `PENDING_COMMIT:` checkpoint write in `_do_heroku_work` after successful Heroku PATCH to survive daemon-thread kill	feature-developer	2026-05-25	#1639

References

Runbook: docs/ops/runbooks/heroku.md — Failure mode F (updated this incident)
Flag definition: backend_v2/api/feature_flags.yaml — console_heroku_log_drain_alerting
Service code: console/app/services/promotions.py — promote_async, _do_heroku_work, _mark_failed
Scheduler: console/app/services/promotion_expiry_scheduler.py
Issue #1639 — watchdog feature card
Issue #1519 (bounded-poll spinner fix that introduced promote_async)
Issue #806 (expiry scheduler — covers pending/staged but not deploying)