System: console — Heroku log-drain alerting service
Issue: #1345
Incident that prompted this: docs/incidents/2026-05-07-console-prod-worker-thrash.md
Implementation: Option B — Heroku log drain webhook (Flask endpoint + Slack DM)
Flag: FLAG_CONSOLE_HEROKU_LOG_DRAIN_ALERTING (default off)
A Heroku log drain delivers syslog-formatted log lines from raxx-console-prod to
the POST /api/internal/heroku-log-drain endpoint in the console app. The
endpoint scans each line for the patterns below and alerts when a threshold is
exceeded within its window:
| Pattern | Threshold | Window | Alert destination |
|---|---|---|---|
| code=H12 | > 3 events | 5 minutes | Operator Slack DM (D0AJ7K184TV) |
| WORKER TIMEOUT | > 2 events | 2 minutes | Operator Slack DM (D0AJ7K184TV) |
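The pattern matching in the table can be sketched as follows. This is a minimal illustration, not the code in heroku_log_drain_alert.py: the alert-type names ("h12", "worker_timeout") and the classify_line helper are hypothetical, and only the two pattern strings come from the table.

```python
import re
from typing import Optional

# Pattern strings come from the table above; the dictionary keys are
# hypothetical alert-type names used only in this sketch.
PATTERNS = {
    "h12": re.compile(r"\bcode=H12\b"),
    "worker_timeout": re.compile(r"WORKER TIMEOUT"),
}


def classify_line(line: str) -> Optional[str]:
    """Return the alert type a log line matches, or None if it matches neither."""
    for alert_type, pattern in PATTERNS.items():
        if pattern.search(line):
            return alert_type
    return None
```

The word boundaries around code=H12 keep the sketch from matching a longer token such as code=H120; whether the real parser is that strict is not stated here.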
Both alert types include:
- A description of the breach (count, window, threshold)
- App name (raxx-console-prod)
- UTC timestamp
- Runbook link: docs/ops/runbooks/console-status-poller.md — Failure mode H
A 5-minute cooldown per alert type prevents duplicate messages during a sustained storm.
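One way to combine the thresholds, windows, and cooldown is a per-type sliding window. The sketch below assumes in-memory state in a single process and uses a hypothetical WindowAlerter class; the numbers come from the table and the cooldown sentence above, everything else is illustrative.

```python
from collections import deque
from typing import Deque, Dict

# Thresholds and windows from the table; cooldown from the prose.
# The keys are hypothetical alert-type names.
RULES = {
    "h12": {"threshold": 3, "window_s": 300},
    "worker_timeout": {"threshold": 2, "window_s": 120},
}
COOLDOWN_S = 300


class WindowAlerter:
    """Sliding-window event counter with a per-type alert cooldown."""

    def __init__(self) -> None:
        self._events: Dict[str, Deque[float]] = {k: deque() for k in RULES}
        self._last_alert: Dict[str, float] = {}

    def record(self, alert_type: str, now: float) -> bool:
        """Record one event at time `now` (seconds); return True if an alert should fire."""
        rule = RULES[alert_type]
        events = self._events[alert_type]
        events.append(now)
        # Drop events that have aged out of the window.
        while events and events[0] <= now - rule["window_s"]:
            events.popleft()
        # The table says "> N events", so the count must exceed the threshold.
        if len(events) <= rule["threshold"]:
            return False
        last = self._last_alert.get(alert_type)
        if last is not None and now - last < COOLDOWN_S:
            return False  # still cooling down; suppress the duplicate
        self._last_alert[alert_type] = now
        return True
```

With this sketch, the fourth H12 event inside five minutes returns True and later events return False until the cooldown lapses. State held this way is per process, so a deployment with several workers would likely need a shared store to count accurately.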
Generate the HMAC secret:
python3 -c "import secrets; print(secrets.token_hex(32))"
Store the output as HEROKU_LOG_DRAIN_HMAC_SECRET in Infisical at
/MooseQuest/console-prod/HEROKU_LOG_DRAIN_HMAC_SECRET and in SSM
(/raxx/console-prod/HEROKU_LOG_DRAIN_HMAC_SECRET) per
feedback_aws_workloads_use_ssm_not_vault.md.
Copy the secret from Infisical into the Heroku config, then enable the flag:
SECRET=$(infisical secrets get HEROKU_LOG_DRAIN_HMAC_SECRET \
  --path /MooseQuest/console-prod --env prod --plain)
heroku config:set HEROKU_LOG_DRAIN_HMAC_SECRET="$SECRET" \
  --app raxx-console-prod >/dev/null 2>&1
heroku config:set FLAG_CONSOLE_HEROKU_LOG_DRAIN_ALERTING=true \
  --app raxx-console-prod >/dev/null 2>&1
The drain URL carries ?drain_token=<HMAC(secret, logplex_token)>. The
Logplex-Drain-Token is not known until the drain is registered, so setup takes
three steps:
Step A — Register a placeholder drain first to get the Logplex-Drain-Token:
heroku drains:add \
"https://console.raxx.app/api/internal/heroku-log-drain?drain_token=placeholder" \
--app raxx-console-prod
Heroku will assign a Logplex token. Retrieve it:
heroku drains --app raxx-console-prod
# Output:
# https://console.raxx.app/api/internal/heroku-log-drain?drain_token=placeholder (d.xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)
The token in parentheses (e.g. d.abc123...) is the Logplex-Drain-Token.
Step B — Compute the real drain_token and update the drain URL:
LOGPLEX_TOKEN="d.abc123..." # paste the token from the output above
SECRET=$(heroku config:get HEROKU_LOG_DRAIN_HMAC_SECRET --app raxx-console-prod)
DRAIN_TOKEN=$(python3 -c "
import hashlib, hmac
print(hmac.new('$SECRET'.encode(), '$LOGPLEX_TOKEN'.encode(), hashlib.sha256).hexdigest())
")
echo "drain_token=$DRAIN_TOKEN"
Step C — Remove the placeholder drain and register with the correct token:
# Find the drain ID first
heroku drains --app raxx-console-prod
# Remove the placeholder
heroku drains:remove <drain_id> --app raxx-console-prod
# Register with the correct drain_token
heroku drains:add \
"https://console.raxx.app/api/internal/heroku-log-drain?drain_token=$DRAIN_TOKEN" \
--app raxx-console-prod
# Tail logs and look for the drain receipt line
heroku logs --tail --app raxx-console-prod | grep heroku_log_drain
Expected output when a log line is delivered:
app[web.1]: heroku_log_drain: processed N lines from logplex delivery
If you see 401 in Heroku's drain delivery log, the HMAC token is wrong.
Re-run Steps B and C with the correct LOGPLEX_TOKEN.
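For reference, the check behind a 401 can be sketched like this. It mirrors the HMAC-SHA256 computation used in Step B; the function names are hypothetical and the actual endpoint code may differ.

```python
import hashlib
import hmac


def compute_drain_token(secret: str, logplex_token: str) -> str:
    """HMAC-SHA256 of the Logplex token keyed by the shared secret, hex-encoded."""
    return hmac.new(secret.encode(), logplex_token.encode(), hashlib.sha256).hexdigest()


def drain_token_is_valid(secret: str, logplex_token: str, presented: str) -> bool:
    """Compare the presented drain_token with the expected one in constant time."""
    expected = compute_drain_token(secret, logplex_token)
    # Compare as bytes so a non-ASCII query value cannot raise TypeError.
    return hmac.compare_digest(expected.encode(), presented.encode())
```

In this sketch a drain still registered with drain_token=placeholder fails the comparison, which is why Step C has to replace it.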
Run the smoke suite manually from the console machine or a developer workstation with access:
./scripts/agent-tools/run_smoke.sh
For a targeted drain smoke test, use curl to POST a synthetic H12 line to the
endpoint. First, compute a valid drain_token for a test Logplex token:
SECRET=$(heroku config:get HEROKU_LOG_DRAIN_HMAC_SECRET --app raxx-console-prod)
TEST_LOGPLEX="t.test-smoke-$(date +%s)"
DRAIN_TOKEN=$(python3 -c "
import hashlib, hmac
print(hmac.new('$SECRET'.encode(), '$TEST_LOGPLEX'.encode(), hashlib.sha256).hexdigest())
")
curl -sS -X POST \
"https://console.raxx.app/api/internal/heroku-log-drain?drain_token=$DRAIN_TOKEN" \
-H "Content-Type: application/logplex-1" \
-H "Logplex-Drain-Token: $TEST_LOGPLEX" \
-d '2026-05-07T08:06:12Z heroku[router]: at=error code=H12 desc="Request timeout"' \
-w "\nHTTP %{http_code}\n"
Expected: HTTP 204
To trigger a full alert, send at least 4 H12 lines within 5 minutes (the threshold is > 3 events) and check for the operator Slack DM (D0AJ7K184TV).
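A burst like that can also be prepared from Python. The sketch below only builds the request; it assumes the endpoint counts newline-separated lines within one body, which is not confirmed here. If it does not, send the single-line request four times instead. The build_burst_request helper is hypothetical; the URL, headers, and log line come from the curl example above.

```python
import hashlib
import hmac
import urllib.request
from datetime import datetime, timezone

ENDPOINT = "https://console.raxx.app/api/internal/heroku-log-drain"


def build_burst_request(secret: str, logplex_token: str, count: int = 4) -> urllib.request.Request:
    """Build (but do not send) a POST carrying `count` synthetic H12 lines."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    line = f'{stamp} heroku[router]: at=error code=H12 desc="Request timeout"'
    # Assumption: one line per newline in a single delivery body.
    body = "\n".join([line] * count).encode()
    drain_token = hmac.new(secret.encode(), logplex_token.encode(), hashlib.sha256).hexdigest()
    return urllib.request.Request(
        f"{ENDPOINT}?drain_token={drain_token}",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/logplex-1",
            "Logplex-Drain-Token": logplex_token,
        },
    )
```

Passing the result to urllib.request.urlopen would send it; doing so against production should produce a real Slack DM once the threshold is crossed.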
This implementation uses Option B (Heroku log drain webhook). The decision
rationale:
- The console app has no Sentry SDK installed (Sentry integration is via the
sentry_backend flag in Raptor, not Antlers/console).
- Heroku's built-in H12 metric threshold alerts (Option C) don't support
WORKER TIMEOUT pattern matching (a Gunicorn-level log event, not a Heroku
router error code).
- Option B reuses existing slack_notify.py infrastructure and requires no
new dependencies.
To disable alerting, turn the flag off:
heroku config:set FLAG_CONSOLE_HEROKU_LOG_DRAIN_ALERTING=false \
  --app raxx-console-prod >/dev/null 2>&1
The drain endpoint returns 501 when the flag is off. Heroku will log drain delivery failures, but the drain remains registered, so deliveries are accepted again once you re-enable the flag.
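The status codes this document mentions (501, 401, 204) suggest a gate along the following lines. This is a sketch under assumptions: the order of the checks and the drain_status helper are hypothetical, and token_ok stands for the result of the HMAC comparison.

```python
from typing import Mapping

FLAG = "FLAG_CONSOLE_HEROKU_LOG_DRAIN_ALERTING"
SECRET = "HEROKU_LOG_DRAIN_HMAC_SECRET"


def drain_status(env: Mapping[str, str], token_ok: bool) -> int:
    """Pick the HTTP status for one drain delivery from config and auth state."""
    flag_on = env.get(FLAG, "false").lower() == "true"
    # Assumption: the feature gate is checked before authentication.
    if not flag_on or not env.get(SECRET):
        return 501  # feature disabled or not configured
    if not token_ok:
        return 401  # drain_token does not match the expected HMAC
    return 204  # accepted; nothing to return to Logplex
```

Checking the flag first means a disabled endpoint answers 501 regardless of the token, which matches the rollback behavior described above.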
| Symptom | Likely cause | Fix |
|---|---|---|
| No alerts firing despite H12 storm | Flag off, Slack token missing, drain not registered, wrong drain_token | Check flag, SLACK_BOT_TOKEN, drain URL HMAC |
| HTTP 401 from drain delivery | drain_token in URL doesn't match HMAC(secret, logplex_token) | Re-compute drain_token and re-register drain |
| HTTP 501 from drain delivery | FLAG_CONSOLE_HEROKU_LOG_DRAIN_ALERTING=false or HEROKU_LOG_DRAIN_HMAC_SECRET unset | Set flag to true, set secret |
| Alert fires for non-prod apps | Log drain is registered against the wrong Heroku app | Only register on raxx-console-prod |
| Duplicate alerts every 5+ minutes | Cooldown period expired; storm is sustained | Normal; expected behavior for a sustained H12 storm |
- docs/incidents/2026-05-07-console-prod-worker-thrash.md
- docs/ops/runbooks/console-status-poller.md
- console/app/services/slack_notify.py
- console/app/services/heroku_log_drain_alert.py
- console/app/blueprints/heroku_log_drain.py