Raxx · internal docs

Runbook — console-prod H12 + WORKER TIMEOUT Slack alerting

System: console (Heroku log-drain alerting service)
Issue: #1345
Incident that prompted this: docs/incidents/2026-05-07-console-prod-worker-thrash.md
Implementation: Option B, Heroku log drain webhook (Flask endpoint + Slack DM)
Flag: FLAG_CONSOLE_HEROKU_LOG_DRAIN_ALERTING (default off)


What this does

A Heroku log drain delivers syslog-formatted log lines from raxx-console-prod to the endpoint POST /api/internal/heroku-log-drain in the console app. The endpoint scans each line for the following patterns:

Pattern        | Threshold  | Window    | Alert destination
code=H12       | > 3 events | 5 minutes | Operator Slack DM (D0AJ7K184TV)
WORKER TIMEOUT | > 2 events | 2 minutes | Operator Slack DM (D0AJ7K184TV)
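
As a rough illustration of the matching step, the two patterns in the table above could be detected as shown here. This is a minimal sketch, not the console app's actual parser: the names PATTERNS and classify_line are hypothetical, and the word-boundary anchoring on code=H12 is an assumption.

```python
import re

# Patterns from the table above. The dictionary keys are illustrative labels,
# not identifiers taken from the console codebase.
PATTERNS = {
    "h12": re.compile(r"\bcode=H12\b"),
    "worker_timeout": re.compile(r"WORKER TIMEOUT"),
}


def classify_line(line: str) -> list[str]:
    """Return the alert types whose pattern appears in one log line."""
    return [name for name, pattern in PATTERNS.items() if pattern.search(line)]
```

A router line containing code=H12 yields ["h12"], a Gunicorn WORKER TIMEOUT line yields ["worker_timeout"], and any other line yields an empty list.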

Both alert types include:
- A description of the breach (count, window, threshold)
- App name (raxx-console-prod)
- UTC timestamp
- Runbook link: docs/ops/runbooks/console-status-poller.md (Failure mode H)

A 5-minute cooldown per alert type prevents duplicate messages during a sustained storm.
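
The threshold, window, and cooldown behaviour described above can be sketched as a sliding-window counter. This is an illustration under assumptions, not the shipped implementation: the class name ThresholdAlerter is hypothetical, state is held in process memory, and time is passed in explicitly so the logic is easy to test.

```python
from collections import deque


class ThresholdAlerter:
    """Sliding-window event counter with a per-alert cooldown (illustrative)."""

    def __init__(self, threshold: int, window_s: float, cooldown_s: float = 300.0):
        self.threshold = threshold      # alert when count is strictly greater than this
        self.window_s = window_s        # length of the sliding window, in seconds
        self.cooldown_s = cooldown_s    # minimum gap between two alerts, in seconds
        self._events = deque()          # timestamps of events still inside the window
        self._last_alert = None         # timestamp of the most recent alert, if any

    def record(self, now: float) -> bool:
        """Record one event at time `now` (seconds); return True if an alert should fire."""
        self._events.append(now)
        # Drop events that have aged out of the window.
        while self._events and now - self._events[0] > self.window_s:
            self._events.popleft()
        if len(self._events) <= self.threshold:
            return False
        # Suppress repeat alerts while the cooldown is still running.
        if self._last_alert is not None and now - self._last_alert < self.cooldown_s:
            return False
        self._last_alert = now
        return True


# Values from the table above: > 3 events in 5 minutes, > 2 events in 2 minutes.
h12_alerter = ThresholdAlerter(threshold=3, window_s=300)
worker_timeout_alerter = ThresholdAlerter(threshold=2, window_s=120)
```

With these settings the fourth H12 event inside five minutes fires an alert, and further events stay silent until the cooldown has elapsed.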


First-time setup (operator action required after merge)

1. Generate a drain HMAC secret

python3 -c "import secrets; print(secrets.token_hex(32))"

Store the output as HEROKU_LOG_DRAIN_HMAC_SECRET in Infisical at /MooseQuest/console-prod/HEROKU_LOG_DRAIN_HMAC_SECRET and in SSM (/raxx/console-prod/HEROKU_LOG_DRAIN_HMAC_SECRET) per feedback_aws_workloads_use_ssm_not_vault.md.

2. Set env vars on the Heroku app

SECRET=$(infisical secrets get HEROKU_LOG_DRAIN_HMAC_SECRET \
  --path /MooseQuest/console-prod --env prod --plain)
heroku config:set HEROKU_LOG_DRAIN_HMAC_SECRET="$SECRET" \
  --app raxx-console-prod >/dev/null 2>&1
heroku config:set FLAG_CONSOLE_HEROKU_LOG_DRAIN_ALERTING=true \
  --app raxx-console-prod >/dev/null 2>&1

3. Compute the drain token (one-time)

The drain URL uses ?drain_token=<HMAC(secret, logplex_token)>. You cannot know the Logplex-Drain-Token until after the drain is registered, so this is a three-step process (Steps A to C below):

Step A — Register a placeholder drain first to get the Logplex-Drain-Token:

heroku drains:add \
  "https://console.raxx.app/api/internal/heroku-log-drain?drain_token=placeholder" \
  --app raxx-console-prod

Heroku will assign a Logplex token. Retrieve it:

heroku drains --app raxx-console-prod
# Output:
# https://console.raxx.app/api/internal/heroku-log-drain?drain_token=placeholder (d.xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)

The token in parentheses (e.g. d.abc123...) is the Logplex-Drain-Token.

Step B — Compute the real drain_token and update the drain URL:

LOGPLEX_TOKEN="d.abc123..."   # paste the token from the output above
SECRET=$(heroku config:get HEROKU_LOG_DRAIN_HMAC_SECRET --app raxx-console-prod)
DRAIN_TOKEN=$(python3 -c "
import hashlib, hmac
print(hmac.new('$SECRET'.encode(), '$LOGPLEX_TOKEN'.encode(), hashlib.sha256).hexdigest())
")
echo "drain_token=$DRAIN_TOKEN"

Step C — Remove the placeholder drain and register with the correct token:

# Find the drain ID first
heroku drains --app raxx-console-prod
# Remove the placeholder
heroku drains:remove <drain_id> --app raxx-console-prod
# Register with the correct drain_token
heroku drains:add \
  "https://console.raxx.app/api/internal/heroku-log-drain?drain_token=$DRAIN_TOKEN" \
  --app raxx-console-prod

4. Verify the drain is delivering

# Tail logs and look for the drain receipt line
heroku logs --tail --app raxx-console-prod | grep heroku_log_drain

Expected output when a log line is delivered:

app[web.1]: heroku_log_drain: processed N lines from logplex delivery

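If you see 401 in Heroku's drain delivery log, the drain_token in the URL does not match the HMAC of the drain's Logplex token. Run heroku drains again to confirm the token shown in parentheses for the registered drain, then re-run Steps B and C with that value as LOGPLEX_TOKEN.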
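
For orientation, the check that produces the 401 can be sketched as below. This mirrors the HMAC-SHA256 computation from Step B; the function names are hypothetical, and the constant-time comparison is an assumption about how the endpoint compares tokens rather than a statement of what it does.

```python
import hashlib
import hmac


def expected_drain_token(secret: str, logplex_token: str) -> str:
    """HMAC-SHA256 of the Logplex token, keyed by the drain secret, as hex."""
    return hmac.new(secret.encode(), logplex_token.encode(), hashlib.sha256).hexdigest()


def drain_token_is_valid(secret: str, logplex_token: str, presented: str) -> bool:
    """Compare the presented drain_token with the expected one in constant time."""
    expected = expected_drain_token(secret, logplex_token)
    # Compare as bytes so a non-ASCII query value cannot raise TypeError.
    return hmac.compare_digest(expected.encode(), presented.encode())
```

Under this sketch, a drain_token computed from one Logplex token fails validation against any other Logplex token, which is why the value must be recomputed whenever the token changes.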


Smoke test (after setup)

Run manually from the console machine or a developer workstation with access:

./scripts/agent-tools/run_smoke.sh

For a targeted drain smoke test, use curl to POST a synthetic H12 line to the endpoint. First, compute a valid drain_token for a test Logplex token:

SECRET=$(heroku config:get HEROKU_LOG_DRAIN_HMAC_SECRET --app raxx-console-prod)
TEST_LOGPLEX="t.test-smoke-$(date +%s)"
DRAIN_TOKEN=$(python3 -c "
import hashlib, hmac
print(hmac.new('$SECRET'.encode(), '$TEST_LOGPLEX'.encode(), hashlib.sha256).hexdigest())
")
curl -sS -X POST \
  "https://console.raxx.app/api/internal/heroku-log-drain?drain_token=$DRAIN_TOKEN" \
  -H "Content-Type: application/logplex-1" \
  -H "Logplex-Drain-Token: $TEST_LOGPLEX" \
  -d '2026-05-07T08:06:12Z heroku[router]: at=error code=H12 desc="Request timeout"' \
  -w "\nHTTP %{http_code}\n"

Expected: HTTP 204

To trigger a full alert, send 4+ H12 lines in rapid succession and check for a Slack DM in the operator channel.
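
One way to script that burst is sketched below with the Python standard library. It reuses the URL, headers, and sample line from the curl example; the helper names build_smoke_request and send_h12_burst are hypothetical, and it assumes the endpoint counts one H12 event per POSTed line.

```python
import hashlib
import hmac
import urllib.request

ENDPOINT = "https://console.raxx.app/api/internal/heroku-log-drain"
H12_LINE = '2026-05-07T08:06:12Z heroku[router]: at=error code=H12 desc="Request timeout"'


def build_smoke_request(secret: str, logplex_token: str, line: str = H12_LINE) -> urllib.request.Request:
    """Build (but do not send) one synthetic drain delivery."""
    drain_token = hmac.new(secret.encode(), logplex_token.encode(), hashlib.sha256).hexdigest()
    return urllib.request.Request(
        f"{ENDPOINT}?drain_token={drain_token}",
        data=line.encode(),
        method="POST",
        headers={
            "Content-Type": "application/logplex-1",
            "Logplex-Drain-Token": logplex_token,
        },
    )


def send_h12_burst(secret: str, logplex_token: str, count: int = 4) -> list[int]:
    """POST `count` synthetic H12 lines and return the HTTP status codes.

    urlopen raises urllib.error.HTTPError on 4xx/5xx responses, so a 401 or 501
    surfaces as an exception rather than as a value in the returned list.
    """
    statuses = []
    for _ in range(count):
        request = build_smoke_request(secret, logplex_token)
        with urllib.request.urlopen(request, timeout=10) as response:
            statuses.append(response.status)
    return statuses
```

Calling send_h12_burst with the secret and a test Logplex token should return four 204 statuses if the endpoint behaves as described above, after which the Slack DM can be checked by hand.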


Implementation decision (Option B selected)

This implementation uses Option B (Heroku log drain webhook). The decision rationale:
- The console app has no Sentry SDK installed (Sentry integration is via the sentry_backend flag in Raptor, not Antlers/console).
- Heroku's built-in H12 metric threshold alerts (Option C) don't support WORKER TIMEOUT pattern matching (a Gunicorn-level log event, not a Heroku router error code).
- Option B reuses existing slack_notify.py infrastructure and requires no new dependencies.


Turning the alert off

heroku config:set FLAG_CONSOLE_HEROKU_LOG_DRAIN_ALERTING=false \
  --app raxx-console-prod >/dev/null 2>&1

The drain endpoint returns 501 when the flag is off. Heroku will log drain delivery failures, but the drain stays registered, so no re-registration is needed when you re-enable the flag.


Troubleshooting

Symptom | Likely cause | Fix
No alerts firing despite H12 storm | Flag off, Slack token missing, drain not registered, wrong drain_token | Check flag, SLACK_BOT_TOKEN, drain URL HMAC
HTTP 401 from drain delivery | drain_token in URL doesn't match HMAC(secret, logplex_token) | Re-compute drain_token and re-register drain
HTTP 501 from drain delivery | FLAG_CONSOLE_HEROKU_LOG_DRAIN_ALERTING=false or HEROKU_LOG_DRAIN_HMAC_SECRET unset | Set flag to true, set secret
Alert fires for non-prod apps | Log drain is registered against the wrong Heroku app | Only register on raxx-console-prod
Duplicate alerts every 5+ minutes | Cooldown period expired; storm is sustained | Normal; expected behavior for a sustained H12 storm
