System: Console auto-ticketing (FreeScout operations mailbox)
Owner: operator / sre-agent
Last incident: — (new feature; update after first incident)
Last reviewed: 2026-05-05
Related runbooks:
- Pipeline overview: docs/ops/runbooks/auto-ticketing-pipeline-overview.md
- Rollout sequence: docs/ops/runbooks/auto-ticketing-rollout.md
- Mailbox provisioning: docs/ops/runbooks/freescout-operations-mailbox-provisioning.md
- FreeScout system: docs/ops/runbooks/freescout.md
- Feature flag ops: docs/ops/feature-flags-runbook.md
Typical symptoms: FLAG_CONSOLE_ALERTS_AUTO_TICKET is on and P2+ alerts are visible in Heroku logs, but no tickets appear in FreeScout; or the Investigate button is visible (flag on) but clicking it returns an error or produces no ticket.
For all failure modes, start with Heroku logs filtered to FreeScout-related output:
heroku logs --tail --app raxx-console-prod | grep -iE "freescout|auto_ticket|investigate|ticket_filed"
Expected normal output: ticket_filed log lines at INFO level when tickets are created. Silent when no tickets are being created.
Failure mode 1: Duplicate tickets for the same alert
Symptom: The FreeScout operations mailbox fills with duplicate tickets for the same alert. Tickets are created every poll cycle (approximately every 30 seconds).
Cause: The in-process alert_key cache is not holding the key across poll cycles (see the sketch after this list). This can happen if:
- The cache was cleared by a dyno restart between poll cycles (uncommon — restarts take > 30s normally, and the cache lives for the alert's active duration).
- The alert_id being emitted is non-deterministic, so alert_id[:16] produces a different value on every poll cycle for what is logically the same alert.
- The alert aggregator is creating new alert_id values for each emission of a persistent alert instead of reusing the same ID for the alert's lifetime.
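To make the intended cache behavior concrete, here is a minimal sketch of the dedup logic as described above. Alert, file_freescout_ticket, and _active_tickets are assumed names for illustration, not the actual console code:

```python
from dataclasses import dataclass

@dataclass
class Alert:
    alert_id: str
    severity: str

# In-process cache: lives only as long as the dyno process.
_active_tickets: dict[str, int] = {}

def file_freescout_ticket(alert: Alert) -> int:
    """Stub: POST to the FreeScout API and return the new ticket id."""
    raise NotImplementedError

def on_poll_cycle(alerts: list[Alert]) -> None:
    for alert in alerts:
        alert_key = alert.alert_id[:16]  # stable only if alert_id is stable
        if alert_key in _active_tickets:
            continue  # cache hit: a ticket was already filed for this alert
        _active_tickets[alert_key] = file_freescout_ticket(alert)
```

If alert_id changes on every emission, alert_key changes with it, the cache never hits, and every poll cycle files a new ticket: the flood described above.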
Diagnose:
Check recent alert_key values in the audit log:
SELECT created_at, payload->>'alert_key' AS alert_key, payload->>'severity' AS severity
FROM audit_log
WHERE action = 'console.alerts.ticket_filed'
AND created_at > NOW() - INTERVAL '2 hours'
ORDER BY created_at DESC
LIMIT 20;
If alert_key values differ only in the last characters for what should be the same alert, the alert aggregator is issuing a new alert_id on each poll cycle.
heroku logs --app raxx-console-prod | grep "auto_ticket"
heroku ps --app raxx-console-prod
heroku logs --app raxx-console-prod | grep "State changed"
Immediate mitigation:
Flip FLAG_CONSOLE_ALERTS_AUTO_TICKET off to stop the flood while diagnosing:
# Via console UI (preferred — no secrets exposure)
# Navigate to /console/flags, toggle FLAG_CONSOLE_ALERTS_AUTO_TICKET off
# Via Heroku CLI (break-glass only)
heroku config:set FLAG_CONSOLE_ALERTS_AUTO_TICKET=0 --app raxx-console-prod >/dev/null 2>&1
Resolution:
If the alert_id non-determinism is confirmed, file a type:reliability GitHub issue against the alert aggregator to fix alert_id stability for persistent alerts. Do not flip the flag back on until the fix is deployed.
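If the issue needs a sketch of the fix, one common approach is to derive alert_id from the alert's stable identity fields rather than from per-emission values. A minimal sketch, assuming hypothetical surface/check/severity fields:

```python
import hashlib

def stable_alert_id(surface: str, check: str, severity: str) -> str:
    # Hash only fields that identify the alert; never timestamps or
    # sequence numbers, which change on every emission.
    identity = f"{surface}:{check}:{severity}"
    return hashlib.sha256(identity.encode()).hexdigest()
```

With a stable id, alert_id[:16] yields the same alert_key on every poll cycle, so the dedup cache can hit.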
If the cache is being cleared by dyno restarts, consider whether the alert aggregator should persist the alert_key → ticket_id mapping to the console database so it survives restarts. File a type:reliability issue with this scope.
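If the persistence route is taken, the scope is roughly the following sketch; the table name, columns, and db helper are assumptions for the issue description, not a committed design:

```python
def ticket_for(db, alert_key: str) -> int | None:
    # Survives dyno restarts, unlike the in-process dict.
    row = db.execute(
        "SELECT ticket_id FROM auto_ticket_cache WHERE alert_key = %s",
        (alert_key,),
    ).fetchone()
    return row[0] if row else None

def remember_ticket(db, alert_key: str, ticket_id: int) -> None:
    db.execute(
        "INSERT INTO auto_ticket_cache (alert_key, ticket_id) VALUES (%s, %s) "
        "ON CONFLICT (alert_key) DO NOTHING",
        (alert_key, ticket_id),
    )
```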
Verification:
After the fix is deployed, flip the flag back on and monitor the audit log. Confirm that repeated poll cycles for the same active alert produce at most one console.alerts.ticket_filed entry (subsequent cycles should be silent — cache hits).
Failure mode 2: Flag on, alerts firing, no tickets created
Symptom: FLAG_CONSOLE_ALERTS_AUTO_TICKET is on and P2+ alerts appear in Heroku logs, but no tickets appear in FreeScout. Or: FLAG_CONSOLE_INVESTIGATE_FROM_STATUS is on, the Investigate button is visible, but clicking it silently fails or returns an error message.
Diagnostic chain — work top to bottom:
Step 1: Confirm the FreeScout API key is set:
heroku config:get FREESCOUT_API_KEY --app raxx-console-prod
Must return a non-empty value. If empty, retrieve from Infisical and set (silenced):
FS_KEY=$(infisical secrets get FREESCOUT_API_KEY --path /MooseQuest/freescout --env prod --plain)
heroku config:set FREESCOUT_API_KEY="$FS_KEY" --app raxx-console-prod >/dev/null 2>&1
Validate the key against the FreeScout API:
FS_KEY=$(infisical secrets get FREESCOUT_API_KEY --path /MooseQuest/freescout --env prod --plain)
curl -s -o /dev/null -w "%{http_code}" \
-H "Authorization: Bearer $FS_KEY" \
https://tickets.raxx.app/api/mailboxes
Expected: 200. A 401 means the key is invalid or revoked — rotate it per docs/ops/runbooks/rotation/ and redeploy. A 403 means the key lacks the required scope — generate a new key with full API access in FreeScout Admin → API Keys.
Step 2: Confirm the operations mailbox ID is set:
heroku config:get FREESCOUT_OPERATIONS_MAILBOX_ID --app raxx-console-prod
Must return a numeric value (e.g. 3). If empty, provision it per docs/ops/runbooks/freescout-operations-mailbox-provisioning.md.
Validate the mailbox ID exists:
FS_KEY=$(infisical secrets get FREESCOUT_API_KEY --path /MooseQuest/freescout --env prod --plain)
MBID=$(heroku config:get FREESCOUT_OPERATIONS_MAILBOX_ID --app raxx-console-prod)
curl -s -o /dev/null -w "%{http_code}" \
-H "Authorization: Bearer $FS_KEY" \
"https://tickets.raxx.app/api/mailboxes/${MBID}"
Expected: 200. A 404 means the mailbox was deleted or the ID is wrong — re-run the provisioning SOP.
heroku logs --app raxx-console-prod | grep "alert_aggregator emit"
If P2+ alerts appear in logs but no alert_aggregator emit entries are present, the emit hook for FreeScout ticketing was not registered or was lost in a deploy. This requires a code-level investigation — escalate to the feature developer if you cannot find the registration in console/app/services/alert_aggregator.py.
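When reading alert_aggregator.py, the shape to look for is roughly the following; all names are illustrative assumptions, not the actual code:

```python
class AlertAggregator:
    """Sketch: hooks registered at startup run on every alert emission."""

    def __init__(self):
        self._emit_hooks = []

    def on_emit(self, hook):
        self._emit_hooks.append(hook)

    def emit(self, alert):
        for hook in self._emit_hooks:
            hook(alert)

# Expected registrations (hypothetical names). If the FreeScout hook is
# missing, emit() still runs the Slack hook, so Slack DMs fire while no
# tickets are created (compare Step 4).
# aggregator.on_emit(slack_dm_hook)
# aggregator.on_emit(freescout_ticket_hook)
```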
Step 4: Confirm Slack DMs are firing:
# Check the Slack DM channel for recent P2+ alert messages
# Expected: Slack DM fires even when FLAG_CONSOLE_ALERTS_AUTO_TICKET is off or the ticket API fails
If Slack DMs are firing but tickets are not: the issue is isolated to the FreeScout API call. The most common causes are Step 1 (invalid API key) and Step 2 (missing mailbox ID).
If Slack DMs are also not firing: the alert pipeline itself is broken, not just the ticketing hook. Diagnose the alert aggregator separately — this is outside the scope of the auto-ticketing runbook. See docs/ops/runbooks/console-status-poller.md.
heroku logs --app raxx-console-prod | grep -iE "freescout|ticket|HTTPError|ConnectionError"
Look for HTTP error codes (4xx or 5xx), connection timeouts, or SSL errors. These indicate either a FreeScout outage (see Failure mode 4 below) or an API authentication issue.
Failure mode 3: Ticket remains open after the surface recovers
Symptom: An operator clicked Investigate on a degraded tile. The surface has since recovered (tile is now GREEN / OPERATIONAL), but the FreeScout ticket is still open.
Cause: The auto-ticketing pipeline creates tickets but does not automatically resolve them. Ticket resolution is always an operator action. The pipeline has no awareness of surface recovery events — it only reacts to degraded states at click time.
Resolution:
The operator must manually mark the ticket resolved in FreeScout:
Find the ticket at https://tickets.raxx.app (use the auto:status tag filter or search by surface name) and mark it resolved. No code change is needed; this is expected behavior.
Future Investigate clicks:
If the same surface degrades again after the operator resolved the ticket, clicking Investigate will create a new ticket. The 60-minute idempotency window is checked against open tickets only. A resolved ticket is not considered "open" for dedup purposes, so a new degradation event on the same surface always results in a fresh ticket.
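A minimal sketch of that dedup rule, with a hypothetical ticket shape; the real check consults FreeScout's open tickets:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

IDEMPOTENCY_WINDOW = timedelta(minutes=60)

@dataclass
class OpenTicket:  # hypothetical shape of an open auto:status ticket
    surface: str
    created_at: datetime

def should_create_ticket(surface: str, open_tickets: list[OpenTicket]) -> bool:
    cutoff = datetime.now(timezone.utc) - IDEMPOTENCY_WINDOW
    # Only OPEN tickets are consulted; a resolved ticket never blocks a
    # fresh one, so re-degradation after resolution files a new ticket.
    return not any(
        t.surface == surface and t.created_at > cutoff for t in open_tickets
    )
```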
Guidance for operators:
Establish a practice of resolving Investigate tickets when the underlying surface returns to operational. A high count of open auto:status tickets in FreeScout indicates surfaces that recovered without anyone closing the loop.
Failure mode 4: FreeScout outage
Symptom: FreeScout (tickets.raxx.app) is returning 5xx, timing out, or is unreachable. Neither the Investigate button nor the alert pipeline can create tickets.
How the system degrades:
Both ticket-creating paths are designed to degrade gracefully when FreeScout is unavailable:
HTTP 200 with {"degraded": true} in the response body. The operator sees an error message in the UI instead of a ticket URL. No exception is raised; the console remains fully functional.WARNING level and continues. The Slack DM still fires. Ticket creation for the affected alert cycle is skipped silently.Confirm graceful degradation:
heroku logs --app raxx-console-prod | grep -iE "freescout.*degraded|freescout.*WARNING|ticket.*failed"
Look for WARNING-level entries rather than ERROR-level. If you see unhandled exceptions or ERROR-level log entries related to the FreeScout client, the graceful path may not be working — file a type:reliability issue.
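For reference, the graceful path described above amounts to the following contract. This is a sketch of the intent, assuming a requests-based client; it is not the actual client code:

```python
import logging
import requests

log = logging.getLogger("freescout")

def try_file_ticket(url: str, headers: dict, payload: dict) -> dict:
    try:
        resp = requests.post(url, headers=headers, json=payload, timeout=10)
        resp.raise_for_status()
        return {"degraded": False, "ticket": resp.json()}
    except requests.RequestException as exc:
        # WARNING, not ERROR: the console stays up, the Slack DM still
        # fires, and this alert cycle's ticket is skipped.
        log.warning("freescout ticket creation failed: %s", exc)
        return {"degraded": True}
```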
Resolve the FreeScout outage:
See docs/ops/runbooks/freescout.md for the FreeScout incident response procedure.
After FreeScout recovers:
Tickets are not retroactively created for the outage window. Alert cycles that fired during the outage will have Slack DMs but no corresponding FreeScout tickets. This gap is acceptable — the Slack DMs preserve the alert record. If the underlying alert is still active when FreeScout recovers, the next poll cycle will create a ticket normally (subject to cache state).
Escalate to the operator when:
- The ticket flood cannot be stopped by flipping the flag off (flag flip is not taking effect within 30 seconds).
- The FreeScout API key is invalid and Infisical is unavailable for rotation.
- FreeScout is down for more than 30 minutes and the operator has not been notified.
- The alert aggregator emit hook is missing from the deployed codebase (requires code investigation).
- Audit log entries show console.alerts.ticket_filed but no corresponding tickets appear in FreeScout (possible FreeScout data loss or API response mismatch).
Who to contact: Kristerpher (operator)