Auto-Ticketing Pipeline — Incident Response Runbook
System: Console auto-ticketing (FreeScout operations mailbox)
Owner: operator / sre-agent
Last incident: 2026-06-19 (see RCA docs/incidents/2026-06-19-investigate-no-tickets.md)
Last reviewed: 2026-06-19
Related runbooks:
- Pipeline overview: docs/ops/runbooks/auto-ticketing-pipeline-overview.md
- Rollout sequence: docs/ops/runbooks/auto-ticketing-rollout.md
- Mailbox provisioning: docs/ops/runbooks/freescout-operations-mailbox-provisioning.md
- FreeScout system: docs/ops/runbooks/freescout.md
- Feature flag ops: docs/ops/feature-flags-runbook.md
How to tell the pipeline is broken
- Ticket flood: FreeScout operations mailbox shows many duplicate tickets for the same surface or alert, all created within the same polling window.
- Tickets not creating: FLAG_CONSOLE_ALERTS_AUTO_TICKET is on, P2+ alerts are visible in Heroku logs, but no tickets appear in FreeScout. Or: the Investigate button is visible (flag on) but clicking it returns an error or produces no ticket.
- Stale Investigate ticket: An operator clicked Investigate hours ago. The surface recovered, but the FreeScout ticket is still open.
- FreeScout outage: Neither Investigate nor the alert pipeline creates tickets; Heroku logs show FreeScout API errors.
Diagnostic: check Heroku logs
For all failure modes, start with Heroku logs filtered to FreeScout-related output:
heroku logs --tail --app raxx-console-prod | grep -iE "freescout|auto_ticket|investigate|ticket_filed"
Expected normal output: ticket_filed log lines at INFO level when tickets are created, and no output when no tickets are being created.
Failure mode 1: Ticket flood
Symptom: The FreeScout operations mailbox fills with duplicate tickets for the same alert. Tickets are created every poll cycle (approximately every 30 seconds).
Cause: The in-process alert_key cache is not holding the key across poll cycles. This can happen if:
- The cache was cleared by a dyno restart between poll cycles (uncommon — restarts take > 30s normally, and the cache lives for the alert's active duration).
- The alert_id being emitted is non-deterministic, so alert_id[:16] produces a different value on every poll cycle for what is logically the same alert.
- The alert aggregator is creating new alert_id values for each emission of a persistent alert instead of reusing the same ID for the alert's lifetime.
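The causes above come down to whether the same logical alert maps to the same key on every poll. The following is a minimal sketch, not the aggregator's actual code. It assumes the key is alert_id[:16] as described above; stable_alert_id, DedupCache, and the sample source and rule names are hypothetical.

```python
import hashlib


def stable_alert_id(source: str, rule: str) -> str:
    """Hypothetical helper: derive the ID only from fields that do not change between polls."""
    return hashlib.sha256(f"{source}:{rule}".encode()).hexdigest()


class DedupCache:
    """Hypothetical in-process cache: at most one ticket per alert_key."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def should_file(self, alert_id: str) -> bool:
        alert_key = alert_id[:16]  # key derivation described in this runbook
        if alert_key in self._seen:
            return False  # cache hit: stay silent
        self._seen.add(alert_key)
        return True  # cache miss: file a ticket


cache = DedupCache()
alert_id = stable_alert_id("billing-api", "p2-latency")  # sample values only
results = [cache.should_file(alert_id) for _ in range(3)]
print(results)  # [True, False, False]
```

If the ID changed on every poll (for example, a fresh random value per emission), every call to should_file would return True, which is the flood pattern.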
Diagnose:
- Check the audit log for repeated filings with the same alert source but different alert_key values:
SELECT created_at, payload->>'alert_key' AS alert_key, payload->>'severity' AS severity
FROM audit_log
WHERE action = 'console.alerts.ticket_filed'
AND created_at > NOW() - INTERVAL '2 hours'
ORDER BY created_at DESC
LIMIT 20;
If alert_key values differ only in the last characters for what should be the same alert, the alert aggregator is issuing a new alert_id on each poll cycle.
- Confirm the in-process cache is being populated. Look for a cache-miss log pattern in Heroku logs:
heroku logs --app raxx-console-prod | grep "auto_ticket"
- Check whether the dyno restarted recently:
heroku ps --app raxx-console-prod
heroku logs --app raxx-console-prod | grep "State changed"
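When scanning the audit query output by eye is impractical, the flood pattern can be checked mechanically. This is a rough sketch under stated assumptions: rows are (created_at, alert_key) tuples taken from the SQL query above, the poll interval is the approximate 30 seconds cited in this runbook, and looks_like_flood with its tolerance and min_run thresholds is a hypothetical helper.

```python
from datetime import datetime, timedelta

POLL_INTERVAL = timedelta(seconds=30)  # approximate poll cycle from this runbook


def looks_like_flood(rows, tolerance=timedelta(seconds=10), min_run=3):
    """Return True if min_run or more filings with different keys arrive
    roughly one poll interval apart."""
    ordered = sorted(rows)
    run = 1
    for (prev_ts, prev_key), (ts, key) in zip(ordered, ordered[1:]):
        gap = ts - prev_ts
        if key != prev_key and abs(gap - POLL_INTERVAL) <= tolerance:
            run += 1
            if run >= min_run:
                return True
        else:
            run = 1  # cadence broken: start counting again
    return False


start = datetime(2026, 6, 19, 12, 0, 0)
flood_rows = [(start + i * POLL_INTERVAL, f"key{i}") for i in range(4)]
sparse_rows = [(start + timedelta(hours=i), f"key{i}") for i in range(4)]
print(looks_like_flood(flood_rows), looks_like_flood(sparse_rows))  # True False
```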
Immediate mitigation:
Flip FLAG_CONSOLE_ALERTS_AUTO_TICKET off to stop the flood while diagnosing:
# Via console UI (preferred — no secrets exposure)
# Navigate to /console/flags, toggle FLAG_CONSOLE_ALERTS_AUTO_TICKET off
# Via Heroku CLI (break-glass only)
heroku config:set FLAG_CONSOLE_ALERTS_AUTO_TICKET=0 --app raxx-console-prod >/dev/null 2>&1
Resolution:
If the alert_id non-determinism is confirmed, file a type:reliability GitHub issue against the alert aggregator to fix alert_id stability for persistent alerts. Do not flip the flag back on until the fix is deployed.
If the cache is being cleared by dyno restarts, consider whether the alert aggregator should persist the alert_key → ticket_id mapping to the console database so it survives restarts. File a type:reliability issue with this scope.
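To make the persistence option concrete for the issue write-up, here is a minimal sketch that uses sqlite3 as a stand-in for the console database. The table name alert_tickets and the helper functions are hypothetical; the real schema and database driver would differ.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the store; pass a file path so the mapping survives a process restart."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS alert_tickets ("
        "alert_key TEXT PRIMARY KEY, ticket_id INTEGER NOT NULL)"
    )
    return conn


def record_ticket(conn: sqlite3.Connection, alert_key: str, ticket_id: int) -> bool:
    """Store the mapping; return False if the key already has a ticket."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO alert_tickets (alert_key, ticket_id) VALUES (?, ?)",
        (alert_key, ticket_id),
    )
    conn.commit()
    return cur.rowcount == 1


def lookup_ticket(conn: sqlite3.Connection, alert_key: str):
    """Return the ticket_id for alert_key, or None if no ticket was filed."""
    row = conn.execute(
        "SELECT ticket_id FROM alert_tickets WHERE alert_key = ?", (alert_key,)
    ).fetchone()
    return row[0] if row else None
```

The primary key on alert_key means a second filing attempt for the same key is rejected by the database itself, so deduplication no longer depends on process memory.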
Verification:
After the fix is deployed, flip the flag back on and monitor the audit log. Confirm that repeated poll cycles for the same active alert produce at most one console.alerts.ticket_filed entry (subsequent cycles should be silent — cache hits).
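The verification check can be expressed as a small helper over the alert_key column returned by the audit query. This is a sketch only; duplicate_filings is a hypothetical name, and it assumes the keys are collected while a single alert stays active.

```python
from collections import Counter


def duplicate_filings(alert_keys):
    """Return {alert_key: count} for every key filed more than once."""
    counts = Counter(alert_keys)
    return {key: n for key, n in counts.items() if n > 1}


print(duplicate_filings(["a1", "b2", "a1"]))  # {'a1': 2}
```

An empty result means each active alert produced at most one console.alerts.ticket_filed entry.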
Failure mode 2: Tickets not creating
Symptom: FLAG_CONSOLE_ALERTS_AUTO_TICKET is on and P2+ alerts appear in Heroku logs, but no tickets appear in FreeScout. Or: FLAG_CONSOLE_INVESTIGATE_FROM_STATUS is on, the Investigate button is visible, but clicking it silently fails or returns an error message.
Diagnostic chain — work top to bottom:
Step 0: Confirm all six prerequisite env vars are set on the Heroku app
The Investigate path and alert auto-ticket path require ALL of the following on raxx-console-prod:
# Print variable names only, so secret values never reach the terminal
heroku config --app raxx-console-prod | grep -E "FREESCOUT|CF_ACCESS_SVC_TICKETS" | cut -d: -f1
Required vars (checked 2026-06-19 — all must be non-empty):
| Var | Source | Notes |
|---|---|---|
| FREESCOUT_API_URL | hardcoded https://tickets.raxx.app | Base URL for all API calls |
| FREESCOUT_INSTANCE_URL | hardcoded https://tickets.raxx.app | Used for ticket_url link construction |
| FREESCOUT_API_KEY | vault /MooseQuest/freescout/FREESCOUT_API_KEY | FreeScout API authentication |
| FREESCOUT_OPERATIONS_MAILBOX_ID | vault /MooseQuest/freescout/FREESCOUT_OPERATIONS_MAILBOX_ID | Must be 2 (Operations mailbox) |
| CF_ACCESS_SVC_TICKETS_CLIENT_ID | vault /MooseQuest/cloudflare/CF_ACCESS_SVC_TICKETS_CLIENT_ID | CF Access service token. REQUIRED because tickets.raxx.app/api/* is gated by CF Access app ca6fd315 (decision=non_identity). Without this, every API call returns HTTP 403 from CF Access before FreeScout sees it. |
| CF_ACCESS_SVC_TICKETS_CLIENT_SECRET | vault /MooseQuest/cloudflare/CF_ACCESS_SVC_TICKETS_CLIENT_SECRET | Paired with CLIENT_ID above |
If any of the above are missing, set them from vault (never echo secret values):
# Run inside a Heroku one-off dyno or via the vault_read_retry pattern:
# python3 scripts/agents/vault_read_retry.py (see that module's docstring)
# Then: heroku config:set VAR=value --app raxx-console-prod >/dev/null 2>&1
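The same prerequisite check can run inside the app or a one-off dyno without printing any secret. The sketch below uses the six variable names from this runbook; missing_vars is a hypothetical helper and is not part of the console codebase.

```python
import os

REQUIRED_VARS = (
    "FREESCOUT_API_URL",
    "FREESCOUT_INSTANCE_URL",
    "FREESCOUT_API_KEY",
    "FREESCOUT_OPERATIONS_MAILBOX_ID",
    "CF_ACCESS_SVC_TICKETS_CLIENT_ID",
    "CF_ACCESS_SVC_TICKETS_CLIENT_SECRET",
)


def missing_vars(env=None):
    """Return the names of required vars that are unset or blank. Values are never returned."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


if __name__ == "__main__":
    print("missing:", missing_vars())
```

Because only names are reported, the output is safe to paste into an incident channel.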
Step 1: Confirm FREESCOUT_API_KEY is set and valid
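# Test for a non-empty value without printing the key
[ -n "$(heroku config:get FREESCOUT_API_KEY --app raxx-console-prod)" ] && echo "set" || echo "EMPTY"
Must print "set". If it prints "EMPTY", retrieve the key from Infisical and set it (silenced):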
FS_KEY=$(infisical secrets get FREESCOUT_API_KEY --path /MooseQuest/freescout --env prod --plain)
heroku config:set FREESCOUT_API_KEY="$FS_KEY" --app raxx-console-prod >/dev/null 2>&1
Step 1.5: Confirm CF Access service-token headers are set and working
This is the most common silent failure mode (see RCA 2026-06-19-investigate-no-tickets). The tickets.raxx.app/api/* path is gated by CF Access application ca6fd315 (decision=non_identity). Without CF-Access-Client-Id and CF-Access-Client-Secret headers, every outbound call from the console to the FreeScout API returns HTTP 403 from CF Access, not from FreeScout. freescout_client.py logs this as freescout_auth_error.
Check that the service-token vars are set:
# Print variable names only, so secret values never reach the terminal
heroku config --app raxx-console-prod | grep CF_ACCESS_SVC_TICKETS | cut -d: -f1
# Should list both CF_ACCESS_SVC_TICKETS_CLIENT_ID and CF_ACCESS_SVC_TICKETS_CLIENT_SECRET
Validate the combination (CF Access + FreeScout API key together):
# Read values from vault (do not echo)
CID=$(infisical secrets get CF_ACCESS_SVC_TICKETS_CLIENT_ID --path /MooseQuest/cloudflare --env prod --plain)
CSEC=$(infisical secrets get CF_ACCESS_SVC_TICKETS_CLIENT_SECRET --path /MooseQuest/cloudflare --env prod --plain)
FS_KEY=$(infisical secrets get FREESCOUT_API_KEY --path /MooseQuest/freescout --env prod --plain)
HTTP_STATUS=$(curl -s -o /dev/null -w "%{http_code}" \
-H "CF-Access-Client-Id: $CID" \
-H "CF-Access-Client-Secret: $CSEC" \
-H "X-FreeScout-API-Key: $FS_KEY" \
-H "User-Agent: raxx-sre-probe/1.0" \
"https://tickets.raxx.app/api/conversations?page=1&pageSize=1")
echo "Status: $HTTP_STATUS"
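# Expected: 200 (both the CF Access gate and the FreeScout API key are valid)
# 302, or 403 from CF Access → CF Access is blocking (wrong service token or policy missing)
# 401, or 403 from FreeScout itself → CF Access passed but the FreeScout API key is invalid/missing
# To tell a CF Access 403 from a FreeScout 403, rerun without -o /dev/null and inspect the response body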
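For reference, the probe's headers and the status interpretation can be written down as plain functions. This is a sketch only: build_probe_headers and classify_probe are hypothetical helpers, and 403 is treated as ambiguous because this runbook attributes it to CF Access in some places and to FreeScout in others.

```python
def build_probe_headers(client_id: str, client_secret: str, api_key: str) -> dict:
    """Headers used by the curl probe in this step."""
    return {
        "CF-Access-Client-Id": client_id,
        "CF-Access-Client-Secret": client_secret,
        "X-FreeScout-API-Key": api_key,
    }


def classify_probe(status: int) -> str:
    """Map the probe's HTTP status to the most likely diagnosis."""
    if status == 200:
        return "ok"
    if status == 302:
        return "cf_access_blocked"
    if status == 401:
        return "freescout_key_invalid"
    if status == 403:
        return "cf_access_or_freescout_key"  # ambiguous: inspect the response body
    if 500 <= status <= 599:
        return "freescout_outage"  # see Failure mode 4
    return "unknown"


print(classify_probe(200), classify_probe(403))  # ok cf_access_or_freescout_key
```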
If CF Access service-token vars are missing, set them from vault (never echo):
CID=$(infisical secrets get CF_ACCESS_SVC_TICKETS_CLIENT_ID --path /MooseQuest/cloudflare --env prod --plain)
CSEC=$(infisical secrets get CF_ACCESS_SVC_TICKETS_CLIENT_SECRET --path /MooseQuest/cloudflare --env prod --plain)
heroku config:set CF_ACCESS_SVC_TICKETS_CLIENT_ID="$CID" CF_ACCESS_SVC_TICKETS_CLIENT_SECRET="$CSEC" \
--app raxx-console-prod >/dev/null 2>&1
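Validate the key against the FreeScout API (the CF Access service-token headers are still required for the request to reach FreeScout):
CID=$(infisical secrets get CF_ACCESS_SVC_TICKETS_CLIENT_ID --path /MooseQuest/cloudflare --env prod --plain)
CSEC=$(infisical secrets get CF_ACCESS_SVC_TICKETS_CLIENT_SECRET --path /MooseQuest/cloudflare --env prod --plain)
FS_KEY=$(infisical secrets get FREESCOUT_API_KEY --path /MooseQuest/freescout --env prod --plain)
curl -s -o /dev/null -w "%{http_code}" \
-H "CF-Access-Client-Id: $CID" \
-H "CF-Access-Client-Secret: $CSEC" \
-H "X-FreeScout-API-Key: $FS_KEY" \
https://tickets.raxx.app/api/mailboxes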
Expected: 200. A 401 means the key is invalid or revoked — rotate it per docs/ops/runbooks/rotation/ and redeploy. A 403 means CF Access is blocking (missing service-token headers) or the key lacks scope.
Step 2: Confirm FREESCOUT_OPERATIONS_MAILBOX_ID is set and valid
heroku config:get FREESCOUT_OPERATIONS_MAILBOX_ID --app raxx-console-prod
Must return a numeric value (2 for the Operations mailbox, per the Step 0 table). If empty, provision it per docs/ops/runbooks/freescout-operations-mailbox-provisioning.md.
Validate the mailbox ID exists:
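CID=$(infisical secrets get CF_ACCESS_SVC_TICKETS_CLIENT_ID --path /MooseQuest/cloudflare --env prod --plain)
CSEC=$(infisical secrets get CF_ACCESS_SVC_TICKETS_CLIENT_SECRET --path /MooseQuest/cloudflare --env prod --plain)
FS_KEY=$(infisical secrets get FREESCOUT_API_KEY --path /MooseQuest/freescout --env prod --plain)
MBID=$(heroku config:get FREESCOUT_OPERATIONS_MAILBOX_ID --app raxx-console-prod)
curl -s -o /dev/null -w "%{http_code}" \
-H "CF-Access-Client-Id: $CID" \
-H "CF-Access-Client-Secret: $CSEC" \
-H "X-FreeScout-API-Key: $FS_KEY" \
"https://tickets.raxx.app/api/mailboxes/${MBID}"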
Expected: 200. A 404 means the mailbox was deleted or the ID is wrong — re-run the provisioning SOP.
Step 3: Confirm the alert aggregator emit hook is wired (Entry point B only)
heroku logs --app raxx-console-prod | grep "alert_aggregator emit"
If P2+ alerts appear in logs but no alert_aggregator emit entries are present, the emit hook for FreeScout ticketing was not registered or was lost in a deploy. This requires a code-level investigation — escalate to the feature developer if you cannot find the registration in console/app/services/alert_aggregator.py.
Step 4: Confirm Slack DM is still firing (Entry point B only)
# Check the Slack DM channel for recent P2+ alert messages
# Expected: Slack DM fires even when FLAG_CONSOLE_ALERTS_AUTO_TICKET is off or the ticket API fails
If Slack DMs are firing but tickets are not: the issue is isolated to the FreeScout API call. The most common causes are Step 1 (invalid API key), Step 1.5 (missing or invalid CF Access service token), and Step 2 (missing mailbox ID).
If Slack DMs are also not firing: the alert pipeline itself is broken, not just the ticketing hook. Diagnose the alert aggregator separately — this is outside the scope of the auto-ticketing runbook. See docs/ops/runbooks/console-status-poller.md.
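The two observations in this step reduce to a small decision table. The sketch below restates it as code; triage and its return labels are hypothetical and exist only to make the branching explicit.

```python
def triage(slack_dm_fired: bool, ticket_created: bool) -> str:
    """Decide where to look next for Entry point B."""
    if ticket_created:
        return "healthy"
    if slack_dm_fired:
        # Alert pipeline works; only the FreeScout call is failing (Steps 1, 1.5, 2).
        return "freescout_call_failing"
    # Neither fired: diagnose the alert aggregator, not the ticketing hook.
    return "alert_pipeline_broken"


print(triage(slack_dm_fired=True, ticket_created=False))  # freescout_call_failing
```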
Step 5: Check Heroku logs for explicit FreeScout API errors
heroku logs --app raxx-console-prod | grep -iE "freescout|ticket|HTTPError|ConnectionError"
Look for HTTP error codes (4xx or 5xx), connection timeouts, or SSL errors. These indicate either a FreeScout outage (see Failure mode 4 below) or an API authentication issue.
Failure mode 3: Stale Investigate ticket
Symptom: An operator clicked Investigate on a degraded tile. The surface has since recovered (tile is now GREEN / OPERATIONAL), but the FreeScout ticket is still open.
Cause: The auto-ticketing pipeline creates tickets but does not automatically resolve them. Ticket resolution is always an operator action. The pipeline has no awareness of surface recovery events — it only reacts to degraded states at click time.
Resolution:
The operator must manually mark the ticket resolved in FreeScout:
- Open https://tickets.raxx.app.
- Find the ticket (use the auto:status tag filter or search by surface name).
- Mark the conversation as Resolved.
No code change is needed. This is expected behavior.
Future Investigate clicks:
If the same surface degrades again after the operator resolved the ticket, clicking Investigate will create a new ticket. The 60-minute idempotency window is checked against open tickets only. A resolved ticket is not considered "open" for dedup purposes, so a new degradation event on the same surface always results in a fresh ticket.
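The dedup rule can be illustrated with a short sketch. It is not the console's implementation: the Ticket shape and the status strings are hypothetical, and it assumes an open ticket older than 60 minutes no longer blocks a new one, which this runbook does not state explicitly.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

IDEMPOTENCY_WINDOW = timedelta(minutes=60)  # window stated in this runbook


@dataclass
class Ticket:
    surface: str
    created_at: datetime
    status: str  # "open" or "resolved" (hypothetical status values)


def should_create_ticket(surface: str, existing: list, now: datetime) -> bool:
    """Create a ticket unless an open one for this surface exists inside the window."""
    for ticket in existing:
        in_window = now - ticket.created_at <= IDEMPOTENCY_WINDOW
        if ticket.surface == surface and ticket.status == "open" and in_window:
            return False  # duplicate click: reuse the open ticket
    return True  # resolved tickets never block a new ticket


now = datetime(2026, 6, 19, 12, 0, 0)
open_recent = Ticket("api", now - timedelta(minutes=30), "open")
resolved_recent = Ticket("api", now - timedelta(minutes=30), "resolved")
print(should_create_ticket("api", [open_recent], now))      # False
print(should_create_ticket("api", [resolved_recent], now))  # True
```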
Guidance for operators:
Establish a practice of resolving Investigate tickets when the underlying surface returns to operational. A high count of open auto:status tickets in FreeScout indicates surfaces that recovered without anyone closing the loop.
Failure mode 4: FreeScout outage
Symptom: FreeScout (tickets.raxx.app) is returning 5xx, timing out, or is unreachable. Neither the Investigate button nor the alert pipeline can create tickets.
How the system degrades:
Both ticket-creating paths are designed to degrade gracefully when FreeScout is unavailable:
- Entry point A (Investigate button): The console returns HTTP 200 with {"degraded": true} in the response body. The operator sees an error message in the UI instead of a ticket URL. No exception is raised; the console remains fully functional.
- Entry point B (alerts auto-ticket): The FreeScout API call logs at WARNING level and continues. The Slack DM still fires. Ticket creation for the affected alert cycle is skipped silently.
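The Entry point A behavior described above might look roughly like the sketch below. It is an illustration under stated assumptions, not the console's handler: FreeScoutUnavailable, handle_investigate, the stub functions, and the placeholder URL are all hypothetical, and the success body shape is assumed.

```python
import logging

logger = logging.getLogger("freescout_client")


class FreeScoutUnavailable(Exception):
    """Hypothetical error raised when the FreeScout API call fails."""


def handle_investigate(create_ticket, surface: str):
    """Return (http_status, body). A FreeScout failure never propagates."""
    try:
        ticket_url = create_ticket(surface)
    except FreeScoutUnavailable as exc:
        # WARNING, not ERROR: the degraded path is expected behavior.
        logger.warning("freescout degraded for surface=%s: %s", surface, exc)
        return 200, {"degraded": True}
    return 200, {"ticket_url": ticket_url}


def fake_ok(surface):
    """Stub standing in for a successful FreeScout call."""
    return "https://tickets.example/placeholder"


def fake_down(surface):
    """Stub standing in for a FreeScout outage."""
    raise FreeScoutUnavailable("503 from upstream")


print(handle_investigate(fake_down, "api"))  # (200, {'degraded': True})
```

Catching only the client's own error type keeps unrelated bugs visible as ERROR-level entries, which is the signal the log check in this section relies on.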
Confirm graceful degradation:
heroku logs --app raxx-console-prod | grep -iE "freescout.*degraded|freescout.*WARNING|ticket.*failed"
Look for WARNING-level entries rather than ERROR-level. If you see unhandled exceptions or ERROR-level log entries related to the FreeScout client, the graceful path may not be working — file a type:reliability issue.
Resolve the FreeScout outage:
See docs/ops/runbooks/freescout.md for the FreeScout incident response procedure.
After FreeScout recovers:
Tickets are not retroactively created for the outage window. Alert cycles that fired during the outage will have Slack DMs but no corresponding FreeScout tickets. This gap is acceptable — the Slack DMs preserve the alert record. If the underlying alert is still active when FreeScout recovers, the next poll cycle will create a ticket normally (subject to cache state).
Escalation
Escalate to the operator when:
- The ticket flood cannot be stopped by flipping the flag off (flag flip is not taking effect within 30 seconds).
- The FreeScout API key is invalid and Infisical is unavailable for rotation.
- FreeScout is down for more than 30 minutes and the operator has not been notified.
- The alert aggregator emit hook is missing from the deployed codebase (requires code investigation).
- Audit log entries show console.alerts.ticket_filed but no corresponding tickets appear in FreeScout (possible FreeScout data loss or API response mismatch).
Who to contact: Kristerpher (operator)