ADR-0091: Alerting Source Selection and On-Call Agent Runtime
Status: Proposed
Date: 2026-05-15 UTC
Last updated: 2026-05-15 UTC (third amend — codename Watchpost → EyeTok; FLAG_CONSOLE_INCIDENTS locked)
Author: raxx-software-architect
Design doc: docs/architecture/eyetok-oncall-agent-2026-05-15.md
Request source: Operator morning standup 2026-05-15 UTC
Context
Raxx is T-8 days from v1 launch (2026-05-23 UTC). The operator needs an always-on triage layer that translates application signals into actionable Slack posts with a structured ticket trail. Two alerting sources are available: Sentry (already provisioned) and Splunk (operator-run EC2). The agent must run somewhere persistent with a webhook-addressable endpoint.
Three questions require architectural decisions:
- Which alerting source — Sentry, Splunk, or both layered?
- Where does the on-call agent (EyeTok) run?
- How does the agent authenticate to Slack, FreeScout, GitHub, and Sentry?
Decision 1: Sentry as primary alerting source (Splunk deferred)
Chosen: Sentry-primary. Splunk added in v1.1 only if infrastructure-layer blind spots materialise.
Rationale:
- Sentry is already wired to Raptor, Console, and Queue (Queue DSN added in #2142). It captures the error classes most likely to cause user-visible incidents at v1 scale.
- Sentry has native Slack integration and webhook alert rules. Zero-code baseline is achievable in hours.
- Splunk on EC2 requires forwarder configuration, saved-search maintenance, and a scheduled alerting mechanism. The operational cost is disproportionate to v1's ~10-user scale.
- Splunk Cloud alerting adds ~$150+/mo; self-hosted alerting scheduler adds EC2 maintenance. Neither is justified pre-launch.
- Sentry Team plan (~$26/mo) covers 10K errors/mo, sufficient for v1.
What Sentry does not cover: infrastructure-level signals (disk, memory, H12 dyno timeouts outside SDK scope, EC2 health). These are accepted blind spots for v1. If an infrastructure signal causes a user-visible incident and Sentry does not catch it, the blind spot is re-evaluated for v1.1.
Alternatives considered:
- Splunk-primary: Broader infrastructure coverage but slower to stand up, higher cost, and requires the operator to maintain a Splunk saved-search pipeline. Rejected for launch.
- Both layered at launch: Full coverage but doubles integration surface. Risk of duplicate alerts, alert fatigue, and two-system maintenance burden during a critical launch window. Rejected for launch; deferred to v1.1.
- CloudWatch + SNS: Would add another AWS dependency. Raptor/Queue are on Heroku, not EC2. Mismatch. Rejected.
Consequences:
- Infrastructure-layer blind spots accepted until v1.1 Splunk layer.
- Sentry alert rule maintenance (ownership of alert rule YAML or UI config) falls to the operator or sre-agent.
- Sentry error budget (10K errors/mo) must be monitored. If Raptor or Queue becomes noisy, sample-rate tuning is needed before the budget is exhausted.
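The budget check in the last consequence can be approximated with a linear projection. The following is a minimal sketch, not EyeTok's implementation: the function names are hypothetical, and the month-to-date error count is assumed to be supplied by the caller (for example, read manually from Sentry's usage page).

```python
import calendar
from datetime import datetime, timezone

MONTHLY_ERROR_BUDGET = 10_000  # plan quota cited in this ADR


def projected_monthly_errors(errors_so_far: int, now: datetime) -> float:
    """Linearly extrapolate the month-to-date error count to month end."""
    days_in_month = calendar.monthrange(now.year, now.month)[1]
    month_start = now.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    elapsed_days = (now - month_start).total_seconds() / 86_400
    if elapsed_days <= 0:
        return 0.0
    return errors_so_far * days_in_month / elapsed_days


def suggested_sample_rate(errors_so_far: int, now: datetime) -> float:
    """Factor in (0, 1] that scales the projection down to the budget.

    Rough heuristic only: it ignores errors already consumed this month.
    """
    projected = projected_monthly_errors(errors_so_far, now)
    if projected <= MONTHLY_ERROR_BUDGET:
        return 1.0
    return MONTHLY_ERROR_BUDGET / projected
```

A result below 1.0 signals that Raptor or Queue is on track to exhaust the budget and that sample-rate tuning should be scheduled.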
Decision 2: EyeTok runs as a Heroku worker dyno
Chosen: New Heroku apps raxx-eyetok-staging and raxx-eyetok-prod with a worker dyno.
Rationale:
- Heroku Standard-1X dyno ($25/mo) provides a persistent process with an HTTPS webhook endpoint — what Sentry requires to deliver alerts.
- Integrates with existing Heroku estate (same infra patterns, same heroku config:set credential management, same heroku logs observability).
- Crash recovery: Heroku restarts the dyno within seconds. EC2 process management (systemd, monit) would require more operator overhead.
- Anthropic-hosted cloud sessions (the operator's preferred Option B) are future work, not available for v1 (per project_claude_menu_current_vs_future.md).
- EC2 co-location with Splunk was considered but rejected: ties EyeTok uptime to Splunk EC2 uptime, which is the wrong dependency for an on-call agent. If Splunk EC2 is down, the on-call agent must still be reachable.
Alternatives considered:
- EC2 co-located with Splunk: Cheaper (no incremental cost), but creates operational coupling and requires systemd/process management the operator does not want to maintain. Rejected.
- Anthropic-hosted: Ideal long-term. Not available for v1. Deferred.
- Lambda/function: Stateless execution does not fit the conversation model (Slack thread = ongoing agent session). Cold start latency is unacceptable for critical alerts. Rejected.
Consequences:
- New Heroku apps raxx-eyetok-staging / raxx-eyetok-prod added to the estate. Follows naming convention from project_heroku_app_names.md.
- Standard-1X dyno + Postgres Mini = ~$30/mo incremental to Heroku bill.
- EyeTok must not share a Postgres instance with Raptor or Queue — it has its own Postgres Mini.
- Codename EyeTok locked 2026-05-15 UTC (phonetic callback to ITOC, operator's first job). Service identifier: eyetok (lowercase) for Heroku app names, vault paths, env vars, and log fields.
Decision 3: Credentials via Infisical vault path /Raxx/Eyetok/
Chosen: All EyeTok credentials sourced from Infisical at dyno startup, under a dedicated /Raxx/Eyetok/ path.
Rationale:
- Consistent with the platform-wide pattern: credentials in env/secret stores, never in files.
- Rotatable without redeploy (Infisical push → heroku config:set >/dev/null 2>&1 → dyno restart).
- The Anthropic Managed Agents Vault (reference_claude_secrets_vault.md) is the established pattern for agent credentials. ANTHROPIC_API_KEY follows this pattern.
- FreeScout API key already exists in vault (project_freescout_api_key.md) — EyeTok reads from the established path rather than creating a duplicate.
Alternatives considered:
- Heroku config vars only (no Infisical): Simpler bootstrap but loses the rotation-without-redeploy property and the vault audit trail. Rejected.
- AWS SSM: Appropriate for AWS-resident workloads (feedback_aws_workloads_use_ssm_not_vault.md). EyeTok is Heroku-resident, not AWS-resident. Rejected.
Consequences:
- Vault folder /Raxx/Eyetok/ must be created before EyeTok secrets are written (per feedback_vault_folder_must_exist.md).
- Credential rotation procedure must be documented in a runbook before launch.
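Because credentials reach the dyno as environment variables, a fail-fast check at startup makes a missing or unsynced secret visible immediately. This is a sketch under assumptions: only `ANTHROPIC_API_KEY` is named in this ADR; the other variable names and both function names are hypothetical placeholders.

```python
import os

# ANTHROPIC_API_KEY is named in this ADR; the remaining names are hypothetical.
REQUIRED_ENV_VARS = (
    "ANTHROPIC_API_KEY",
    "SLACK_BOT_TOKEN",
    "FREESCOUT_API_KEY",
    "GITHUB_TOKEN",
    "SENTRY_AUTH_TOKEN",
)


def missing_credentials(env=None) -> list:
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ENV_VARS if not env.get(name)]


def require_credentials(env=None) -> None:
    """Raise at startup if any credential is absent.

    Only variable names are reported, never values.
    """
    missing = missing_credentials(env)
    if missing:
        raise RuntimeError("missing credentials: " + ", ".join(missing))
```

Calling `require_credentials()` before the agent starts accepting alerts turns a bad rotation into an immediate, visible crash rather than a failure on the first incident.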
Decision 4: Severity-based channel routing (locked 2026-05-15 UTC)
Chosen: Four dedicated Slack channels replace the generic #ops-alerts / #ops-triage / #ops-digest structure. Channel assignment is determined by severity plus an ET business-hours window.
| Channel | Mapped severity | Agent behavior | Notify posture |
|---|---|---|---|
| #raxx-ops-alert-SEV1 | severity:critical | Post immediately, link to Emergency SOP, wait for human ack; no autonomous action | Slack mobile push + sound override; v1.1: PagerDuty page |
| #raxx-ops-alert-SEV2 | severity:high OUTSIDE ET business hours | Post + tag operator + propose remediation. Wait for human approval before any change | Slack mobile push (call-out) |
| #raxx-ops-alert-SEV2.5 | severity:high DURING ET business hours | Same as SEV2 but no mobile-push override (operator at desk) | Slack normal notification |
| #raxx-ops-alert-SEV3 | severity:medium + severity:low | Agent triages + remediates autonomously. Posts outcome only. No human ping. | Agent audit log only |
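The routing table above reduces to a small pure function. The sketch below is illustrative only; the function name `route_channel` and its `in_business_hours` parameter are assumptions, while the channel names and severity labels come from the table.

```python
SEV1 = "#raxx-ops-alert-SEV1"
SEV2 = "#raxx-ops-alert-SEV2"
SEV2_5 = "#raxx-ops-alert-SEV2.5"
SEV3 = "#raxx-ops-alert-SEV3"


def route_channel(severity: str, in_business_hours: bool) -> str:
    """Map a canonical severity label to its Slack channel."""
    if severity == "critical":
        return SEV1
    if severity == "high":
        # Only the high tier is split by the ET business-hours window.
        return SEV2_5 if in_business_hours else SEV2
    if severity in ("medium", "low"):
        return SEV3
    # Anything else (for example "info") has no channel in the table.
    raise ValueError(f"no channel mapped for severity {severity!r}")
```

Keeping the business-hours decision outside this function means the routing itself stays deterministic and easy to test.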
ET business hours window:
- 08:00–18:30 ET weekdays = business hours
- EDT (approx. Mar–Nov): 12:00–22:30 UTC weekdays
- EST (approx. Nov–Mar): 13:00–23:30 UTC weekdays
- Weekends = always SEV2 regardless of wall-clock time
EyeTok resolves the current ET offset at runtime to handle the EDT/EST transition automatically.
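One way to resolve the offset at runtime is to convert the UTC timestamp into the `America/New_York` zone and compare local wall-clock time, which sidesteps hardcoded UTC windows. This is a minimal sketch, not the SC-E3 implementation: the function name is hypothetical, and treating 18:30 as an exclusive upper bound is an assumption the ADR does not settle.

```python
from datetime import datetime, time, timezone
from zoneinfo import ZoneInfo

ET = ZoneInfo("America/New_York")  # tz database handles EDT/EST transitions
BUSINESS_START = time(8, 0)
BUSINESS_END = time(18, 30)  # assumed exclusive upper bound


def is_et_business_hours(now_utc: datetime) -> bool:
    """True when the instant falls within 08:00-18:30 ET on a weekday."""
    if now_utc.tzinfo is None:
        raise ValueError("now_utc must be timezone-aware")
    local = now_utc.astimezone(ET)
    if local.weekday() >= 5:  # Saturday=5, Sunday=6: never business hours
        return False
    return BUSINESS_START <= local.time() < BUSINESS_END
```

`zoneinfo` relies on a tz database being present; on platforms without a system copy, the `tzdata` package supplies one.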
SEV3 escalation invariant: The agent cannot autonomously promote a SEV3 incident to SEV2 or SEV1. If mid-triage the agent determines an incident is more severe, it posts escalation_recommended: SEV2 (or SEV1) to the SEV3 channel and waits for operator acknowledgement before re-posting to a higher-severity channel. This is enforced at the EyeTok application layer.
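The invariant can be enforced by making promotion impossible without a recorded acknowledgement. The sketch below illustrates the idea under stated assumptions: the `Incident` class and the three function names are hypothetical; only the `escalation_recommended` message and the `sev_channel` field name come from this ADR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Incident:
    sev_channel: str = "SEV3"
    escalation_recommended: Optional[str] = None
    operator_acked: bool = False


def recommend_escalation(incident: Incident, target: str) -> str:
    """Record a recommendation and return the text to post in the SEV3 channel."""
    if target not in ("SEV2", "SEV1"):
        raise ValueError("escalation target must be SEV2 or SEV1")
    incident.escalation_recommended = target
    incident.operator_acked = False  # any earlier ack does not carry over
    return f"escalation_recommended: {target}"


def acknowledge(incident: Incident) -> None:
    """Called only when the operator acknowledges the recommendation."""
    incident.operator_acked = True


def promote(incident: Incident) -> str:
    """Move the incident to the recommended channel; refuses without an ack."""
    if incident.escalation_recommended is None:
        raise PermissionError("no escalation has been recommended")
    if not incident.operator_acked:
        raise PermissionError("operator acknowledgement required")
    incident.sev_channel = incident.escalation_recommended
    incident.escalation_recommended = None
    incident.operator_acked = False
    return incident.sev_channel
```

Because `promote` is the only path that changes `sev_channel`, the agent has no code path that re-posts to a higher-severity channel on its own authority.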
PagerDuty: Deferred to v1.1. Pre-launch, Slack mobile push override on SEV1/SEV2 channels handles wake-up. PagerDuty Solo ($21/mo) is adopted only if a real after-hours SEV1 is missed.
Rationale:
- Explicit channel-per-severity means Slack notification overrides can be configured per-channel in the Slack app, not per-message. This is more reliable than per-message push flags.
- SEV2 vs SEV2.5 split eliminates disruptive mobile push during business hours while maintaining it for off-hours incidents when the operator may be asleep.
- SEV3 autonomous posture removes human-ping noise for medium/low incidents, which at v1 scale are expected to be the majority of alert volume.
- PagerDuty adds cost and integration overhead that is not justified until a real missed wake-up demonstrates the gap.
Alternatives considered:
- Single #ops-alerts channel with message-level priority flags: Simpler channel structure but Slack notification overrides are channel-scoped; you cannot reliably trigger sound overrides per-message. Rejected.
- Three channels (alerts/triage/digest) as originally proposed: Does not cleanly encode the business-hours split for SEV2 vs SEV2.5. Rejected in favour of the four-channel model.
- PagerDuty at launch: Added cost before need is demonstrated. Deferred.
Consequences:
- The four channels (#raxx-ops-alert-SEV1, #raxx-ops-alert-SEV2, #raxx-ops-alert-SEV2.5, #raxx-ops-alert-SEV3) must exist in the Slack workspace before SC-E2 can be claimed.
- EyeTok's channel-routing logic (SC-E3) must implement a runtime ET offset resolver. Hardcoding a single UTC window is rejected — it would silently mis-route during DST transitions.
- The incident_state schema gains a sev_channel column to record which channel an incident was routed to (SC-E7).
- Console severity badges (SC-E11) update from generic critical/high/medium/low to SEV1/SEV2/SEV2.5/SEV3 labels.
- Sentry native Slack integration (SC-E2 zero-code baseline) targets #raxx-ops-alert-SEV1 and #raxx-ops-alert-SEV2; the business-hours SEV2 vs SEV2.5 split is handled by EyeTok (SC-E3), not by Sentry's native integration.
Decision 5: FLAG_CONSOLE_INCIDENTS gated at launch (locked 2026-05-15 UTC)
Chosen: /console/incidents is behind FLAG_CONSOLE_INCIDENTS, default off. Flip on once EyeTok is verified live in prod.
Rationale:
- Consistent with the platform posture of hiding unavailable features rather than greying them out.
- EyeTok must be verified in production before the console surface is meaningful. Shipping the page without a live EyeTok backend would show an empty or erroring state.
- Follows the new-flag-needs-b1-migration-same-pr convention: the SC-E10 PR that ships the incidents console page must include a B1 promotion migration for FLAG_CONSOLE_INCIDENTS in the same PR.
Consequences:
- SC-E10 is blocked on both the EyeTok stub endpoint (SC-E3) and the flag + migration landing in the same PR.
- Default-off means no operator action needed at deploy time; the flag flip is a deliberate post-verification step.
Severity mapping (canonical)
The following mapping is canonical. Console, EyeTok, and FreeScout must all use the same severity labels.
| Sentry level | Scope | Mapped severity | SEV channel |
|---|---|---|---|
| fatal | any | critical | SEV1 |
| error (new issue, unhandled, production) | >0 users affected | high | SEV2 or SEV2.5 (business-hours split) |
| error (regression or performance alert) | | medium | SEV3 |
| warning | | low | SEV3 |
| info / debug | | info | audit log only |
| H12 timeout (via Sentry log integration) | | high | SEV2 or SEV2.5 |
| Paper gate breach (custom event) | | critical | SEV1 |
| Synthetic check failure | | high | SEV2 or SEV2.5 |
This table is the source of truth. Any deviation in Console badge colours, FreeScout tags, or Slack channel routing is a bug.
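Encoding the table once and importing it everywhere is one way to keep Console, EyeTok, and FreeScout aligned. The following sketch is illustrative: the `kind` strings, the function name, and the choice to return `None` for combinations the table does not cover are all assumptions, not part of the canonical mapping.

```python
from typing import Optional

# Hypothetical identifiers for the non-level rows of the table.
SPECIAL_EVENTS = {
    "paper_gate_breach": "critical",
    "h12_timeout": "high",
    "synthetic_check_failure": "high",
}


def map_severity(level: str, kind: str = "", users_affected: int = 0) -> Optional[str]:
    """Return the canonical severity, or None if the table has no matching row."""
    if level == "fatal":
        return "critical"  # fatal maps to critical for any scope
    if kind in SPECIAL_EVENTS:
        return SPECIAL_EVENTS[kind]
    if level == "error":
        if kind == "new_unhandled_production" and users_affected > 0:
            return "high"
        if kind in ("regression", "performance"):
            return "medium"
        return None  # e.g. a new issue with zero affected users: not in the table
    if level == "warning":
        return "low"
    if level in ("info", "debug"):
        return "info"
    return None
```

Returning `None` rather than guessing makes uncovered cases explicit, so they can be resolved by amending the table instead of diverging silently in code.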