Raxx · internal docs

internal · gated

ADR-0091: Alerting Source Selection and On-Call Agent Runtime

Status: Proposed
Date: 2026-05-15 UTC
Last updated: 2026-05-15 UTC (third amendment: codename Watchpost → EyeTok; FLAG_CONSOLE_INCIDENTS locked)
Author: raxx-software-architect
Design doc: docs/architecture/eyetok-oncall-agent-2026-05-15.md
Request source: Operator morning standup 2026-05-15 UTC


Context

Raxx is T-8 days from v1 launch (2026-05-23 UTC). The operator needs an always-on triage layer that translates application signals into actionable Slack posts with a structured ticket trail. Two alerting sources are available: Sentry (already provisioned) and Splunk (operator-run EC2). The agent must run somewhere persistent with a webhook-addressable endpoint.

Three questions require architectural decisions:

  1. Which alerting source — Sentry, Splunk, or both layered?
  2. Where does the on-call agent (EyeTok) run?
  3. How does the agent authenticate to Slack, FreeScout, GitHub, and Sentry?

Decision 1: Sentry as primary alerting source (Splunk deferred)

Chosen: Sentry-primary. Splunk added in v1.1 only if infrastructure-layer blind spots materialise.

Rationale:

  - Sentry is already wired to Raptor, Console, and Queue (Queue DSN added in #2142). It captures the error classes most likely to cause user-visible incidents at v1 scale.
  - Sentry has native Slack integration and webhook alert rules. Zero-code baseline is achievable in hours.
  - Splunk on EC2 requires forwarder configuration, saved-search maintenance, and a scheduled alerting mechanism. The operational cost is disproportionate to v1's ~10-user scale.
  - Splunk Cloud alerting adds ~$150+/mo; self-hosted alerting scheduler adds EC2 maintenance. Neither is justified pre-launch.
  - Sentry Team plan (~$26/mo) covers 10K errors/mo, which is sufficient for v1.

What Sentry does not cover: infrastructure-level signals (disk, memory, H12 dyno timeouts outside SDK scope, EC2 health). These are accepted blind spots for v1. If an infrastructure signal causes a user-visible incident and Sentry does not catch it, the blind spot is re-evaluated for v1.1.

Alternatives considered:

  - Splunk-primary: Broader infrastructure coverage but slower to stand up, higher cost, and requires the operator to maintain a Splunk saved-search pipeline. Rejected for launch.
  - Both layered at launch: Full coverage but doubles integration surface. Risk of duplicate alerts, alert fatigue, and two-system maintenance burden during a critical launch window. Rejected for launch; deferred to v1.1.
  - CloudWatch + SNS: Would add another AWS dependency. Raptor/Queue are on Heroku, not EC2. Mismatch. Rejected.

Consequences:

  - Infrastructure-layer blind spots accepted until v1.1 Splunk layer.
  - Sentry alert rule maintenance (ownership of alert rule YAML or UI config) falls to the operator or sre-agent.
  - Sentry error budget (10K errors/mo) must be monitored. If Raptor or Queue becomes noisy, sample-rate tuning is needed before the budget is exhausted.
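
The error-budget consequence above reduces to simple arithmetic. The following is a minimal sketch, assuming a linear projection of month-to-date volume; the function name, the 30-day month, and the 10% headroom are illustrative assumptions, not Sentry features or Raxx conventions.

```python
def recommended_sample_rate(
    events_so_far: int,
    days_elapsed: float,
    days_in_month: int = 30,        # assumption: flat 30-day billing month
    monthly_budget: int = 10_000,   # from this ADR: 10K errors/mo
    headroom: float = 0.10,         # assumption: keep 10% of the budget in reserve
) -> float:
    """Return a sample rate in [0, 1] that keeps projected usage inside the budget."""
    if days_elapsed <= 0:
        raise ValueError("days_elapsed must be positive")

    remaining_days = max(days_in_month - days_elapsed, 0)
    daily_rate = events_so_far / days_elapsed
    projected_remaining = daily_rate * remaining_days

    # Budget still available once the reserve is set aside.
    usable = monthly_budget * (1 - headroom) - events_so_far

    if projected_remaining <= 0:
        return 1.0   # nothing left to project; no sampling needed
    if usable <= 0:
        return 0.0   # reserve already reached; the projection allows no more events
    return min(1.0, usable / projected_remaining)
```

With 4,000 events after 10 days, the projection is 8,000 more events against 5,000 usable, so the sketch returns 0.625. A value like this could inform the SDK's error sample rate, but the threshold at which tuning starts is an operator decision.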


Decision 2: EyeTok runs as a Heroku worker dyno

Chosen: New Heroku apps raxx-eyetok-staging and raxx-eyetok-prod with a worker dyno.

Rationale:

  - Heroku Standard-1X dyno ($25/mo) provides a persistent process with an HTTPS webhook endpoint, which is what Sentry requires to deliver alerts.
  - Integrates with existing Heroku estate (same infra patterns, same heroku config:set credential management, same heroku logs observability).
  - Crash recovery: Heroku restarts the dyno within seconds. EC2 process management (systemd, monit) would require more operator overhead.
  - Anthropic-hosted cloud sessions (the operator's preferred Option B) are future work, not available for v1 (per project_claude_menu_current_vs_future.md).
  - EC2 co-location with Splunk was considered but rejected: ties EyeTok uptime to Splunk EC2 uptime, which is the wrong dependency for an on-call agent. If Splunk EC2 is down, the on-call agent must still be reachable.

Alternatives considered:

  - EC2 co-located with Splunk: Cheaper (no incremental cost), but creates operational coupling and requires systemd/process management the operator does not want to maintain. Rejected.
  - Anthropic-hosted: Ideal long-term. Not available for v1. Deferred.
  - Lambda/function: Stateless execution does not fit the conversation model (Slack thread = ongoing agent session). Cold start latency is unacceptable for critical alerts. Rejected.

Consequences:

  - New Heroku apps raxx-eyetok-staging / raxx-eyetok-prod added to the estate. Follows naming convention from project_heroku_app_names.md.
  - Standard-1X dyno + Postgres Mini = ~$30/mo incremental to Heroku bill.
  - EyeTok must not share a Postgres instance with Raptor or Queue; it has its own Postgres Mini.
  - Codename EyeTok locked 2026-05-15 UTC (phonetic callback to ITOC, operator's first job). Service identifier: eyetok (lowercase) for Heroku app names, vault paths, env vars, and log fields.


Decision 3: Credentials via Infisical vault path /Raxx/Eyetok/

Chosen: All EyeTok credentials sourced from Infisical at dyno startup, under a dedicated /Raxx/Eyetok/ path.

Rationale:

  - Consistent with the platform-wide pattern: credentials in env/secret stores, never in files.
  - Rotatable without redeploy (Infisical push → heroku config:set >/dev/null 2>&1 → dyno restart).
  - The Anthropic Managed Agents Vault (reference_claude_secrets_vault.md) is the established pattern for agent credentials. ANTHROPIC_API_KEY follows this pattern.
  - FreeScout API key already exists in vault (project_freescout_api_key.md), so EyeTok reads from the established path rather than creating a duplicate.

Alternatives considered:

  - Heroku config vars only (no Infisical): Simpler bootstrap but loses the rotation-without-redeploy property and the vault audit trail. Rejected.
  - AWS SSM: Appropriate for AWS-resident workloads (feedback_aws_workloads_use_ssm_not_vault.md). EyeTok is Heroku-resident, not AWS-resident. Rejected.

Consequences:

  - Vault folder /Raxx/Eyetok/ must be created before EyeTok secrets are written (per feedback_vault_folder_must_exist.md).
  - Credential rotation procedure must be documented in a runbook before launch.
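
Because rotation flows through heroku config:set, the credentials reach the EyeTok process as environment variables. The following is a minimal sketch of a fail-fast startup check under that assumption. Only ANTHROPIC_API_KEY is named in this ADR; the other variable names, the function, and the exception class are hypothetical placeholders.

```python
import os
from typing import Mapping, Optional

# ANTHROPIC_API_KEY comes from this ADR. The remaining names are hypothetical
# placeholders for the Slack, FreeScout, GitHub, and Sentry credentials.
REQUIRED_VARS = (
    "ANTHROPIC_API_KEY",
    "EYETOK_SLACK_BOT_TOKEN",
    "EYETOK_FREESCOUT_API_KEY",
    "EYETOK_GITHUB_TOKEN",
    "EYETOK_SENTRY_AUTH_TOKEN",
)


class MissingCredentialError(RuntimeError):
    """Raised at startup when one or more required credentials are absent."""


def load_credentials(env: Optional[Mapping[str, str]] = None) -> dict[str, str]:
    """Read every required credential from the environment or fail fast."""
    source = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not source.get(name)]
    if missing:
        # Report variable names only; secret values must never reach the logs.
        raise MissingCredentialError(
            "missing credentials: " + ", ".join(sorted(missing))
        )
    return {name: source[name] for name in REQUIRED_VARS}
```

Failing at startup rather than at first use means a bad rotation surfaces as an immediate dyno crash instead of a silent failure during an incident.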


Decision 4: Severity-based channel routing (locked 2026-05-15 UTC)

Chosen: Four dedicated Slack channels replace the generic #ops-alerts / #ops-triage / #ops-digest structure. Channel assignment is determined by severity plus an ET business-hours window.

| Channel | Mapped severity | Agent behavior | Notify posture |
|---|---|---|---|
| #raxx-ops-alert-SEV1 | severity:critical | Post immediately, link to Emergency SOP, wait for human ack; no autonomous action | Slack mobile push + sound override; v1.1: PagerDuty page |
| #raxx-ops-alert-SEV2 | severity:high OUTSIDE ET business hours | Post + tag operator + propose remediation. Wait for human approval before any change | Slack mobile push (call-out) |
| #raxx-ops-alert-SEV2.5 | severity:high DURING ET business hours | Same as SEV2 but no mobile-push override (operator at desk) | Slack normal notification |
| #raxx-ops-alert-SEV3 | severity:medium + severity:low | Agent triages + remediates autonomously. Posts outcome only. | No human ping. Agent audit log only |

ET business hours window:

  - 08:00–18:30 ET weekdays = business hours
  - EDT (approx. Mar–Nov): 12:00–22:30 UTC weekdays
  - EST (approx. Nov–Mar): 13:00–23:30 UTC weekdays
  - Weekends = always SEV2 regardless of wall-clock time

EyeTok resolves the current ET offset at runtime to handle the EDT/EST transition automatically.
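
The routing rule and runtime offset resolution can be sketched as follows. This is a minimal illustration, not the SC-E3 implementation: the function names are hypothetical, the window is treated as half-open (08:00 inclusive, 18:30 exclusive), and the America/New_York zone from the standard zoneinfo database stands in for "ET". Both the half-open boundary and the zone choice are assumptions.

```python
from datetime import datetime, time, timezone
from typing import Optional
from zoneinfo import ZoneInfo

ET = ZoneInfo("America/New_York")   # assumption: "ET" means this IANA zone
BUSINESS_OPEN = time(8, 0)
BUSINESS_CLOSE = time(18, 30)


def is_et_business_hours(now_utc: datetime) -> bool:
    """True on weekdays from 08:00 up to (not including) 18:30 Eastern Time."""
    if now_utc.tzinfo is None:
        raise ValueError("now_utc must be timezone-aware")
    local = now_utc.astimezone(ET)   # applies the EDT or EST offset for that date
    if local.weekday() >= 5:         # Saturday = 5, Sunday = 6
        return False
    return BUSINESS_OPEN <= local.time() < BUSINESS_CLOSE


def route_channel(severity: str, now_utc: datetime) -> Optional[str]:
    """Map a severity to its Slack channel; None means audit log only."""
    if severity == "critical":
        return "#raxx-ops-alert-SEV1"
    if severity == "high":
        if is_et_business_hours(now_utc):
            return "#raxx-ops-alert-SEV2.5"
        return "#raxx-ops-alert-SEV2"
    if severity in ("medium", "low"):
        return "#raxx-ops-alert-SEV3"
    if severity == "info":
        return None
    raise ValueError(f"unknown severity: {severity!r}")
```

Converting the UTC timestamp into the zone and comparing local wall-clock time avoids maintaining two UTC windows by hand. For example, a high-severity alert at 12:30 UTC lands in #raxx-ops-alert-SEV2.5 on a July weekday (08:30 EDT) but in #raxx-ops-alert-SEV2 on a January weekday (07:30 EST).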

SEV3 escalation invariant: The agent cannot autonomously promote a SEV3 incident to SEV2 or SEV1. If mid-triage the agent determines an incident is more severe, it posts escalation_recommended: SEV2 (or SEV1) to the SEV3 channel and waits for operator acknowledgement before re-posting to a higher-severity channel. This is enforced at the EyeTok application layer.
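
One way to enforce this invariant at the application layer is to split escalation into three steps: the agent recommends, the operator acknowledges, and only then is the incident promoted. The following is a minimal sketch under that assumption. The class, function, and field names are hypothetical, and the business-hours choice between the SEV2 and SEV2.5 channels is left out for brevity.

```python
from dataclasses import dataclass, field
from typing import Optional


class EscalationNotAcknowledged(Exception):
    """Raised when promotion is attempted without operator acknowledgement."""


@dataclass
class Incident:
    incident_id: str
    sev: str = "SEV3"
    escalation_recommended: Optional[str] = None
    operator_acked: bool = False
    posts: list = field(default_factory=list)   # (channel, message) pairs


def recommend_escalation(incident: Incident, target: str) -> None:
    """Agent-side step: record and post a recommendation, nothing more."""
    if incident.sev != "SEV3":
        raise ValueError("only SEV3 incidents use the recommendation path")
    if target not in ("SEV2", "SEV1"):
        raise ValueError(f"invalid escalation target: {target!r}")
    incident.escalation_recommended = target
    incident.operator_acked = False
    incident.posts.append(
        ("#raxx-ops-alert-SEV3", f"escalation_recommended: {target}")
    )


def acknowledge(incident: Incident) -> None:
    """Operator-side step: must be triggered by a human, never by the agent."""
    if incident.escalation_recommended is None:
        raise ValueError("no escalation is pending")
    incident.operator_acked = True


def promote(incident: Incident) -> None:
    """Re-post to the higher-severity channel, only after acknowledgement."""
    if incident.escalation_recommended is None or not incident.operator_acked:
        raise EscalationNotAcknowledged(incident.incident_id)
    target = incident.escalation_recommended
    incident.sev = target
    incident.escalation_recommended = None
    incident.operator_acked = False
    incident.posts.append(
        (f"#raxx-ops-alert-{target}", "re-posted after operator acknowledgement")
    )
```

Keeping promote as the only code path that writes to a higher-severity channel makes the invariant testable: calling it before acknowledge raises instead of posting.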

PagerDuty: Deferred to v1.1. Pre-launch, Slack mobile push override on SEV1/SEV2 channels handles wake-up. PagerDuty Solo ($21/mo) is adopted only if a real after-hours SEV1 is missed.

Rationale:

  - Explicit channel-per-severity means Slack notification overrides can be configured per-channel in the Slack app, not per-message. This is more reliable than per-message push flags.
  - SEV2 vs SEV2.5 split eliminates disruptive mobile push during business hours while maintaining it for off-hours incidents when the operator may be asleep.
  - SEV3 autonomous posture removes human-ping noise for medium/low incidents, which at v1 scale are expected to be the majority of alert volume.
  - PagerDuty adds cost and integration overhead that is not justified until a real missed wake-up demonstrates the gap.

Alternatives considered:

  - Single #ops-alerts channel with message-level priority flags: Simpler channel structure but Slack notification overrides are channel-scoped; you cannot reliably trigger sound overrides per-message. Rejected.
  - Three channels (alerts/triage/digest) as originally proposed: Does not cleanly encode the business-hours split for SEV2 vs SEV2.5. Rejected in favour of the four-channel model.
  - PagerDuty at launch: Added cost before need is demonstrated. Deferred.

Consequences:

  - The four channels (#raxx-ops-alert-SEV1, #raxx-ops-alert-SEV2, #raxx-ops-alert-SEV2.5, #raxx-ops-alert-SEV3) must exist in the Slack workspace before SC-E2 can be claimed.
  - EyeTok's channel-routing logic (SC-E3) must implement a runtime ET offset resolver. Hardcoding a single UTC window is rejected because it would silently mis-route during DST transitions.
  - The incident_state schema gains a sev_channel column to record which channel an incident was routed to (SC-E7).
  - Console severity badges (SC-E11) update from generic critical/high/medium/low to SEV1/SEV2/SEV2.5/SEV3 labels.
  - Sentry native Slack integration (SC-E2 zero-code baseline) targets #raxx-ops-alert-SEV1 and #raxx-ops-alert-SEV2; the business-hours SEV2 vs SEV2.5 split is handled by EyeTok (SC-E3), not by Sentry's native integration.


Decision 5: FLAG_CONSOLE_INCIDENTS gated at launch (locked 2026-05-15 UTC)

Chosen: /console/incidents is behind FLAG_CONSOLE_INCIDENTS, default off. Flip on once EyeTok is verified live in prod.

Rationale:

  - Consistent with the platform posture of hiding unavailable features rather than greying them out.
  - EyeTok must be verified in production before the console surface is meaningful. Shipping the page without a live EyeTok backend would show an empty or erroring state.
  - Follows the new-flag-needs-b1-migration-same-pr convention: the SC-E10 PR that ships the incidents console page must include a B1 promotion migration for FLAG_CONSOLE_INCIDENTS in the same PR.

Consequences:

  - SC-E10 is blocked on both the EyeTok stub endpoint (SC-E3) and the flag + migration landing in the same PR.
  - Default-off means no operator action needed at deploy time; the flag flip is a deliberate post-verification step.
  - Codename EyeTok locked 2026-05-15 UTC (phonetic callback to ITOC, operator's first job).


Severity mapping (canonical)

The following mapping is canonical. Console, EyeTok, and FreeScout must all use the same severity labels.

| Sentry level | Scope | Mapped severity | SEV channel |
|---|---|---|---|
| fatal | any | critical | SEV1 |
| error (new issue, unhandled, production) | >0 users affected | high | SEV2 or SEV2.5 (business-hours split) |
| error (regression or performance alert) |  | medium | SEV3 |
| warning |  | low | SEV3 |
| info / debug |  | info | audit log only |
| H12 timeout (via Sentry log integration) |  | high | SEV2 or SEV2.5 |
| Paper gate breach (custom event) |  | critical | SEV1 |
| Synthetic check failure |  | high | SEV2 or SEV2.5 |

This table is the source of truth. Any deviation in Console badge colours, FreeScout tags, or Slack channel routing is a bug.
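
One way to keep Console, EyeTok, and FreeScout aligned is to encode the mapping once as data and have each consumer read from it. The following is a minimal sketch of that idea. The signal keys and function name are hypothetical identifiers invented for illustration; only the severity values and SEV labels come from the table.

```python
# Keys are hypothetical signal identifiers; values are the mapped severities
# from the canonical table.
SEVERITY_BY_SIGNAL = {
    "sentry.fatal": "critical",
    "sentry.error.new_unhandled_production": "high",
    "sentry.error.regression_or_performance": "medium",
    "sentry.warning": "low",
    "sentry.info": "info",
    "sentry.debug": "info",
    "h12_timeout": "high",
    "paper_gate_breach": "critical",
    "synthetic_check_failure": "high",
}

# An empty tuple means "audit log only". For "high", the choice between
# SEV2 and SEV2.5 depends on the ET business-hours split at routing time.
SEV_LABELS_BY_SEVERITY = {
    "critical": ("SEV1",),
    "high": ("SEV2", "SEV2.5"),
    "medium": ("SEV3",),
    "low": ("SEV3",),
    "info": (),
}


def sev_labels_for(signal: str) -> tuple[str, ...]:
    """Return the SEV labels a signal may route to, per the canonical table."""
    try:
        severity = SEVERITY_BY_SIGNAL[signal]
    except KeyError:
        # Unmapped signals fail loudly rather than defaulting to a severity.
        raise ValueError(f"unmapped signal: {signal!r}") from None
    return SEV_LABELS_BY_SEVERITY[severity]
```

Raising on an unmapped signal, rather than defaulting to a severity, turns any drift from the table into a visible error.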