Raxx On-Call Cloud Agent + Alerting Architecture
Status: Proposed
Date: 2026-05-15 UTC
Last updated: 2026-05-15 UTC (third amend — codename Watchpost → EyeTok; FLAG_CONSOLE_INCIDENTS locked)
Author: raxx-software-architect
Request source: Operator morning standup 2026-05-15 UTC — Kristerpher Henderson
Companion ADR: docs/architecture/adr/0091-alerting-source-and-oncall-agent.md
Launch target: 2026-05-23 UTC (T-8 days at time of writing)
Operator-confirmed decisions (2026-05-15 UTC)
The following decisions were locked by Kristerpher Henderson after initial design review. They supersede any earlier proposal in this document.
- Severity-based channel routing (4 channels). Slack channel assignment is strictly severity-driven. The original `#ops-alerts` / `#ops-triage` / `#ops-digest` proposal is replaced by four dedicated channels (see Section 6).
- ET 08:00–18:30 business hours define SEV2 vs SEV2.5 split. High-severity alerts during business hours post to `#raxx-ops-alert-sev2-5` (no mobile-push override); outside business hours they post to `#raxx-ops-alert-sev2` (mobile push + call-out). UTC translations: EDT window (Mar–Nov) is 12:00–22:30 UTC weekdays; EST window (Nov–Mar) is 13:00–23:30 UTC weekdays. Weekends always route to SEV2 regardless of time.
- PagerDuty deferred to v1.1. Pre-launch: Sentry → Slack with channel-specific mobile push override for SEV1/SEV2. Operator's phone push handles wake-up. PagerDuty Solo ($21/mo) is added only if a real after-hours SEV1 is missed.
- SEV3 = agent autonomous, but cannot self-escalate. Agent triages and remediates medium/low incidents without human ping. If the agent determines mid-triage that the incident is worse than scored, it posts `escalation_recommended: SEV2` to the SEV3 channel and waits for operator ack; it does not self-promote.
- FreeScout mailbox confirmed: `ops@raxx.app`. No new mailbox needed. On-call tickets land in the existing `ops@raxx.app` mailbox. Unblocks SC-E5.
- Slack channel IDs confirmed. All four channels exist in the MooseQuest workspace. IDs locked 2026-05-15 UTC (see Section 6).
- Codename: EyeTok. The on-call cloud agent service is codenamed EyeTok (phonetic callback to ITOC, operator's first job). Locked 2026-05-15 UTC. Service identifier: `eyetok` (lowercase) for Heroku app names, vault paths, env vars, and log fields.
- `FLAG_CONSOLE_INCIDENTS` = gated at launch. `/console/incidents` is gated behind `FLAG_CONSOLE_INCIDENTS` (default off). Flip on once EyeTok is verified live in prod. The SC-E10 PR must include a B1 promotion migration for this flag in the same PR per the new-flag-needs-b1-migration-same-pr convention.
1. Context
Kristerpher needs a tireless first-responder between himself and the noise of a live trading platform. The on-call cloud agent is that responder: it watches for incident-grade signals, reads existing runbooks, posts structured triage to Slack, and keeps a ticket trail in FreeScout — so the operator wakes to context, not chaos.
Scope of this document:
- Alerting source selection (Sentry vs Splunk vs layered)
- Cloud agent runtime, identity, and tool surface
- Slack engagement protocol and quiet-hour posture
- SOP/runbook integration
- FreeScout ticket lifecycle
- Console severity wiring (/console/incidents)
- Pre-launch MVP boundary vs post-launch phases
Not in scope: customer-facing alerting, market-data SLA monitoring, financial audit events (those flow through the existing KMS hash-chain audit trail per ADR-0022).
2. Invariants
The following constraints are non-negotiable for every design choice in this document.
- No stored credentials. The agent identity holds tokens via the Anthropic Managed Agents Vault or Infisical. It never persists credentials in code, config files, or agent memory.
- Audit trail for every state change. Every agent decision (triage conclusion, proposed action, escalation) writes a structured log row linked to the incident ticket.
- Agent does not autonomously execute customer-facing actions. Per the platform's deterministic-execution posture, the agent proposes; the operator decides. The agent may read, post, and create tickets — it never fires orders, modifies RBAC, or changes live-trading flags without explicit operator confirmation.
- GDPR by default. If the agent processes user PII (e.g., a log line containing an email address), it must redact before storing or posting to Slack. Runbook context is not PII-bearing; log snippets shared in Slack must be sanitised at the ingestion step.
- Digest posture pre-launch. New low/medium alert paths are digest-batched until 2026-05-23 UTC per the existing notification posture (`docs/architecture/ci-notification-posture.md`). Critical/High fire immediately, always.
- Secrets in vault, not files. Agent credentials for Slack, Sentry, FreeScout, and GitHub sit in the Anthropic Managed Agents Vault (reference: `reference_claude_secrets_vault.md`). None shipped in this repo.
- Paper-first gating is not the agent's concern. The agent observes signals and proposes; it does not bypass or override paper-profitable gates.
3. Alerting Source Decision
3.1 Candidates
Sentry (provisioned, active): application-layer error and performance capture across Raptor, Antlers, Console, and Queue. Captures exceptions, slow transactions, and custom events. Has native Slack integration and webhook-based alert rules. Latency from event to alert: typically 15–60 seconds. Already receiving live events as of Queue DSN wiring (#2142).
Splunk (operator-run EC2 instance): log-aggregation and search. Strong query language (SPL), supports threshold and anomaly alerts. Covers infrastructure-layer signals that Sentry cannot: disk, memory, network, H12s outside Sentry SDK scope. Alert delivery requires either Splunk Cloud (paid) or a self-hosted alerting scheduler (webhook action from Splunk saved searches). Latency from event to alert: 1–5 minutes depending on scheduled search frequency.
3.2 Coverage comparison
| Signal class | Sentry | Splunk |
|---|---|---|
| Unhandled exceptions (Raptor / Queue / Console) | Yes — primary | Via log forwarding only |
| Slow transactions / P99 degradation | Yes (performance) | With metric extraction |
| H12 request timeout (Heroku dyno) | Partial (if SDK captures) | Yes (router log) |
| 5xx rate spike | Yes (issue alert) | Yes (SPL threshold) |
| Memory / CPU / disk on EC2 | No | Yes |
| Custom business events (e.g., paper gate breach) | Yes (custom capture) | Yes (log event) |
| Infra cold-start / deploy smoke | No | Yes |
| Log-level patterns (WARN flood, audit gap) | No | Yes |
3.3 Recommendation: Sentry-primary, Splunk-secondary (deferred)
Decision: Sentry is the v1 alerting source. Splunk is a valid secondary layer for infrastructure-level signals but adds operational burden (Splunk forwarder config, saved-search maintenance, EC2 uptime dependency) that is disproportionate to v1 scale (~10 users). Sentry covers the classes of signals most likely to cause user-visible incidents at launch. Splunk can be added in v1.1 if infrastructure-layer blind spots materialise.
Pre-launch readiness: Sentry is already wired. Slack integration is one toggle in the Sentry project settings + one alert rule. It can be live in hours. A Splunk forwarder pipeline would take 1–3 days to design and test correctly.
Cost: Sentry Team plan is ~$26/mo for the first 10K errors/mo. At v1 scale this is sufficient. Splunk self-hosted on EC2 costs only the instance time (already paid) but Splunk Cloud alerting adds ~$150+/mo — not justified at v1.
3.4 Severity mapping
| Source signal | Mapped severity | Slack channel | Ticket created? |
|---|---|---|---|
| Sentry: CRITICAL issue (level=fatal, affecting >10% of sessions) | `severity:critical` | `#raxx-ops-alert-sev1` (immediate) | Yes (P1) |
| Sentry: HIGH issue (unhandled exception, production, new) | `severity:high` | `#raxx-ops-alert-sev2` or `#raxx-ops-alert-sev2-5` (business-hours split, see Section 6) | Yes (P2) |
| Sentry: MEDIUM (regression, performance degraded, spike) | `severity:medium` | `#raxx-ops-alert-sev3` | Yes (P3) |
| Sentry: LOW (warning-level, non-regression) | `severity:low` | `#raxx-ops-alert-sev3` (batched) | Optional |
| Sentry: INFO (deploy markers, synthetic pass) | `severity:info` | agent audit log only | No |
| Heroku H12 (via Sentry log integration) | `severity:high` | `#raxx-ops-alert-sev2` / `sev2-5` | Yes (P2) |
| Paper gate breach (custom Sentry event) | `severity:critical` | `#raxx-ops-alert-sev1` | Yes (P1) |
| Synthetic check failure | `severity:high` | `#raxx-ops-alert-sev2` / `sev2-5` | Yes (P2) |
Severity badges in Console mirror this table exactly (see Section 8.3).
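The mapping above could be held in one lookup table so that routing and ticketing read from the same source. The following is a minimal sketch under assumptions: `Policy`, `SEVERITY_POLICY`, and `policy_for` are hypothetical names, and the business-hours choice between SEV2 and SEV2.5 for `severity:high` is deliberately left to a separate resolver.

```python
from typing import NamedTuple, Optional


class Policy(NamedTuple):
    channel: Optional[str]          # Slack channel name, or None for audit-log-only
    ticket_priority: Optional[str]  # ticket priority label, or None for no ticket


# Mirrors the severity mapping table. All names here are illustrative only.
# "high" lists the off-hours channel; the business-hours variant is picked elsewhere.
SEVERITY_POLICY = {
    "critical": Policy("#raxx-ops-alert-sev1", "P1"),
    "high": Policy("#raxx-ops-alert-sev2", "P2"),
    "medium": Policy("#raxx-ops-alert-sev3", "P3"),
    "low": Policy("#raxx-ops-alert-sev3", None),  # ticket is optional for low
    "info": Policy(None, None),                    # agent audit log only
}


def policy_for(severity: str) -> Policy:
    """Return the routing policy for a severity label, failing loudly on unknown labels."""
    try:
        return SEVERITY_POLICY[severity]
    except KeyError:
        raise ValueError(f"unknown severity: {severity!r}") from None
```

Raising on an unknown label, rather than defaulting to a quiet channel, is one way to keep a mislabelled alert from being silently dropped.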
4. Cloud Agent Architecture
4.1 Runtime location
Three options considered:
Option A — EC2 long-running process (operator's existing infra): A small Python process polls or listens for webhooks, calls the Claude API, and manages conversation state. Cheapest infra ($0 incremental if sharing Splunk EC2). Requires the operator to maintain the process (systemd, restarts, log rotation). On-call agent going down = no triage coverage. Risk: single point of failure at the worst time.
Option B — Anthropic cloud-hosted agent (Claude.ai managed sessions): Agent sessions run in Anthropic infrastructure. No infra to maintain. But as of 2026-05-15 UTC, the "Claude menu" / live SDK sessions capability is future work (per project_claude_menu_current_vs_future.md). Not available for v1.
Option C — Heroku worker dyno (existing Heroku estate): A small Python/Flask app deployed as a worker dyno on a new micro-app raxx-eyetok-prod. Receives webhooks from Sentry, calls Claude API, posts to Slack, creates FreeScout tickets. Heroku Standard dyno = ~$25/mo. Restarts on crash. No additional EC2 management. Heroku handles TLS for the webhook endpoint.
Decision: Option C — Heroku worker dyno. Integrates naturally with the existing Heroku estate. Webhook endpoint is a proper HTTPS URL Sentry can target. Crash = Heroku restarts within seconds. No new EC2 to manage. Agent identity and credentials follow the same vault pattern as other services.
The dyno app is codenamed EyeTok (phonetic callback to ITOC, operator's first job — locked 2026-05-15 UTC).
4.2 Agent identity and credentials
EyeTok holds the following credentials, all sourced from Infisical at dyno startup (never committed to the repo):
| Credential | Purpose | Vault path |
|---|---|---|
| `ANTHROPIC_API_KEY` | Claude API calls | `/Raxx/Eyetok/ANTHROPIC_API_KEY` |
| `SENTRY_WEBHOOK_SECRET` | Validate inbound Sentry webhooks (HMAC) | `/Raxx/Eyetok/SENTRY_WEBHOOK_SECRET` |
| `SLACK_BOT_TOKEN` | Post to Slack channels, reply in threads | `/Raxx/Eyetok/SLACK_BOT_TOKEN` |
| `SLACK_SIGNING_SECRET` | Validate inbound Slack events (operator replies) | `/Raxx/Eyetok/SLACK_SIGNING_SECRET` |
| `FREESCOUT_API_KEY` | Create / update tickets at tickets.raxx.app | `/Raxx/Eyetok/FREESCOUT_API_KEY` |
| `GITHUB_APP_TOKEN` | Read runbooks, file issues (via raxx-dev-bot) | `/Raxx/Eyetok/GITHUB_APP_TOKEN` |
Credentials are rotatable without redeploy (Infisical push → Heroku config:set → dyno restart). The ANTHROPIC_API_KEY rotation procedure is: new key in vault → heroku config:set ANTHROPIC_API_KEY=... -a raxx-eyetok-prod >/dev/null 2>&1 → old key revoked at Anthropic console.
4.3 Trigger model
sequenceDiagram
participant Sentry
participant EyeTok as EyeTok (raxx-eyetok-prod dyno)
participant Claude as Claude API
participant Slack
participant FreeScout
Sentry->>EyeTok: POST /webhooks/sentry (HMAC-signed)
EyeTok->>EyeTok: validate HMAC, deduplicate (issue_id)
EyeTok->>EyeTok: fetch matching runbook from docs/ops/runbooks/
EyeTok->>Claude: invoke agent with alert context + runbook
Claude-->>EyeTok: triage summary + proposed actions
EyeTok->>Slack: post to SEV channel (severity + business-hours-routed)
EyeTok->>FreeScout: create ticket (P1/P2/P3)
EyeTok->>EyeTok: store incident_state{issue_id, thread_ts, ticket_id}
Slack->>EyeTok: POST /webhooks/slack (operator reply in thread)
EyeTok->>Claude: continue conversation with operator message
Claude-->>EyeTok: follow-up response
EyeTok->>Slack: reply in thread
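The HMAC validation step at the top of the sequence can be sketched as follows. This is an illustration under assumptions, not the shipped handler: it assumes the signature arrives as a hex-encoded HMAC-SHA256 of the raw request body keyed with `SENTRY_WEBHOOK_SECRET`, and the function name `is_valid_signature` is hypothetical. The exact header name and digest should be confirmed against Sentry's webhook documentation before relying on this.

```python
import hashlib
import hmac


def is_valid_signature(secret: str, raw_body: bytes, received_hex: str) -> bool:
    """Check a received hex digest against HMAC-SHA256(secret, raw_body)."""
    expected = hmac.new(secret.encode("utf-8"), raw_body, hashlib.sha256).hexdigest()
    # compare_digest takes time independent of where the inputs differ,
    # which avoids leaking how much of the digest matched.
    return hmac.compare_digest(expected.encode("utf-8"), received_hex.encode("utf-8"))
```

The digest must be computed over the raw bytes as received; re-serialising parsed JSON can change whitespace or key order and break the comparison.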
Deduplication: EyeTok maintains an in-memory + Postgres-backed incident_state table keyed on Sentry issue_id. Re-fires within a 30-minute window do not spawn a new thread or ticket. They post a "re-fired (N)" update to the existing thread.
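A minimal in-memory sketch of the 30-minute re-fire window is shown below. It assumes the window slides from the most recent fire (the text does not say whether it is measured from the first or the latest fire), and the `Deduplicator` class is a hypothetical stand-in for the Postgres-backed lookup.

```python
from datetime import datetime, timedelta
from typing import Dict, Tuple

REFIRE_WINDOW = timedelta(minutes=30)


class Deduplicator:
    """In-memory stand-in for the incident_state lookup keyed on Sentry issue_id."""

    def __init__(self) -> None:
        # issue_id -> (time of the most recent fire, re-fire count so far)
        self._seen: Dict[str, Tuple[datetime, int]] = {}

    def classify(self, issue_id: str, now: datetime) -> Tuple[str, int]:
        """Return ("new", 0) for a fresh incident or ("refire", N) inside the window."""
        previous = self._seen.get(issue_id)
        if previous is not None and now - previous[0] <= REFIRE_WINDOW:
            count = previous[1] + 1
            self._seen[issue_id] = (now, count)
            return ("refire", count)
        # First sighting, or the window has lapsed: treat as a new incident.
        self._seen[issue_id] = (now, 0)
        return ("new", 0)
```

A `("refire", N)` result would map to the "re-fired (N)" thread update, while `("new", 0)` would open a thread and ticket.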
Low-severity batching: For severity:low and severity:info, EyeTok accumulates events in Postgres. SEV3 (severity:low) batches are included in the agent's SEV3 autonomous triage pass. severity:info is written to the agent audit log only.
4.4 Tool surface (MCPs)
The Claude agent inside EyeTok has access to the following tools:
| Tool | MCP | What agent uses it for |
|---|---|---|
| GitHub file read | GitHub MCP (raxx-dev-bot) | Read runbooks at docs/ops/runbooks/ on each incident |
| Slack post / reply | Slack MCP | Post alerts, reply in threads |
| FreeScout ticket CRUD | REST calls (no MCP) | Create, update, close incident tickets |
| Sentry issue query | Sentry REST API | Fetch additional context (stack trace, affected users) |
| Console incidents write | Internal REST endpoint | Sync incident state to /console/incidents |
FreeScout has no known MCP; the agent calls the FreeScout REST API directly using the API key (per project_freescout_api_key.md). Note the UI-only limitations (project_freescout_api_limits.md): auto-reply templates are UI-only and cannot be triggered via API. Tickets are created via POST to the conversations endpoint.
4.5 Memory and state
The agent does not carry cross-incident memory. Each incident is a fresh Claude invocation with the alert context + runbook text + conversation history for that Slack thread. This is intentional: cross-incident contamination would reduce triage precision.
Thread history (the running Slack conversation) is passed as context on each Slack reply, bounded to the last 20 exchanges to control token cost.
incident_state table (EyeTok Postgres):
| Column | Type | Notes |
|---|---|---|
| `incident_id` | UUID PK | Generated on first fire |
| `sentry_issue_id` | TEXT UNIQUE | Deduplication key |
| `severity` | TEXT | critical/high/medium/low/info |
| `sev_channel` | TEXT | SEV1/SEV2/SEV2.5/SEV3; actual channel routed at fire time |
| `slack_thread_ts` | TEXT | Slack thread timestamp for replies |
| `slack_channel_id` | TEXT | Slack channel ID (see Section 6.1) |
| `freescout_ticket_id` | INT | |
| `status` | TEXT | open/acknowledged/in_progress/resolved/escalated/escalation_recommended |
| `agent_transcript` | JSONB | Array of {role, content} pairs for audit |
| `created_at` | TIMESTAMPTZ | |
| `resolved_at` | TIMESTAMPTZ | Nullable |
| `runbook_used` | TEXT | Filename of matched runbook, or null |
4.6 Cost estimate
| Item | v1 (~10 users) | v1.1 (~100 users) |
|---|---|---|
| Heroku Standard-1X dyno (`raxx-eyetok-prod`) | ~$25/mo | ~$25/mo |
| Heroku Postgres (Mini) | ~$5/mo | ~$5/mo (same scale) |
| Claude API tokens | ~$5–15/mo (est. 5–20 incidents/day × 2K tokens each) | ~$20–50/mo |
| Sentry Team plan | ~$26/mo | ~$26/mo |
| PagerDuty Solo | deferred to v1.1 | ~$21/mo (if needed) |
| Total | ~$61–71/mo | ~$97–127/mo |
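Token cost assumes average incident = 1 runbook read (~500 tokens) + 1 triage response (~1K tokens) + 2–3 operator exchanges (~500 tokens each), which sums to roughly 2.5K–3K tokens per incident, slightly above the ~2K planning figure in the table. Worst case at v1 is ~$30/mo if incident rate spikes.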
5. SOP Integration
Runbooks live at docs/ops/runbooks/. The agent reads them at incident time, not on cold-start (cold-start read would bloat context unnecessarily).
5.1 Runbook matching
Each runbook file must include a YAML frontmatter block:
---
incident_class: [h12_timeout, sentry_fatal, paper_gate_breach, db_connection_pool, ...]
component: [raptor, antlers, queue, console, velvet]
severity_floor: high
---
EyeTok normalises the Sentry alert into {incident_class, component} and fetches the matching runbook via the GitHub API. If multiple runbooks match, all are included (ordered by severity_floor descending). If zero match, the agent proceeds with a "no runbook found" preamble and proposes a draft runbook as part of its response.
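The matching rule described above can be sketched as follows. This assumes the frontmatter has already been parsed into dictionaries, and that `severity_floor` values rank in the order critical > high > medium > low > info. The function name, the `filename` key, and the sample filenames in any usage are hypothetical.

```python
from typing import Dict, List

# Assumed ranking for severity_floor values; a higher rank sorts first.
SEVERITY_RANK = {"critical": 4, "high": 3, "medium": 2, "low": 1, "info": 0}


def match_runbooks(runbooks: List[Dict], incident_class: str, component: str) -> List[str]:
    """Return filenames of runbooks listing both keys, highest severity_floor first."""
    hits = [
        rb for rb in runbooks
        if incident_class in rb["incident_class"] and component in rb["component"]
    ]
    # Unknown floors rank below "info" so they land at the end of the list.
    hits.sort(key=lambda rb: SEVERITY_RANK.get(rb["severity_floor"], -1), reverse=True)
    return [rb["filename"] for rb in hits]
```

An empty result corresponds to the "no runbook found" path, where the agent proceeds on the raw alert and proposes a draft.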
5.2 No-runbook path
When the agent finds no matching runbook:
1. Posts triage based on the raw alert signal.
2. Appends: "No matching runbook found for {incident_class}. Proposed draft runbook attached."
3. Files a separate GitHub issue tagged runbook-needed against the relevant component.
The operator's confirmation in-thread to accept the draft runbook triggers a PR from feature-developer. This closes the loop from incident → SOP.
5.3 SOP authorship loop
graph LR
Incident -->|no runbook matched| EyeTokDraft[Agent proposes draft]
EyeTokDraft -->|operator approves in thread| GHIssue[GH Issue: runbook-needed]
GHIssue -->|card-groomer grooms| FeatureDev[feature-developer writes runbook]
FeatureDev -->|PR merged| RunbookFile[docs/ops/runbooks/*.md]
RunbookFile -->|next similar incident| EyeTok[Agent reads runbook]
6. Slack Engagement Protocol
6.1 Channel structure (locked 2026-05-15 UTC)
Four dedicated channels replace the original #ops-alerts / #ops-triage / #ops-digest proposal. Channel assignment is determined at fire time by the combination of severity and business-hours window.
Channel naming note: Slack normalises channel names to lowercase with hyphens only — capital letters are lowercased and periods are converted to hyphens. The SEV identifier (SEV1, SEV2, SEV2.5, SEV3) remains unchanged everywhere in code, docs, and discussions. Only the Slack channel string is affected by Slack's normalisation (e.g., SEV2.5 → channel name sev2-5). The sev_channel column in incident_state stores the canonical SEV identifier, not the Slack channel string.
| SEV identifier | Slack channel name | Slack channel ID | Mapped severity | Agent behavior | Notify posture |
|---|---|---|---|---|---|
| SEV1 | `#raxx-ops-alert-sev1` | `C0B423M38H4` | `severity:critical` | Post immediately, link to Emergency SOP, wait for human ack; no autonomous action | Slack mobile push + sound override; v1.1: PagerDuty page |
| SEV2 | `#raxx-ops-alert-sev2` | `C0B445RU95Y` | `severity:high` OUTSIDE ET business hours | Post + tag operator + propose remediation. Wait for human approval before any change | Slack mobile push (call-out) |
| SEV2.5 | `#raxx-ops-alert-sev2-5` | `C0B4611UC2V` | `severity:high` DURING ET business hours | Same as SEV2 but no mobile-push override (operator at desk) | Slack normal notification |
| SEV3 | `#raxx-ops-alert-sev3` | `C0B4615LQ49` | `severity:medium` + `severity:low` | Agent triages + remediates autonomously. Posts outcome only. No human ping. | Agent audit log only |
6.2 ET business hours window
The SEV2 / SEV2.5 split is governed by Eastern Time (ET) business hours:
- 08:00–18:30 ET weekdays = business hours → route `severity:high` to `#raxx-ops-alert-sev2-5`
- All other times (nights, weekends) = outside business hours → route `severity:high` to `#raxx-ops-alert-sev2`

UTC translations:
- EDT (approx. Mar–Nov): business hours = 12:00–22:30 UTC weekdays
- EST (approx. Nov–Mar): business hours = 13:00–23:30 UTC weekdays
- Weekend: always SEV2 regardless of wall-clock time
EyeTok must resolve the current ET offset at runtime (system timezone or a tzdata lookup) rather than hardcoding one UTC window, to handle the EDT/EST transition automatically.
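One way to resolve the ET offset at runtime is the standard-library `zoneinfo` module, which follows the EDT/EST transition through tzdata. The sketch below is illustrative: `sev_for_high` is a hypothetical name, the input must be a timezone-aware datetime, and treating 18:30 as the first minute outside business hours is an assumption the text does not settle.

```python
from datetime import datetime, time, timezone
from zoneinfo import ZoneInfo

EASTERN = ZoneInfo("America/New_York")
BUSINESS_START = time(8, 0)
BUSINESS_END = time(18, 30)  # assumed exclusive: 18:30:00 is already off-hours


def sev_for_high(now_utc: datetime) -> str:
    """Return "SEV2.5" during ET weekday business hours, otherwise "SEV2"."""
    local = now_utc.astimezone(EASTERN)
    is_weekday = local.weekday() < 5  # Monday is 0, Sunday is 6
    in_window = BUSINESS_START <= local.time() < BUSINESS_END
    return "SEV2.5" if is_weekday and in_window else "SEV2"


# Example: Friday 2026-05-15 12:00 UTC is 08:00 EDT, so this routes to SEV2.5.
example = sev_for_high(datetime(2026, 5, 15, 12, 0, tzinfo=timezone.utc))
```

Because the weekday check uses the Eastern local date, a Saturday 01:00 UTC alert still counts as Friday evening in ET and is judged by Friday's window.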
6.3 SEV3 escalation invariant
The agent CANNOT autonomously promote a SEV3 incident to SEV2 or SEV1. If mid-triage the agent determines the incident is more severe than initially scored:
- It posts its updated assessment to the SEV3 channel with field `escalation_recommended: SEV2` (or SEV1).
- It waits for operator acknowledgement of the promotion in-thread.
- Only after operator ack does EyeTok re-post to the appropriate SEV2/SEV1 channel and create the escalated ticket.
This invariant is enforced at the EyeTok application layer, not left to agent discretion.
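An application-layer guard of this kind might separate the agent's path (record a recommendation) from the operator's path (apply it). The sketch below is a simplified model under assumptions: the `Incident` class and both function names are hypothetical, and the real state would live in `incident_state` rather than in memory.

```python
from dataclasses import dataclass

SEV_ORDER = ["SEV3", "SEV2.5", "SEV2", "SEV1"]  # least to most severe


@dataclass
class Incident:
    sev_channel: str
    status: str = "open"
    recommended: str = ""


def recommend_escalation(incident: Incident, target: str) -> None:
    """Agent path: record a recommendation without changing the routed channel."""
    if SEV_ORDER.index(target) <= SEV_ORDER.index(incident.sev_channel):
        raise ValueError("target must be more severe than the current SEV")
    incident.status = "escalation_recommended"
    incident.recommended = target


def promote(incident: Incident, operator_ack: bool) -> None:
    """Operator path: apply a pending recommendation only after an explicit ack."""
    if incident.status != "escalation_recommended":
        raise RuntimeError("no pending escalation recommendation")
    if not operator_ack:
        raise PermissionError("promotion requires operator acknowledgement")
    incident.sev_channel = incident.recommended
    incident.status = "escalated"
    incident.recommended = ""
```

Keeping `promote` out of the agent's tool surface is what makes the invariant structural rather than a matter of prompt compliance.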
6.4 PagerDuty deferral
PagerDuty is deferred to v1.1. Pre-launch, the SEV1 and SEV2 mobile push override on the dedicated Slack channels is the wake-up mechanism. Rationale: the operator's Slack phone push meets the practical requirement, and adding PagerDuty Solo ($21/mo) before a real missed SEV1 has demonstrated the need would be premature overhead.
Trigger for PagerDuty adoption: a real after-hours SEV1 incident is missed because Slack push was silenced or ignored. At that point the operator adds PagerDuty without a design change — the EyeTok posting path is already SEV-channel-structured and compatible.
6.5 Alert message structure
[SEV1 / CRITICAL] Raxx — Unhandled exception in Queue (10 users affected)
Sentry: <https://sentry.io/...>
Ticket: <https://tickets.raxx.app/conversations/...>
Triage summary: Queue worker is throwing NullPointerException in the billing
settlement path after the Stripe webhook arrives with a missing metadata field.
Matches runbook: `queue-stripe-webhook-error.md`
Proposed action:
1. Confirm Stripe test-mode webhook is not misconfigured.
2. Restart Queue worker dyno (`heroku restart -a raxx-queue-prod`).
3. If restart does not clear: roll back to prior deploy.
Reply in this thread to engage agent.
Status updates posted to the same thread as the conversation progresses:
- [ACKNOWLEDGED] — operator first reply
- [IN PROGRESS] — operator has acted
- [RESOLVED] — operator marks resolved; ticket closes
- [ESCALATED] — no response in 15 min for SEV1, 60 min for SEV2/2.5
- [ESCALATION RECOMMENDED → SEV2] — agent-flagged mid-triage upgrade, awaiting operator ack
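The `[ESCALATED]` timeouts above could be checked with a small helper like the one below. This is a sketch under assumptions: the name `escalation_due` is hypothetical, "no response" is modelled as a boolean `acknowledged` flag, and SEV3 has no timer because it is outcome-only.

```python
from datetime import datetime, timedelta

# Timeouts taken from the status list: 15 min for SEV1, 60 min for SEV2 and SEV2.5.
ESCALATION_TIMEOUT = {
    "SEV1": timedelta(minutes=15),
    "SEV2": timedelta(minutes=60),
    "SEV2.5": timedelta(minutes=60),
}


def escalation_due(sev: str, posted_at: datetime, now: datetime, acknowledged: bool) -> bool:
    """Return True once an unacknowledged alert has outlived its SEV timeout."""
    timeout = ESCALATION_TIMEOUT.get(sev)
    if timeout is None or acknowledged:
        return False
    return now - posted_at >= timeout
```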
6.6 Operator engagement
The operator replies in the thread. EyeTok's Slack Events API subscription routes thread replies on #raxx-ops-alert-sev1, #raxx-ops-alert-sev2, and #raxx-ops-alert-sev2-5 to the Claude agent continuation. #raxx-ops-alert-sev3 posts are outcome-only; the operator can reply but it does not trigger agent continuation unless the operator explicitly asks.
7. Ticket Integration
Every incident at severity medium or above creates a FreeScout ticket. Low/info severity does not create tickets (digest-only).
Ticket fields:
| FreeScout field | Value |
|---|---|
| Subject | [{SEV_CHANNEL}] {Sentry title} |
| Mailbox | ops@raxx.app (confirmed — existing internal queue) |
| Tags | eyetok-agent, {severity}, {sev_channel}, {component} |
| Custom: `incident_severity` | maps to existing FreeScout custom field (runbook: `freescout-status-fields.md`) |
| Custom: `component_tag` | e.g. queue, raptor |
| Body | Agent triage summary + Slack thread link + Sentry link |
Ticket lifecycle:
- Created: on first Sentry webhook fire (severity >= medium)
- Updated: each agent status post (acknowledged, in progress, resolved) writes a FreeScout note via API
- Closed: when operator posts "resolved" in Slack thread, EyeTok closes the FreeScout ticket via API
Audit trail: incident_state.agent_transcript (JSONB) is the structured record of what the agent observed, proposed, and what the operator chose. This is retained for 90 days, consistent with audit log retention policy.
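The subject and tag conventions from the ticket fields table can be assembled with a small helper. The sketch below is illustrative only: `ticket_fields` is a hypothetical name, and the dictionary keys are placeholders that do not claim to match the FreeScout API's request schema.

```python
from typing import Dict


def ticket_fields(sev_channel: str, sentry_title: str, severity: str, component: str) -> Dict[str, object]:
    """Build the subject, mailbox, and tags described in the ticket fields table."""
    return {
        "subject": f"[{sev_channel}] {sentry_title}",
        "mailbox": "ops@raxx.app",
        "tags": ["eyetok-agent", severity, sev_channel, component],
    }
```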
8. Console Integration — Severity Wiring
Operator asked: "make sure that all gets wired into Console somehow."
8.1 /console/incidents page (v1 — read-only, gated behind FLAG_CONSOLE_INCIDENTS)
A new console page surfaces all open incidents. Gating: the page is behind FLAG_CONSOLE_INCIDENTS (default off). The SC-E10 PR must ship the flag + a B1 promotion migration in the same PR per the new-flag-needs-b1-migration-same-pr convention. The flag is flipped on once EyeTok is verified live in prod.
The page shows:
- Severity badge (SEV1 / SEV2 / SEV2.5 / SEV3; reuses existing badge component conventions from Console)
- Sentry issue link
- FreeScout ticket link
- Slack thread link
- Agent triage summary (truncated to 3 lines; expandable)
- Status chip: open / acknowledged / in-progress / resolved / escalated / escalation-recommended
Data source: EyeTok exposes a REST endpoint GET /api/incidents (authenticated via Console's internal service token). Console polls every 60 seconds.
8.2 Write-paths (post-v1)
In v1 the console page is read-only — acknowledge, escalate, resolve happen in Slack. Post-v1 the Console can add action buttons that POST back to EyeTok, which then propagates to Slack thread + FreeScout. This keeps the audit trail centrally owned by EyeTok.
8.3 Severity badge component
Reuses the severity badge convention from the existing status page (ADR-0030 state machine colours). Mapping:
| Severity / Channel | Badge colour | Label |
|---|---|---|
| SEV1 / critical | Red | SEV1 |
| SEV2 / high (off-hours) | Orange | SEV2 |
| SEV2.5 / high (business hours) | Amber | SEV2.5 |
| SEV3 / medium + low | Yellow | SEV3 |
| info | Grey | INFO |
9. Pre-Launch vs Post-Launch Staging
T-8 days to launch (2026-05-23 UTC).
Pre-launch MVP (ship by 2026-05-21 UTC)
- Sentry Slack integration enabled natively (toggle in Sentry UI, no code), with alert rules for fatal/error at `severity:critical` and `severity:high` → `#raxx-ops-alert-sev1` / `#raxx-ops-alert-sev2`. This is the zero-code baseline.
- EyeTok dyno scaffolded: accepts Sentry webhooks, routes to SEV channels (including business-hours SEV2 vs SEV2.5 split), creates FreeScout tickets at `ops@raxx.app`, posts to Slack with agent triage. No SOP reading yet.
- `incident_state` Postgres table provisioned (including `sev_channel` column).
- `/console/incidents` page: stub that shows a count of open P1/P2 tickets from FreeScout API. Full incident view is post-launch. Gated behind `FLAG_CONSOLE_INCIDENTS` (default off).
This MVP can be shipped in 3–4 days by a single feature-developer.
Post-launch v1.1 (within 2 weeks of launch)
- Runbook matching and SOP reading in agent context
- Full `/console/incidents` with SEV badges, Slack thread links, ticket links, agent transcript
- SEV3 autonomous triage + remediation fully active
- Draft runbook proposal on no-match
- PagerDuty Solo integration (triggered if a real after-hours SEV1 is missed)
Post-launch v1.2 (within 4 weeks of launch)
- SOP authorship loop (agent files GH issue on no-match)
- Console write-paths (ack/resolve from Console)
- Splunk secondary layer for infrastructure signals
- Escalation timer tuning based on real incident data
10. Migrations
The incident_state table and the agent_transcript JSONB column are new — no existing schema is affected. EyeTok is a new Heroku app; it does not touch Raptor, Queue, or Console databases directly.
Console gains a new page (/console/incidents) backed by an outbound HTTP call to EyeTok — no DB migration in Console.
Rollback: disable the EyeTok dyno (heroku scale worker=0 -a raxx-eyetok-prod). Sentry native Slack integration (the zero-code baseline) continues to fire. No data is lost.
11. Security Considerations
| Question | Answer |
|---|---|
| What PII does this collect? | Log excerpts from Sentry alerts may contain email addresses or user IDs. EyeTok must redact known PII patterns (email regex, UUID-to-user mapping) before posting to Slack or storing in agent_transcript. |
| Retention period? | incident_state rows retained 90 days. agent_transcript same. Purge job runs weekly. |
| DSR erasure path? | User incidents do not reference PII at the ticket level. If a user requests erasure and a ticket references their data, the agent_transcript JSONB is nulled for that row. Manual DSR handling runbook (docs/ops/runbooks/manual-dsr-handling.md) covers this. |
| Audit log? | Every agent action is appended to agent_transcript with timestamp, role, and content. Redacted before storage. |
| Credential replay risk? | No. Agent tokens are short-lived (Infisical dynamic secrets where possible) and never stored in the repo. |
| Breach notification? | If EyeTok itself is compromised, the FREESCOUT_API_KEY and SLACK_BOT_TOKEN are rotatable in <5 minutes via Infisical → heroku config:set. Breach notification follows existing GDPR breach runbook. |
| Secrets location? | All in Infisical under /Raxx/Eyetok/. Rotatable without redeploy. |
| Kill-switch? | heroku scale worker=0 -a raxx-eyetok-prod shuts down EyeTok in seconds. Sentry native Slack alerts continue as fallback. |
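The email-regex redaction named in the PII row above could start from something like the following. It is a deliberately simple sketch: the pattern favours over-redaction and is not a full address grammar, the names `EMAIL_RE` and `redact_emails` are hypothetical, and the UUID-to-user mapping mentioned in the table is out of scope here.

```python
import re

# Simple pattern chosen for illustration; it is not an RFC 5322 matcher.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_emails(text: str, placeholder: str = "[REDACTED_EMAIL]") -> str:
    """Replace every email-like substring with a fixed placeholder."""
    return EMAIL_RE.sub(placeholder, text)
```

Running the redaction once at ingestion, before either the Slack post or the transcript write, keeps both sinks consistent with a single code path.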
12. Open Questions (operator decision needed)
- ~~Slack workspace + channel names / IDs.~~ Resolved 2026-05-15 UTC. All four channels confirmed and created in the MooseQuest workspace. Channel IDs locked (see Section 6.1). SC-E2 is now unblocked.
- ~~Quiet-hour policy.~~ Resolved 2026-05-15 UTC. ET 08:00–18:30 weekdays. No explicit quiet-hour suppression: SEV1/SEV2 fire at all hours with mobile push. SEV3 is audit-log-only so quiet hours are moot for low/medium.
- Splunk forwarder scope: If infrastructure-layer signal gaps materialise post-launch, what EC2 instance does EyeTok target? Operator to provide Splunk instance details if/when v1.1 Splunk layer is activated. (Post-launch; no action needed before 2026-05-23 UTC.)
- ~~FreeScout mailbox for incidents.~~ Resolved 2026-05-15 UTC. Confirmed: `ops@raxx.app` (existing mailbox). No new mailbox required. SC-E5 is now unblocked.
- ~~Console incidents page gating.~~ Resolved 2026-05-15 UTC. `FLAG_CONSOLE_INCIDENTS` gated at launch (default off). SC-E10 PR must include B1 promotion migration in the same PR. Flip on once EyeTok is verified live in prod.
13. Sub-Cards for PM
| # | Title | Labels | Size | Dependencies |
|---|---|---|---|---|
| SC-E1 | Provision `raxx-eyetok-staging` / `raxx-eyetok-prod` Heroku dynos + Postgres + Infisical vault paths at `/Raxx/Eyetok/` | `infra`, `eyetok`, `p1-launch` | S (1 day) | None |
| SC-E2 | Sentry alert rules → Slack native integration targeting `#raxx-ops-alert-sev1` + `#raxx-ops-alert-sev2` (zero-code baseline) | `alerting`, `sentry`, `p1-launch` | XS (2 hrs) | Channels confirmed (done; IDs in Section 6.1) |
| SC-E3 | EyeTok: webhook receiver + HMAC validation + SEV channel routing (including ET business-hours SEV2 vs SEV2.5 split) | `eyetok`, `backend`, `p1-launch` | M (2 days) | SC-E1 |
| SC-E4 | EyeTok: Claude agent invocation + triage post to SEV channel; SEV3 escalation-recommended guard | `eyetok`, `ai`, `p1-launch` | M (2 days) | SC-E3 |
| SC-E5 | EyeTok: FreeScout ticket create/update/close lifecycle (mailbox: `ops@raxx.app`) | `eyetok`, `ticketing`, `p1-launch` | M (2 days) | SC-E1 |
| SC-E6 | EyeTok: Slack Events API subscription (operator thread reply → agent continuation on SEV1/SEV2/SEV2.5) | `eyetok`, `slack`, `p1-launch` | M (2 days) | SC-E4 |
| SC-E7 | EyeTok: `incident_state` Postgres schema + `sev_channel` column + deduplication + low-severity batching | `eyetok`, `database` | S (1 day) | SC-E1 |
| SC-E8 | Runbook frontmatter standard: add `incident_class` + `component` YAML to all existing runbooks | `docs`, `runbooks` | S (1 day) | None |
| SC-E9 | EyeTok: runbook matching + GitHub API fetch + no-runbook draft path | `eyetok`, `sop` | M (2 days) | SC-E8, SC-E4 |
| SC-E10 | Console: `FLAG_CONSOLE_INCIDENTS` feature flag + B1 promotion migration + `/console/incidents` stub (count badge); flag default off, flip on once EyeTok verified in prod | `console`, `frontend`, `p1-launch` | S (1 day) | SC-E3 |
| SC-E11 | Console: `/console/incidents` full page with SEV badges (SEV1/SEV2/SEV2.5/SEV3), Slack links, ticket links, agent transcript | `console`, `frontend` | L (3 days) | SC-E10, SC-E4, SC-E5 |
| SC-E12 | EyeTok: PII redaction layer for log excerpts before Slack post / transcript storage | `security`, `gdpr`, `eyetok` | M (2 days) | SC-E4 |
| SC-E13 | EyeTok: ET business-hours runtime resolver for SEV2 vs SEV2.5 channel routing | `eyetok`, `ops` | S (1 day) | SC-E3 |
| SC-E14 | `incident_state` 90-day purge job + DSR erasure hook | `gdpr`, `eyetok`, `compliance` | S (1 day) | SC-E7 |
MVP for launch (2026-05-23 UTC): SC-E1, SC-E2, SC-E3, SC-E4, SC-E5, SC-E7, SC-E10, SC-E12.
Post-launch v1.1: SC-E6, SC-E8, SC-E9, SC-E11, SC-E13, SC-E14.