internal · gated

Raxx On-Call Cloud Agent + Alerting Architecture

Status: Proposed
Date: 2026-05-15 UTC
Last updated: 2026-05-15 UTC (third amend — codename Watchpost → EyeTok; FLAG_CONSOLE_INCIDENTS locked)
Author: raxx-software-architect
Request source: Operator morning standup 2026-05-15 UTC — Kristerpher Henderson
Companion ADR: docs/architecture/adr/0091-alerting-source-and-oncall-agent.md
Launch target: 2026-05-23 UTC (T-8 days at time of writing)


Operator-confirmed decisions (2026-05-15 UTC)

The following decisions were locked by Kristerpher Henderson after initial design review. They supersede any earlier proposal in this document.

  1. Severity-based channel routing (4 channels). Slack channel assignment is strictly severity-driven. The original #ops-alerts / #ops-triage / #ops-digest proposal is replaced by four dedicated channels — see Section 6.

  2. ET 08:00–18:30 business hours define SEV2 vs SEV2.5 split. High-severity alerts during business hours post to #raxx-ops-alert-sev2-5 (no mobile-push override); outside business hours they post to #raxx-ops-alert-sev2 (mobile push + call-out). UTC translations: EDT window (Mar–Nov) is 12:00–22:30 UTC weekdays; EST window (Nov–Mar) is 13:00–23:30 UTC weekdays. Weekends always route to SEV2 regardless of time.

  3. PagerDuty deferred to v1.1. Pre-launch: Sentry → Slack with channel-specific mobile push override for SEV1/SEV2. Operator's phone push handles wake-up. PagerDuty Solo ($21/mo) is added only if a real after-hours SEV1 is missed.

  4. SEV3 = agent autonomous, but cannot self-escalate. Agent triages and remediates medium/low incidents without human ping. If the agent determines mid-triage that the incident is worse than scored, it posts escalation_recommended: SEV2 to the SEV3 channel and waits for operator ack — it does not self-promote.

  5. FreeScout mailbox confirmed: ops@raxx.app. No new mailbox needed. On-call tickets land in the existing ops@raxx.app mailbox. Unblocks SC-E5.

  6. Slack channel IDs confirmed. All four channels exist in the MooseQuest workspace. IDs locked 2026-05-15 UTC — see Section 6.

  7. Codename: EyeTok. The on-call cloud agent service is codenamed EyeTok (phonetic callback to ITOC, operator's first job). Locked 2026-05-15 UTC. Service identifier: eyetok (lowercase) for Heroku app names, vault paths, env vars, and log fields.

  8. FLAG_CONSOLE_INCIDENTS = gated at launch. /console/incidents is gated behind FLAG_CONSOLE_INCIDENTS (default off). Flip on once EyeTok is verified live in prod. The SC-E10 PR must include a B1 promotion migration for this flag in the same PR per the new-flag-needs-b1-migration-same-pr convention.


1. Context

Kristerpher needs a tireless first-responder between himself and the noise of a live trading platform. The on-call cloud agent is that responder: it watches for incident-grade signals, reads existing runbooks, posts structured triage to Slack, and keeps a ticket trail in FreeScout — so the operator wakes to context, not chaos.

Scope of this document:

  - Alerting source selection (Sentry vs Splunk vs layered)
  - Cloud agent runtime, identity, and tool surface
  - Slack engagement protocol and quiet-hour posture
  - SOP/runbook integration
  - FreeScout ticket lifecycle
  - Console severity wiring (/console/incidents)
  - Pre-launch MVP boundary vs post-launch phases

Not in scope: customer-facing alerting, market-data SLA monitoring, financial audit events (those flow through the existing KMS hash-chain audit trail per ADR-0022).


2. Invariants

The following constraints are non-negotiable for every design choice in this document.

  1. No stored credentials. The agent identity holds tokens via the Anthropic Managed Agents Vault or Infisical. It never persists credentials in code, config files, or agent memory.
  2. Audit trail for every state change. Every agent decision (triage conclusion, proposed action, escalation) writes a structured log row linked to the incident ticket.
  3. Agent does not autonomously execute customer-facing actions. Per the platform's deterministic-execution posture, the agent proposes; the operator decides. The agent may read, post, and create tickets — it never fires orders, modifies RBAC, or changes live-trading flags without explicit operator confirmation.
  4. GDPR by default. If the agent processes user PII (e.g., a log line containing an email address), it must redact before storing or posting to Slack. Runbook context is not PII-bearing; log snippets shared in Slack must be sanitised at the ingestion step.
  5. Digest posture pre-launch. New low/medium alert paths are digest-batched until 2026-05-23 UTC per the existing notification posture (docs/architecture/ci-notification-posture.md). Critical/High fire immediately, always.
  6. Secrets in vault, not files. Agent credentials for Slack, Sentry, FreeScout, and GitHub sit in the Anthropic Managed Agents Vault (reference: reference_claude_secrets_vault.md). None shipped in this repo.
  7. Paper-first gating is not the agent's concern. The agent observes signals and proposes; it does not bypass or override paper-profitable gates.

3. Alerting Source Decision

3.1 Candidates

Sentry (provisioned, active): application-layer error and performance capture across Raptor, Antlers, Console, and Queue. Captures exceptions, slow transactions, and custom events. Has native Slack integration and webhook-based alert rules. Latency from event to alert: typically 15–60 seconds. Already receiving live events as of Queue DSN wiring (#2142).

Splunk (operator-run EC2 instance): log-aggregation and search. Strong query language (SPL), supports threshold and anomaly alerts. Covers infrastructure-layer signals that Sentry cannot: disk, memory, network, H12s outside Sentry SDK scope. Alert delivery requires either Splunk Cloud (paid) or a self-hosted alerting scheduler (webhook action from Splunk saved searches). Latency from event to alert: 1–5 minutes depending on scheduled search frequency.

3.2 Coverage comparison

| Signal class | Sentry | Splunk |
| --- | --- | --- |
| Unhandled exceptions (Raptor / Queue / Console) | Yes (primary) | Via log forwarding only |
| Slow transactions / P99 degradation | Yes (performance) | With metric extraction |
| H12 request timeout (Heroku dyno) | Partial (if SDK captures) | Yes (router log) |
| 5xx rate spike | Yes (issue alert) | Yes (SPL threshold) |
| Memory / CPU / disk on EC2 | No | Yes |
| Custom business events (e.g., paper gate breach) | Yes (custom capture) | Yes (log event) |
| Infra cold-start / deploy smoke | No | Yes |
| Log-level patterns (WARN flood, audit gap) | No | Yes |

3.3 Recommendation: Sentry-primary, Splunk-secondary (deferred)

Decision: Sentry is the v1 alerting source. Splunk is a valid secondary layer for infrastructure-level signals but adds operational burden (Splunk forwarder config, saved-search maintenance, EC2 uptime dependency) that is disproportionate to v1 scale (~10 users). Sentry covers the classes of signals most likely to cause user-visible incidents at launch. Splunk can be added in v1.1 if infrastructure-layer blind spots materialise.

Pre-launch readiness: Sentry is already wired. Slack integration is one toggle in the Sentry project settings + one alert rule. It can be live in hours. A Splunk forwarder pipeline would take 1–3 days to design and test correctly.

Cost: Sentry Team plan is ~$26/mo for the first 10K errors/mo. At v1 scale this is sufficient. Splunk self-hosted on EC2 costs only the instance time (already paid) but Splunk Cloud alerting adds ~$150+/mo — not justified at v1.

3.4 Severity mapping

| Source signal | Mapped severity | Slack channel | Ticket created? |
| --- | --- | --- | --- |
| Sentry: CRITICAL issue (level=fatal, affecting >10% of sessions) | severity:critical | #raxx-ops-alert-sev1 (immediate) | Yes (P1) |
| Sentry: HIGH issue (unhandled exception, production, new) | severity:high | #raxx-ops-alert-sev2 or #raxx-ops-alert-sev2-5 (business-hours split; see Section 6) | Yes (P2) |
| Sentry: MEDIUM (regression, performance degraded, spike) | severity:medium | #raxx-ops-alert-sev3 | Yes (P3) |
| Sentry: LOW (warning-level, non-regression) | severity:low | #raxx-ops-alert-sev3 (batched) | Optional |
| Sentry: INFO (deploy markers, synthetic pass) | severity:info | agent audit log only | No |
| Heroku H12 (via Sentry log integration) | severity:high | #raxx-ops-alert-sev2 / sev2-5 | Yes (P2) |
| Paper gate breach (custom Sentry event) | severity:critical | #raxx-ops-alert-sev1 | Yes (P1) |
| Synthetic check failure | severity:high | #raxx-ops-alert-sev2 / sev2-5 | Yes (P2) |

Severity badges in Console mirror this table exactly (see Section 8.3).


4. Cloud Agent Architecture

4.1 Runtime location

Three options considered:

Option A — EC2 long-running process (operator's existing infra): A small Python process polls or listens for webhooks, calls the Claude API, and manages conversation state. Cheapest infra ($0 incremental if sharing Splunk EC2). Requires the operator to maintain the process (systemd, restarts, log rotation). On-call agent going down = no triage coverage. Risk: single point of failure at the worst time.

Option B — Anthropic cloud-hosted agent (Claude.ai managed sessions): Agent sessions run in Anthropic infrastructure. No infra to maintain. But as of 2026-05-15 UTC, the "Claude menu" / live SDK sessions capability is future work (per project_claude_menu_current_vs_future.md). Not available for v1.

Option C — Heroku worker dyno (existing Heroku estate): A small Python/Flask app deployed as a worker dyno on a new micro-app raxx-eyetok-prod. Receives webhooks from Sentry, calls Claude API, posts to Slack, creates FreeScout tickets. Heroku Standard dyno = ~$25/mo. Restarts on crash. No additional EC2 management. Heroku handles TLS for the webhook endpoint.

Note on process type: Heroku routes inbound HTTP only to processes of type web, so the webhook receiver must be declared as a web process in the Procfile even though it behaves like a background worker. The scale commands in Sections 10 and 11 should target whichever process type the receiver actually runs as.

Decision: Option C — Heroku worker dyno. Integrates naturally with the existing Heroku estate. Webhook endpoint is a proper HTTPS URL Sentry can target. Crash = Heroku restarts within seconds. No new EC2 to manage. Agent identity and credentials follow the same vault pattern as other services.

The dyno app is codenamed EyeTok (phonetic callback to ITOC, operator's first job — locked 2026-05-15 UTC).

4.2 Agent identity and credentials

EyeTok holds the following credentials, all sourced from Infisical at dyno startup (never committed to the repo):

| Credential | Purpose | Vault path |
| --- | --- | --- |
| ANTHROPIC_API_KEY | Claude API calls | /Raxx/Eyetok/ANTHROPIC_API_KEY |
| SENTRY_WEBHOOK_SECRET | Validate inbound Sentry webhooks (HMAC) | /Raxx/Eyetok/SENTRY_WEBHOOK_SECRET |
| SLACK_BOT_TOKEN | Post to Slack channels, reply in threads | /Raxx/Eyetok/SLACK_BOT_TOKEN |
| SLACK_SIGNING_SECRET | Validate inbound Slack events (operator replies) | /Raxx/Eyetok/SLACK_SIGNING_SECRET |
| FREESCOUT_API_KEY | Create / update tickets at tickets.raxx.app | /Raxx/Eyetok/FREESCOUT_API_KEY |
| GITHUB_APP_TOKEN | Read runbooks, file issues (via raxx-dev-bot) | /Raxx/Eyetok/GITHUB_APP_TOKEN |

Credentials are rotatable without redeploy (Infisical push → Heroku config:set → dyno restart). The ANTHROPIC_API_KEY rotation procedure is: new key in vault → heroku config:set ANTHROPIC_API_KEY=... -a raxx-eyetok-prod >/dev/null 2>&1 → old key revoked at Anthropic console.

4.3 Trigger model

sequenceDiagram
    participant Sentry
    participant EyeTok as EyeTok (raxx-eyetok-prod dyno)
    participant Claude as Claude API
    participant Slack
    participant FreeScout

    Sentry->>EyeTok: POST /webhooks/sentry (HMAC-signed)
    EyeTok->>EyeTok: validate HMAC, deduplicate (issue_id)
    EyeTok->>EyeTok: fetch matching runbook from docs/ops/runbooks/
    EyeTok->>Claude: invoke agent with alert context + runbook
    Claude-->>EyeTok: triage summary + proposed actions
    EyeTok->>Slack: post to SEV channel (severity + business-hours-routed)
    EyeTok->>FreeScout: create ticket (P1/P2/P3)
    EyeTok->>EyeTok: store incident_state{issue_id, thread_ts, ticket_id}

    Slack->>EyeTok: POST /webhooks/slack (operator reply in thread)
    EyeTok->>Claude: continue conversation with operator message
    Claude-->>EyeTok: follow-up response
    EyeTok->>Slack: reply in thread
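
The HMAC validation step at the top of the sequence above could look roughly like the following. This is a minimal sketch, assuming the signature arrives as a hex-encoded HMAC-SHA256 of the raw request body keyed with SENTRY_WEBHOOK_SECRET; the header that carries it and the exact signing scheme are assumptions to confirm against Sentry's webhook documentation. verify_signature is a hypothetical helper name.

```python
import hashlib
import hmac


def verify_signature(secret: str, body: bytes, signature_hex: str) -> bool:
    """Return True if signature_hex is the hex HMAC-SHA256 of body under secret.

    Assumption: the sender signs the raw request bytes, so the body must be
    read before any JSON parsing or re-serialisation.
    """
    expected = hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()
    # compare_digest runs in constant time, so a mismatch does not leak
    # how many leading characters were correct.
    return hmac.compare_digest(expected, signature_hex)
```

A request whose signature fails this check would be rejected before deduplication or any Claude call, which keeps unauthenticated traffic from consuming tokens.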

Deduplication: EyeTok maintains an in-memory + Postgres-backed incident_state table keyed on Sentry issue_id. Re-fires within a 30-minute window do not spawn a new thread or ticket. They post a "re-fired (N)" update to the existing thread.
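
The deduplication rule can be sketched as below. This is an illustration only: the dict stands in for the Postgres-backed incident_state table, last_fired_at and refire_count are hypothetical fields not listed in the schema, and the 30-minute window is assumed to be measured from the most recent fire rather than the first.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

REFIRE_WINDOW = timedelta(minutes=30)


@dataclass
class IncidentState:
    sentry_issue_id: str
    last_fired_at: datetime  # hypothetical field, used only for this sketch
    refire_count: int = 0    # hypothetical field, used only for this sketch


def handle_fire(store: dict, issue_id: str, now: datetime) -> str:
    """Return "new" for a fresh incident, or "re-fired (N)" inside the window."""
    state = store.get(issue_id)
    if state is not None and now - state.last_fired_at <= REFIRE_WINDOW:
        # Duplicate: update the existing incident instead of opening a new
        # Slack thread or FreeScout ticket.
        state.refire_count += 1
        state.last_fired_at = now
        return f"re-fired ({state.refire_count})"
    # First fire, or the previous fire is older than the window.
    store[issue_id] = IncidentState(issue_id, now)
    return "new"
```

Whether the window slides with each re-fire or is anchored to the first fire changes how long a noisy issue stays collapsed into one thread, so that choice is worth pinning down in SC-E7.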

Low-severity batching: EyeTok accumulates severity:low and severity:info events in Postgres rather than posting each one as it arrives. Batched severity:low events are handled in the agent's SEV3 autonomous triage pass; severity:info events are written to the agent audit log only.

4.4 Tool surface (MCPs)

The Claude agent inside EyeTok has access to the following tools:

| Tool | MCP or API | What the agent uses it for |
| --- | --- | --- |
| GitHub file read | GitHub MCP (raxx-dev-bot) | Read runbooks at docs/ops/runbooks/ on each incident |
| Slack post / reply | Slack MCP | Post alerts, reply in threads |
| FreeScout ticket CRUD | REST calls (no MCP) | Create, update, close incident tickets |
| Sentry issue query | Sentry REST API | Fetch additional context (stack trace, affected users) |
| Console incidents write | Internal REST endpoint | Sync incident state to /console/incidents |

FreeScout has no known MCP; the agent calls the FreeScout REST API directly using the API key (per project_freescout_api_key.md). Note the UI-only limitations (project_freescout_api_limits.md): auto-reply templates are UI-only and cannot be triggered via API. Tickets are created via POST to the conversations endpoint.

4.5 Memory and state

The agent does not carry cross-incident memory. Each incident is a fresh Claude invocation with the alert context + runbook text + conversation history for that Slack thread. This is intentional: cross-incident contamination would reduce triage precision.

Thread history (the running Slack conversation) is passed as context on each Slack reply, bounded to the last 20 exchanges to control token cost.

incident_state table (EyeTok Postgres):

| Column | Type | Notes |
| --- | --- | --- |
| incident_id | UUID PK | Generated on first fire |
| sentry_issue_id | TEXT UNIQUE | Deduplication key |
| severity | TEXT | critical/high/medium/low/info |
| sev_channel | TEXT | SEV1/SEV2/SEV2.5/SEV3; actual channel routed at fire time |
| slack_thread_ts | TEXT | Slack thread timestamp for replies |
| slack_channel_id | TEXT | Slack channel ID (see Section 6.1) |
| freescout_ticket_id | INT | |
| status | TEXT | open/acknowledged/in_progress/resolved/escalated/escalation_recommended |
| agent_transcript | JSONB | Array of {role, content} pairs for audit |
| created_at | TIMESTAMPTZ | |
| resolved_at | TIMESTAMPTZ | Nullable |
| runbook_used | TEXT | Filename of matched runbook, or null |

4.6 Cost estimate

| Item | v1 (~10 users) | v1.1 (~100 users) |
| --- | --- | --- |
| Heroku Standard-1X dyno (raxx-eyetok-prod) | ~$25/mo | ~$25/mo |
| Heroku Postgres (Mini) | ~$5/mo | ~$5/mo (same scale) |
| Claude API tokens | ~$5–15/mo (est. 5–20 incidents/day × 2K tokens each) | ~$20–50/mo |
| Sentry Team plan | ~$26/mo | ~$26/mo |
| PagerDuty Solo | deferred to v1.1 | ~$21/mo (if needed) |
| Total | ~$61–71/mo | ~$97–127/mo |

Token cost assumes average incident = 1 runbook read (~500 tokens) + 1 triage response (~1K tokens) + 2–3 operator exchanges (~500 tokens each). Worst case at v1 is ~$30/mo if incident rate spikes.
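
The token assumptions above can be checked with a few lines of arithmetic. The sketch below only counts tokens; it deliberately applies no per-token price, since none is stated in this document. With 2–3 operator exchanges the per-incident total comes to 2,500–3,000 tokens, slightly above the ~2K figure used in the table. Function and constant names are illustrative.

```python
RUNBOOK_READ_TOKENS = 500      # one runbook read
TRIAGE_RESPONSE_TOKENS = 1_000 # one triage response
EXCHANGE_TOKENS = 500          # each operator exchange


def tokens_per_incident(operator_exchanges: int) -> int:
    """Total tokens for one incident under the assumptions in Section 4.6."""
    return (
        RUNBOOK_READ_TOKENS
        + TRIAGE_RESPONSE_TOKENS
        + operator_exchanges * EXCHANGE_TOKENS
    )


def monthly_tokens(incidents_per_day: int, tokens_each: int, days: int = 30) -> int:
    """Monthly token volume for a given incident rate and per-incident size."""
    return incidents_per_day * tokens_each * days


low = monthly_tokens(5, 2_000)    # 300,000 tokens/month
high = monthly_tokens(20, 2_000)  # 1,200,000 tokens/month
```

This count does not include the thread history that is re-sent as context on each operator reply, so engaged incidents will run somewhat higher.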


5. SOP Integration

Runbooks live at docs/ops/runbooks/. The agent reads them at incident time, not on cold-start (cold-start read would bloat context unnecessarily).

5.1 Runbook matching

Each runbook file must include a YAML frontmatter block:

---
incident_class: [h12_timeout, sentry_fatal, paper_gate_breach, db_connection_pool, ...]
component: [raptor, antlers, queue, console, velvet]
severity_floor: high
---

EyeTok normalises the Sentry alert into {incident_class, component} and fetches the matching runbook via the GitHub API. If multiple runbooks match, all are included (ordered by severity_floor descending). If zero match, the agent proceeds with a "no runbook found" preamble and proposes a draft runbook as part of its response.
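
A minimal sketch of the matching rule described above, assuming runbooks have already been fetched into a filename-to-text mapping (the GitHub API call is omitted). The parser handles only the small frontmatter subset shown in this section (scalars and inline lists); a real implementation would more likely use a YAML library. parse_frontmatter and match_runbooks are hypothetical names.

```python
SEVERITY_RANK = {"info": 0, "low": 1, "medium": 2, "high": 3, "critical": 4}


def parse_frontmatter(text: str) -> dict:
    """Parse `key: value` and `key: [a, b]` lines between the --- markers."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break
        key, sep, value = line.partition(":")
        if not sep:
            continue
        value = value.strip()
        if value.startswith("[") and value.endswith("]"):
            meta[key.strip()] = [v.strip() for v in value[1:-1].split(",") if v.strip()]
        else:
            meta[key.strip()] = value
    return meta


def _as_list(value) -> list:
    """Treat a scalar frontmatter value as a one-item list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def match_runbooks(runbooks: dict, incident_class: str, component: str) -> list:
    """Return matching filenames, highest severity_floor first."""
    hits = []
    for name, text in runbooks.items():
        meta = parse_frontmatter(text)
        if incident_class not in _as_list(meta.get("incident_class")):
            continue
        if component not in _as_list(meta.get("component")):
            continue
        floor = meta.get("severity_floor")
        rank = SEVERITY_RANK.get(floor, -1) if isinstance(floor, str) else -1
        hits.append((rank, name))
    # Sort by severity_floor descending, then filename for a stable order.
    hits.sort(key=lambda h: (-h[0], h[1]))
    return [name for _, name in hits]
```

An empty result corresponds to the no-runbook path in Section 5.2.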

5.2 No-runbook path

When the agent finds no matching runbook, it:

  1. Posts triage based on the raw alert signal.
  2. Appends: "No matching runbook found for {incident_class}. Proposed draft runbook attached."
  3. Files a separate GitHub issue tagged runbook-needed against the relevant component.

The operator's confirmation in-thread to accept the draft runbook triggers a PR from feature-developer. This closes the loop from incident → SOP.

5.3 SOP authorship loop

graph LR
    Incident -->|no runbook matched| EyeTokDraft[Agent proposes draft]
    EyeTokDraft -->|operator approves in thread| GHIssue[GH Issue: runbook-needed]
    GHIssue -->|card-groomer grooms| FeatureDev[feature-developer writes runbook]
    FeatureDev -->|PR merged| RunbookFile[docs/ops/runbooks/*.md]
    RunbookFile -->|next similar incident| EyeTok[Agent reads runbook]

6. Slack Engagement Protocol

6.1 Channel structure (locked 2026-05-15 UTC)

Four dedicated channels replace the original #ops-alerts / #ops-triage / #ops-digest proposal. Channel assignment is determined at fire time by the combination of severity and business-hours window.

Channel naming note: Slack normalises channel names to lowercase with hyphens only — capital letters are lowercased and periods are converted to hyphens. The SEV identifier (SEV1, SEV2, SEV2.5, SEV3) remains unchanged everywhere in code, docs, and discussions. Only the Slack channel string is affected by Slack's normalisation (e.g., SEV2.5 → channel name sev2-5). The sev_channel column in incident_state stores the canonical SEV identifier, not the Slack channel string.

| SEV identifier | Slack channel name | Slack channel ID | Mapped severity | Agent behavior | Notify posture |
| --- | --- | --- | --- | --- | --- |
| SEV1 | #raxx-ops-alert-sev1 | C0B423M38H4 | severity:critical | Post immediately, link to Emergency SOP, wait for human ack; no autonomous action | Slack mobile push + sound override; v1.1: PagerDuty page |
| SEV2 | #raxx-ops-alert-sev2 | C0B445RU95Y | severity:high OUTSIDE ET business hours | Post + tag operator + propose remediation. Wait for human approval before any change | Slack mobile push (call-out) |
| SEV2.5 | #raxx-ops-alert-sev2-5 | C0B4611UC2V | severity:high DURING ET business hours | Same as SEV2 but no mobile-push override (operator at desk) | Slack normal notification |
| SEV3 | #raxx-ops-alert-sev3 | C0B4615LQ49 | severity:medium + severity:low | Agent triages + remediates autonomously. Posts outcome only. No human ping. | Agent audit log only |

6.2 ET business hours window

The SEV2 / SEV2.5 split is governed by Eastern Time (ET) business hours: 08:00–18:30 ET on weekdays.

UTC translations:

  - EDT (approx. Mar–Nov): business hours = 12:00–22:30 UTC weekdays
  - EST (approx. Nov–Mar): business hours = 13:00–23:30 UTC weekdays
  - Weekend: always SEV2 regardless of wall-clock time

EyeTok must resolve the current ET offset at runtime (system timezone or a tzdata lookup) rather than hardcoding one UTC window, to handle the EDT/EST transition automatically.
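
One way to resolve the window at runtime is Python's standard zoneinfo module, which follows the EDT/EST transition from tzdata. The sketch below is an assumption-laden illustration: it treats the window as half-open (08:00 inclusive, 18:30 exclusive), which this document does not specify, and the function names are hypothetical.

```python
from datetime import datetime, time, timezone
from zoneinfo import ZoneInfo

ET = ZoneInfo("America/New_York")
OPEN, CLOSE = time(8, 0), time(18, 30)


def is_business_hours(now: datetime) -> bool:
    """True on weekdays between 08:00 and 18:30 Eastern Time."""
    if now.tzinfo is None:
        # A naive datetime would be interpreted in the host's local zone.
        raise ValueError("now must be timezone-aware")
    local = now.astimezone(ET)
    # weekday(): Monday is 0, so 5 and 6 are Saturday and Sunday.
    return local.weekday() < 5 and OPEN <= local.time() < CLOSE


def high_severity_channel(now: datetime) -> str:
    """Return the canonical SEV identifier for a severity:high alert."""
    return "SEV2.5" if is_business_hours(now) else "SEV2"
```

For example, high_severity_channel(datetime(2026, 5, 15, 12, 0, tzinfo=timezone.utc)) falls at 08:00 EDT on a Friday and resolves to SEV2.5, while the same call for a Saturday resolves to SEV2. On hosts without a system tz database, zoneinfo needs the tzdata package installed.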

6.3 SEV3 escalation invariant

The agent CANNOT autonomously promote a SEV3 incident to SEV2 or SEV1. If mid-triage the agent determines the incident is more severe than initially scored:

  1. It posts its updated assessment to the SEV3 channel with field escalation_recommended: SEV2 (or SEV1).
  2. It waits for operator acknowledgement of the promotion in-thread.
  3. Only after operator ack does EyeTok re-post to the appropriate SEV2/SEV1 channel and create the escalated ticket.

This invariant is enforced at the EyeTok application layer, not left to agent discretion.
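
Enforcing this at the application layer might look like the sketch below: the agent can only record a recommendation, and the promotion path refuses to run without an explicit operator ack. The Incident class, function names, and the choice to reset status to open after promotion are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Incident:
    sev_channel: str
    status: str = "open"
    recommended: Optional[str] = None


def recommend_escalation(incident: Incident, target: str) -> None:
    """Agent-callable: flag a SEV3 incident without changing its channel."""
    if incident.sev_channel != "SEV3":
        raise ValueError("only SEV3 incidents use the recommendation path")
    if target not in ("SEV2", "SEV1"):
        raise ValueError("target must be SEV2 or SEV1")
    incident.status = "escalation_recommended"
    incident.recommended = target


def apply_escalation(incident: Incident, operator_acked: bool) -> None:
    """Application-layer only: promote after an explicit operator ack."""
    if incident.status != "escalation_recommended" or incident.recommended is None:
        raise PermissionError("no pending escalation recommendation")
    if not operator_acked:
        raise PermissionError("operator ack required before promotion")
    incident.sev_channel = incident.recommended
    incident.recommended = None
    # Assumption: the promoted incident starts over as open in its new channel.
    incident.status = "open"
```

Keeping apply_escalation out of the agent's tool surface is what makes the guard independent of agent discretion.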

6.4 PagerDuty deferral

PagerDuty is deferred to v1.1. Pre-launch the SEV1 and SEV2 mobile push override on the dedicated Slack channels is the wake-up mechanism. Rationale: operator's phone Slack push handles the practical requirement; adding PagerDuty Solo ($21/mo) before a real missed SEV1 demonstrates the need is premature overhead.

Trigger for PagerDuty adoption: a real after-hours SEV1 incident is missed because Slack push was silenced or ignored. At that point the operator adds PagerDuty without a design change — the EyeTok posting path is already SEV-channel-structured and compatible.

6.5 Alert message structure

[SEV1 / CRITICAL] Raxx — Unhandled exception in Queue (10 users affected)
Sentry: <https://sentry.io/...>
Ticket: <https://tickets.raxx.app/conversations/...>

Triage summary: Queue worker is throwing NullPointerException in the billing
settlement path after the Stripe webhook arrives with a missing metadata field.
Matches runbook: `queue-stripe-webhook-error.md`

Proposed action:
1. Confirm Stripe test-mode webhook is not misconfigured.
2. Restart Queue worker dyno (`heroku restart -a raxx-queue-prod`).
3. If restart does not clear: roll back to prior deploy.

Reply in this thread to engage agent.

Status updates are posted to the same thread as the conversation progresses:

  - [ACKNOWLEDGED]: operator first reply
  - [IN PROGRESS]: operator has acted
  - [RESOLVED]: operator marks resolved; ticket closes
  - [ESCALATED]: no response in 15 min for SEV1, 60 min for SEV2/2.5
  - [ESCALATION RECOMMENDED → SEV2]: agent-flagged mid-triage upgrade, awaiting operator ack

6.6 Operator engagement

The operator replies in the thread. EyeTok's Slack Events API subscription routes thread replies on #raxx-ops-alert-sev1, #raxx-ops-alert-sev2, and #raxx-ops-alert-sev2-5 to the Claude agent continuation. #raxx-ops-alert-sev3 posts are outcome-only; the operator can reply but it does not trigger agent continuation unless the operator explicitly asks.


7. Ticket Integration

Every incident at severity medium or above creates a FreeScout ticket. Low/info severity does not create tickets (digest-only).

Ticket fields:

| FreeScout field | Value |
| --- | --- |
| Subject | [{SEV_CHANNEL}] {Sentry title} |
| Mailbox | ops@raxx.app (confirmed; existing internal queue) |
| Tags | eyetok-agent, {severity}, {sev_channel}, {component} |
| Custom: incident_severity | maps to existing FreeScout custom field (runbook: freescout-status-fields.md) |
| Custom: component_tag | e.g. queue, raptor |
| Body | Agent triage summary + Slack thread link + Sentry link |

Ticket lifecycle:

  - Created: on first Sentry webhook fire (severity >= medium)
  - Updated: each agent status post (acknowledged, in progress, resolved) writes a FreeScout note via API
  - Closed: when operator posts "resolved" in Slack thread, EyeTok closes the FreeScout ticket via API
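
The ticket fields and the creation threshold can be sketched as a pure function. The dictionary keys below are illustrative names, not FreeScout API field names; mapping them onto the actual conversations payload is left to SC-E5.

```python
TICKET_SEVERITIES = {"critical", "high", "medium"}
MAILBOX = "ops@raxx.app"


def creates_ticket(severity: str) -> bool:
    """Section 7 rule: only medium and above create a ticket."""
    return severity in TICKET_SEVERITIES


def ticket_fields(
    sev_channel: str,
    severity: str,
    component: str,
    sentry_title: str,
    triage_summary: str,
    slack_thread_url: str,
    sentry_url: str,
) -> dict:
    """Assemble the ticket fields from the table above (keys are hypothetical)."""
    return {
        "subject": f"[{sev_channel}] {sentry_title}",
        "mailbox": MAILBOX,
        "tags": ["eyetok-agent", severity, sev_channel, component],
        "body": (
            f"{triage_summary}\n\n"
            f"Slack thread: {slack_thread_url}\n"
            f"Sentry: {sentry_url}"
        ),
    }
```

Building the fields separately from the HTTP call keeps the mapping testable without a FreeScout instance.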

Audit trail: incident_state.agent_transcript (JSONB) is the structured record of what the agent observed, proposed, and what the operator chose. This is retained for 90 days, consistent with audit log retention policy.


8. Console Integration — Severity Wiring

Operator asked: "make sure that all gets wired into Console somehow."

8.1 /console/incidents page (v1 — read-only, gated behind FLAG_CONSOLE_INCIDENTS)

A new console page surfaces all open incidents. Gating: the page is behind FLAG_CONSOLE_INCIDENTS (default off). The SC-E10 PR must ship the flag + a B1 promotion migration in the same PR per the new-flag-needs-b1-migration-same-pr convention. The flag is flipped on once EyeTok is verified live in prod.

The page shows:

  - Severity badge (SEV1 / SEV2 / SEV2.5 / SEV3; reuses existing badge component conventions from Console)
  - Sentry issue link
  - FreeScout ticket link
  - Slack thread link
  - Agent triage summary (truncated to 3 lines; expandable)
  - Status chip: open / acknowledged / in-progress / resolved / escalated / escalation-recommended

Data source: EyeTok exposes a REST endpoint GET /api/incidents (authenticated via Console's internal service token). Console polls every 60 seconds.

8.2 Write-paths (post-v1)

In v1 the console page is read-only — acknowledge, escalate, resolve happen in Slack. Post-v1 the Console can add action buttons that POST back to EyeTok, which then propagates to Slack thread + FreeScout. This keeps the audit trail centrally owned by EyeTok.

8.3 Severity badge component

Reuses the severity badge convention from the existing status page (ADR-0030 state machine colours). Mapping:

| Severity / Channel | Badge colour | Label |
| --- | --- | --- |
| SEV1 / critical | Red | SEV1 |
| SEV2 / high (off-hours) | Orange | SEV2 |
| SEV2.5 / high (business hours) | Amber | SEV2.5 |
| SEV3 / medium + low | Yellow | SEV3 |
| info | Grey | INFO |

9. Pre-Launch vs Post-Launch Staging

T-8 days to launch (2026-05-23 UTC).

Pre-launch MVP (ship by 2026-05-21 UTC)

  1. Sentry Slack integration enabled natively (toggle in Sentry UI, no code), with alert rules routing fatal/error events at severity:critical and severity:high to #raxx-ops-alert-sev1 and #raxx-ops-alert-sev2 respectively. This is the zero-code baseline.
  2. EyeTok dyno scaffolded: accepts Sentry webhooks, routes to SEV channels (including business-hours SEV2 vs SEV2.5 split), creates FreeScout tickets at ops@raxx.app, posts to Slack with agent triage. No SOP reading yet.
  3. incident_state Postgres table provisioned (including sev_channel column).
  4. /console/incidents page: stub — shows count of open P1/P2 tickets from FreeScout API. Full incident view is post-launch. Gated behind FLAG_CONSOLE_INCIDENTS (default off).

This MVP can be shipped in 3–4 days by a single feature-developer.

Post-launch v1.1 (within 2 weeks of launch)

Post-launch v1.2 (within 4 weeks of launch)


10. Migrations

The incident_state table and the agent_transcript JSONB column are new — no existing schema is affected. EyeTok is a new Heroku app; it does not touch Raptor, Queue, or Console databases directly.

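Console gains a new page (/console/incidents) backed by an outbound HTTP call to EyeTok. No Console schema migration is needed; the only Console-side migration is the B1 promotion migration for FLAG_CONSOLE_INCIDENTS that ships with SC-E10.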

Rollback: disable the EyeTok dyno (heroku scale worker=0 -a raxx-eyetok-prod). Sentry native Slack integration (the zero-code baseline) continues to fire. No data is lost.


11. Security Considerations

| Question | Answer |
| --- | --- |
| What PII does this collect? | Log excerpts from Sentry alerts may contain email addresses or user IDs. EyeTok must redact known PII patterns (email regex, UUID-to-user mapping) before posting to Slack or storing in agent_transcript. |
| Retention period? | incident_state rows retained 90 days. agent_transcript same. Purge job runs weekly. |
| DSR erasure path? | User incidents do not reference PII at the ticket level. If a user requests erasure and a ticket references their data, the agent_transcript JSONB is nulled for that row. Manual DSR handling runbook (docs/ops/runbooks/manual-dsr-handling.md) covers this. |
| Audit log? | Every agent action is appended to agent_transcript with timestamp, role, and content. Redacted before storage. |
| Credential replay risk? | No. Agent tokens are short-lived (Infisical dynamic secrets where possible) and never stored in the repo. |
| Breach notification? | If EyeTok itself is compromised, the FREESCOUT_API_KEY and SLACK_BOT_TOKEN are rotatable in <5 minutes via Infisical → heroku config:set. Breach notification follows existing GDPR breach runbook. |
| Secrets location? | All in Infisical under /Raxx/Eyetok/. Rotatable without redeploy. |
| Kill-switch? | heroku scale worker=0 -a raxx-eyetok-prod shuts down EyeTok in seconds. Sentry native Slack alerts continue as fallback. |
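
The redaction step named in the first row could start from a pattern-based pass like the sketch below. It is deliberately over-broad: it masks every email-shaped string and every UUID-shaped string, whereas the design calls for redacting only UUIDs that map to users, which would need a lookup this sketch does not attempt. The patterns and placeholder tokens are assumptions.

```python
import re

# Simple email shape; not a full RFC 5322 matcher.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Canonical 8-4-4-4-12 hex UUID shape.
UUID_RE = re.compile(
    r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-"
    r"[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"
)


def redact(text: str) -> str:
    """Mask email addresses and UUIDs before posting or storing a log excerpt."""
    text = EMAIL_RE.sub("[redacted-email]", text)
    return UUID_RE.sub("[redacted-id]", text)
```

Running redact at the ingestion step, before the excerpt reaches Claude, Slack, or agent_transcript, means a single choke point covers all three destinations.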

12. Open Questions (operator decision needed)

  1. ~~Slack workspace + channel names / IDs.~~ Resolved 2026-05-15 UTC. All four channels confirmed and created in the MooseQuest workspace. Channel IDs locked — see Section 6.1. SC-E2 is now unblocked.

  2. ~~Quiet-hour policy.~~ Resolved 2026-05-15 UTC. ET 08:00–18:30 weekdays. No explicit quiet-hour suppression — SEV1/SEV2 fire at all hours with mobile push. SEV3 is audit-log-only so quiet hours are moot for low/medium.

  3. Splunk forwarder scope: If infrastructure-layer signal gaps materialise post-launch, what EC2 instance does EyeTok target? Operator to provide Splunk instance details if/when v1.1 Splunk layer is activated. (Post-launch — no action needed before 2026-05-23 UTC.)

  4. ~~FreeScout mailbox for incidents.~~ Resolved 2026-05-15 UTC. Confirmed: ops@raxx.app (existing mailbox). No new mailbox required. SC-E5 is now unblocked.

  5. ~~Console incidents page gating.~~ Resolved 2026-05-15 UTC. FLAG_CONSOLE_INCIDENTS gated at launch (default off). SC-E10 PR must include B1 promotion migration in the same PR. Flip on once EyeTok is verified live in prod.


13. Sub-Cards for PM

| ID | Title | Labels | Size | Dependencies |
| --- | --- | --- | --- | --- |
| SC-E1 | Provision raxx-eyetok-staging / raxx-eyetok-prod Heroku dynos + Postgres + Infisical vault paths at /Raxx/Eyetok/ | infra, eyetok, p1-launch | S (1 day) | None |
| SC-E2 | Sentry alert rules → Slack native integration targeting #raxx-ops-alert-sev1 + #raxx-ops-alert-sev2 (zero-code baseline) | alerting, sentry, p1-launch | XS (2 hrs) | Channels confirmed (done; IDs in Section 6.1) |
| SC-E3 | EyeTok: webhook receiver + HMAC validation + SEV channel routing (including ET business-hours SEV2 vs SEV2.5 split) | eyetok, backend, p1-launch | M (2 days) | SC-E1 |
| SC-E4 | EyeTok: Claude agent invocation + triage post to SEV channel; SEV3 escalation-recommended guard | eyetok, ai, p1-launch | M (2 days) | SC-E3 |
| SC-E5 | EyeTok: FreeScout ticket create/update/close lifecycle (mailbox: ops@raxx.app) | eyetok, ticketing, p1-launch | M (2 days) | SC-E1 |
| SC-E6 | EyeTok: Slack Events API subscription (operator thread reply → agent continuation on SEV1/SEV2/SEV2.5) | eyetok, slack, p1-launch | M (2 days) | SC-E4 |
| SC-E7 | EyeTok: incident_state Postgres schema + sev_channel column + deduplication + low-severity batching | eyetok, database | S (1 day) | SC-E1 |
| SC-E8 | Runbook frontmatter standard: add incident_class + component YAML to all existing runbooks | docs, runbooks | S (1 day) | None |
| SC-E9 | EyeTok: runbook matching + GitHub API fetch + no-runbook draft path | eyetok, sop | M (2 days) | SC-E8, SC-E4 |
| SC-E10 | Console: FLAG_CONSOLE_INCIDENTS feature flag + B1 promotion migration + /console/incidents stub (count badge); flag default off, flip on once EyeTok verified in prod | console, frontend, p1-launch | S (1 day) | SC-E3 |
| SC-E11 | Console: /console/incidents full page with SEV badges (SEV1/SEV2/SEV2.5/SEV3), Slack links, ticket links, agent transcript | console, frontend | L (3 days) | SC-E10, SC-E4, SC-E5 |
| SC-E12 | EyeTok: PII redaction layer for log excerpts before Slack post / transcript storage | security, gdpr, eyetok | M (2 days) | SC-E4 |
| SC-E13 | EyeTok: ET business-hours runtime resolver for SEV2 vs SEV2.5 channel routing | eyetok, ops | S (1 day) | SC-E3 |
| SC-E14 | incident_state 90-day purge job + DSR erasure hook | gdpr, eyetok, compliance | S (1 day) | SC-E7 |

MVP for launch (2026-05-23 UTC): SC-E1, SC-E2, SC-E3, SC-E4, SC-E5, SC-E7, SC-E10, SC-E12.
Post-launch v1.1: SC-E6, SC-E8, SC-E9, SC-E11, SC-E13, SC-E14.