Raxx · internal docs

internal · gated

Operational health sweep — 2026-05-23 UTC

Sweep window: 2026-05-23T17:30–18:10 UTC
Author: sre-agent
Context: Personal-use launch-window opening day. CF Access stays on raxx.app. Demo-flow enablement chain (heroku-redis:mini, Redis SSL fix, FLAG_DEMO_SESSION/PAPER_FILL/SESSION_MIGRATE) shipped yesterday.


TL;DR

# | Surface | Status | Note
1 | Heroku apps (Raptor/Console/Velvet/Queue) | Yellow | raxx-queue-prod has no Procfile, so zero dynos are serving. All others up.
2 | CF Pages (Antlers/getraxx/internal-docs) | Yellow | Deploy Antlers failed in 2 of the last 7 runs (2026-05-22 CE-skin wave); last successful deploy 2026-05-21. HTTP 200 on raxx.app confirmed.
3 | Lightsail / tickets.raxx.app | Green | HTTP 200, Postgres steady, no errors in logs.
4 | AWS Lambda + API Gateway | Green | All 3 email functions Active, 0 invocations/errors last 24h, DLQs empty.
5 | Postgres + Redis | Green | All DBs available, connections well under limits. Redis hit-rate=1, 0 evictions, ~4.7MB used.
6 | CF Access + WAF | Green | 12 Access apps covering all required surfaces. Service-token and self-hosted policies in place.
7 | GH Actions workflows | Red | 5 workflows failing: CI-main (gitleaks, 4 consecutive), Release (cascades from CI-main, 16 consecutive), Billing retention cron (curl exit code 3, secrets empty, 5+ consecutive), Trace integrity cron (curl exit code 3, secrets empty, 4+ consecutive), Terraform email-delivery-stack (AWS creds not loaded, 5+ consecutive). WAF Synthetic Probes: 1 run, 1 failure.
8 | Sentry error rate | Yellow | Raptor: 9 unresolved issues. Redis-related cluster (4 issues, 19 events) is stale since 2026-05-22T19:00 UTC, resolved by v42. Alpaca symbol refresh (149 events, still active) is a pre-existing known issue.
9 | Vault / vendor token TTL | Green | All critical vendor paths readable. CF tokens expire 2026-08-02 to 2026-08-17, 70+ days out. Alpaca, Postmark, FreeScout, Heroku, GitHub, Sentry tokens present.
10 | Cron / scheduled tasks | Yellow | Daily card groomer succeeded. Nightly Security Scan failing (GH issue-filer step, 2+ consecutive); issue #2708 already open. Billing retention + Trace integrity crashing on missing env secrets. WAF probes: 1 run, 1 failure (details below).
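
The "70+ days out" figure in row 9 can be checked with simple date arithmetic. The sketch below uses only the sweep date and the CF token expiry window stated in this report; the helper names and the 30-day rotation threshold are hypothetical and only illustrate how such a check might look.

```python
from datetime import date

# Dates taken from this report.
SWEEP_DATE = date(2026, 5, 23)
CF_TOKEN_EXPIRY_WINDOW = (date(2026, 8, 2), date(2026, 8, 17))


def days_until(expiry, today=SWEEP_DATE):
    """Whole days between the sweep date and a token expiry date."""
    return (expiry - today).days


def needs_rotation(expiry, today=SWEEP_DATE, threshold_days=30):
    """True when the expiry falls within the (assumed) rotation threshold."""
    return days_until(expiry, today) <= threshold_days


earliest, latest = CF_TOKEN_EXPIRY_WINDOW
print(days_until(earliest), days_until(latest))  # 71 86
```

Both ends of the window are more than 70 days from the sweep date, which matches the Green status in the table.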

Healthy surfaces (no action needed)


Yellow flags (worth tracking, not urgent)


Red flags (need attention today)

RED-1: CI — main blocked by gitleaks finding (4 consecutive failures, blocks all downstream)

Workflow: CI — main
Consecutive failures: 4 (since 2026-05-22T15:36 UTC)
Failing step: gitleaks full-history scan
Known issue: GitHub issue #2703 is open — [security] HIGH: gitleaks generic-api-key in console/tests/test_rbac_drift_1967.py
Cascade effect: Because CI — main fails on gitleaks (the first job), all downstream jobs are skipped: Frontend tests, Backend tests, SAST (bandit), Dependency audits, and the OpenAPI drift check. The Release workflow is also failing (16 consecutive runs), presumably because it either depends on CI state or runs on the same triggers.

This is the highest-priority operational blocker today. Every merge to main fails CI. The fix PR (#2713 for nightly scan) cannot merge while CI is red. New feature work also cannot be validated.

Remediation path: The finding is in a test file (console/tests/test_rbac_drift_1967.py). Per feedback_bandit_in_tests_policy.md, hardcoded credentials in test files are test fixtures, not exploitable secrets, and this is a documented class of finding. The fix is one of: (a) add a .gitleaks.toml allowlist entry for that specific file+rule combination, or (b) rotate or scrub the test credential. An operator decision is needed on which option to take.

Operator action required: Review #2703 and either add gitleaks allowlist for the test file or scrub the test credential. SRE cannot modify test code (out-of-scope for SRE agent).


RED-2: Billing retention cron failing 5+ consecutive days

Workflow: Billing retention cron
Consecutive failures: 5+ (every run since at least 2026-05-19, all on main branch)
Failing step: Call billing-retention endpoint
Root cause: Exit code 3 from the curl step. The workflow logs show RAPTOR_INTERNAL_API_URL and ADMIN_SERVICE_TOKEN are both empty strings in the runner environment, which indicates the GH Actions secrets are not set (an unset secret expands to an empty string). With an empty base URL, curl is invoked with only the path (/api/internal/jobs/billing-retention). That string has no scheme or host, so curl rejects it as malformed and exits with code 3 before any request is sent.

With FLAG_QUEUE_BILLING=0 set on raxx-api-prod, the billing retention sweep is likely a no-op operationally at pre-launch. However, the cron fails on every run and the ops notification job fires every day, which is active toil.

Operator action required: Set RAPTOR_INTERNAL_API_URL and ADMIN_SERVICE_TOKEN as GH Actions repository secrets (or confirm they should be set from vault). This is a secrets-provisioning step, not a code change. File operator-action card.
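
The empty-secret failure described above can be surfaced with a clearer message before curl runs. The following is a minimal sketch of a preflight check, assuming the workflow could run a small Python step ahead of the curl call. The function names (missing_secrets, build_endpoint, preflight) are hypothetical and not part of the existing workflow; only the two secret names and the endpoint path come from this report.

```python
from urllib.parse import urlsplit

# Secret names as reported in the workflow logs.
REQUIRED = ("RAPTOR_INTERNAL_API_URL", "ADMIN_SERVICE_TOKEN")
# Endpoint path as it appears in the failing curl call.
ENDPOINT_PATH = "/api/internal/jobs/billing-retention"


def missing_secrets(env):
    """Return the required names that are absent or blank in env."""
    return [name for name in REQUIRED if not env.get(name, "").strip()]


def build_endpoint(base_url, path=ENDPOINT_PATH):
    """Join base URL and path, rejecting a base with no scheme or host."""
    base = base_url.strip()
    parts = urlsplit(base)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"base URL has no scheme or host: {base_url!r}")
    return base.rstrip("/") + path


def preflight(env):
    """Return (ok, message): the target URL on success, the reason on failure."""
    missing = missing_secrets(env)
    if missing:
        return False, "missing secrets: " + ", ".join(missing)
    try:
        url = build_endpoint(env["RAPTOR_INTERNAL_API_URL"])
    except ValueError as exc:
        return False, str(exc)
    return True, url
```

Called with the runner environment (for example, preflight(os.environ)), this would report "missing secrets: RAPTOR_INTERNAL_API_URL, ADMIN_SERVICE_TOKEN" in the current state instead of a bare curl exit code. The same check would apply to any other cron that reads these two secrets.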


RED-3: Trace integrity cron failing 4+ consecutive days

Workflow: Trace integrity cron
Consecutive failures: 4 (since 2026-05-20)
Failing step: Call trace-integrity-check endpoint
Root cause: Same pattern as RED-2 — RAPTOR_INTERNAL_API_URL and ADMIN_SERVICE_TOKEN empty in runner. Exit code 3.

Trace integrity is the KMS hash-chain audit described in project_kms_audit_chain_approved.md (operator approved, ~$2/mo). While this cron is failing, the audit chain is not being validated. This is a SEV-3 compliance gap for an approved security control.

Operator action required: Same fix as RED-2. Provision RAPTOR_INTERNAL_API_URL and ADMIN_SERVICE_TOKEN as GH Actions secrets. Both crons likely share the same two secrets.


RED-4: Terraform email-delivery-stack failing 5+ consecutive days

Workflow: Terraform — email-delivery-stack
Consecutive failures: 5 (every run since 2026-05-19)
Failing step: Configure AWS credentials (plan role)
Root cause: "Credentials could not be loaded, please check your action inputs: Could not load credentials from any providers" — the AWS OIDC or static credential that this workflow uses to assume the Terraform plan IAM role is not configured or has expired. The workflow never reaches terraform plan or apply.

This means any Terraform drift on the email delivery stack (Lambda + SQS + API Gateway) is not being detected or applied. Given Postmark was approved out of sandbox (project_postmark_approved.md), this stack is in the active path.

Operator action required: Verify the GH Actions AWS credentials (likely an OIDC trust or AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY secret) are set correctly. If the IAM role/trust policy changed, re-configure. Escalating: this requires AWS IAM access which may need the claude-infisical-bootstrap user or operator break-glass. File operator-action card.


Sentry Raptor — demo-flow errors clarification

Four Sentry issues related to the demo-flow Redis deployment yesterday are unresolved but stale:

Issue | Events | First seen | Last seen
demo: Redis client init failed (No module named 'redis') | 1 | 18:36 UTC 05/22 | 18:36 UTC 05/22
demo.create_session: Redis client not found in app.extensions | 5 | 18:37 UTC 05/22 | 18:44 UTC 05/22
demo.create_session: Redis unavailable — rate-limit check | 4 | 18:56 UTC 05/22 | 19:00 UTC 05/22
Response: {"request_id": "1bd6710a... (demo endpoint 5xx) | 9 | 18:37 UTC 05/22 | 19:00 UTC 05/22

These correlate precisely with the Redis provisioning + SSL fix deployment window. v42 (a4b8530d, 19:09 UTC) resolved them — zero events since. The Response: issue is likely the Sentry capture of the 503s during the brief outage window. Redis is currently healthy (hit-rate=1, 0 evictions, TLS active). These four issues should be marked resolved in Sentry once operator confirms the demo flow works end-to-end today.
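
The staleness claim above reduces to two checks: the four issues sum to the 19 events quoted in the TL;DR, and every last-seen time precedes the v42 release at 19:09 UTC. A minimal sketch of that check follows. The issue labels are shortened, the year is taken from the sweep date, and the helper names are hypothetical; the counts and timestamps are copied from the table above.

```python
from datetime import datetime, timezone

# v42 release time from this report: 2026-05-22 19:09 UTC.
V42_RELEASED = datetime(2026, 5, 22, 19, 9, tzinfo=timezone.utc)

# (shortened issue label, events, last seen in UTC) from the table above.
ISSUES = [
    ("Redis client init failed", 1, "2026-05-22T18:36"),
    ("Redis client not found in app.extensions", 5, "2026-05-22T18:44"),
    ("Redis unavailable (rate-limit check)", 4, "2026-05-22T19:00"),
    ("Response: demo endpoint 5xx", 9, "2026-05-22T19:00"),
]


def parse_utc(stamp):
    """Parse a minute-resolution timestamp and mark it as UTC."""
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M").replace(tzinfo=timezone.utc)


def total_events(issues):
    """Sum the event counts across all issues."""
    return sum(events for _, events, _ in issues)


def all_stale(issues, released=V42_RELEASED):
    """True when every issue was last seen before the release."""
    return all(parse_utc(last_seen) < released for _, _, last_seen in issues)


print(total_events(ISSUES), all_stale(ISSUES))  # 19 True
```

A new event on any of these issues after 19:09 UTC would make all_stale return False, which would be a signal not to bulk-resolve the cluster.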


Pending from brief context (not re-examined)


Action items

# | Action | Owner | Due | Severity
1 | Resolve gitleaks finding in console/tests/test_rbac_drift_1967.py (allowlist or scrub test credential) to unblock CI-main | operator + feature-dev | 2026-05-23 | RED
2 | Provision RAPTOR_INTERNAL_API_URL + ADMIN_SERVICE_TOKEN as GH Actions repository secrets to fix billing-retention + trace-integrity crons | operator | 2026-05-23 | RED
3 | Investigate and fix AWS credential configuration for Terraform email-delivery-stack workflow | operator | 2026-05-24 | RED
4 | Confirm raxx-queue-prod no-Procfile state is intentional (pre-launch gating) or trigger Queue prod deployment | operator | 2026-05-23 | YELLOW
5 | Bulk-resolve stale Sentry Raptor issues last seen before 2026-05-15 (WebAuthn, IntegrityError, db locked, SECRET_KEY) | sre-agent | 2026-05-26 | YELLOW
6 | Investigate WAF Synthetic Probes failure (only 1 run, 1 failure; needs diagnosis) | sre-agent | 2026-05-26 | YELLOW
7 | After operator confirms demo flow works end-to-end: resolve 4 stale Redis Sentry issues from deployment window | sre-agent | today post-test | YELLOW
8 | Merge PR #2713 (nightly scan resilience) after CI-main unblocks | sre-agent | 2026-05-23 | YELLOW