Operational health sweep — 2026-05-23 UTC
Sweep window: 2026-05-23T17:30–18:10 UTC
Author: sre-agent
Context: Personal-use launch-window opening day. CF Access stays on raxx.app. Demo-flow enablement chain (heroku-redis:mini, Redis SSL fix, FLAG_DEMO_SESSION/PAPER_FILL/SESSION_MIGRATE) shipped yesterday.
TL;DR
| # | Surface | Status | Note |
|---|---|---|---|
| 1 | Heroku apps (Raptor/Console/Velvet/Queue) | Yellow | raxx-queue-prod has no Procfile — zero dynos serving. All others up. |
| 2 | CF Pages (Antlers/getraxx/internal-docs) | Yellow | Deploy Antlers failed 2 of its last 7 runs (2026-05-22 CE-skin wave); last successful deploy 2026-05-21. HTTP 200 on raxx.app confirmed. |
| 3 | Lightsail / tickets.raxx.app | Green | HTTP 200, Postgres steady, no errors in logs. |
| 4 | AWS Lambda + API Gateway | Green | All 3 email functions Active, 0 invocations/errors last 24h, DLQs empty. |
| 5 | Postgres + Redis | Green | All DBs available, connections well under limits. Redis hit-rate=1, 0 evictions, ~4.7MB used. |
| 6 | CF Access + WAF | Green | 12 Access apps covering all required surfaces. Service-token and self-hosted policies in place. |
| 7 | GH Actions workflows | Red | 5 workflows failing — CI-main (gitleaks, 4 consecutive), Release (cascades from CI-main, 16 consecutive), Billing retention cron (exit code 3 = missing secrets, 5+ consecutive), Trace integrity cron (exit code 3 = missing secrets, 4+ consecutive), Terraform email-delivery-stack (AWS creds not loaded, 5+ consecutive). WAF Synthetic Probes 1/1 fail. |
| 8 | Sentry error rate | Yellow | Raptor: 9 unresolved issues. Redis-related cluster (4 issues, 19 events) has had no new events since 2026-05-22T19:00 UTC and was resolved by v42. Alpaca symbol refresh (149 events, still active) is a pre-existing known issue. |
| 9 | Vault / vendor token TTL | Green | All critical vendor paths readable. CF tokens expire 2026-08-02 to 2026-08-17 — 70+ days out. Alpaca, Postmark, FreeScout, Heroku, GitHub, Sentry tokens present. |
| 10 | Cron / scheduled tasks | Yellow | Daily card groomer success. Nightly Security Scan failing (GH issue-filer step, 2+ consecutive) — issue #2708 already open. Billing retention + Trace integrity crashing on missing env secret. WAF probes 1/1 fail (details below). |
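The "N consecutive" figures in the table are failure streaks counted back from the most recent run. A minimal sketch of that count, assuming run conclusions are available newest-first as plain strings such as "failure" and "success" (the sample list is illustrative, not pulled from the actual run history, and `failure_streak` is a hypothetical helper name):

```python
from typing import Iterable


def failure_streak(conclusions_newest_first: Iterable[str]) -> int:
    """Count consecutive failures starting from the most recent run."""
    streak = 0
    for conclusion in conclusions_newest_first:
        if conclusion != "failure":
            break  # the first non-failure ends the streak
        streak += 1
    return streak


# Illustrative history only: four failures, then an older success.
example_runs = ["failure", "failure", "failure", "failure", "success", "failure"]
print(failure_streak(example_runs))  # 4
```

When the streak equals the number of runs fetched, the true streak may be longer, which is presumably what the "5+" style figures express.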
Healthy surfaces (no action needed)
- raxx-api-prod — web.1 up 22h, v43 current (FLAG_DEMO_PAPER_FILL + FLAG_DEMO_SESSION_MIGRATE set), 0 errors/crashes in recent logs.
- raxx-api-staging — web.1 up 4h, v558 current (e8034b49, 2026-05-23T13:37 UTC), 0 errors in logs.
- raxx-console-prod — web.1 up 21h, v103 current (FLAG_FLAG_DOCS_ENRICHMENT, 2026-05-21), 0 errors.
- raxx-console-staging — web.1 up 1h, v263 current (2026-05-21), 0 errors.
- raxx-velvet-prod — web.1 up 5h, v10 current (2026-05-10), 0 errors.
- raxx-velvet-staging — web.1 up 20h, v22 current (2026-05-20), 0 errors.
- raxx-queue-staging — web.1 up 9h, v14 current.
- Postgres (all apps): all `Available`, well under connection and storage limits. raxx-api-prod Standard-0: 12/200 connections, 10.2MB / 64GB. raxx-queue-prod Standard-0: 8/200 connections, 7.6MB / 64GB.
- Redis (raxx-api-prod): Mini plan, `available`, hit-rate=1, 0 evicted keys, ~4.7MB used, TLS required and working. Sentry Redis errors (18:36–19:00 UTC 2026-05-22) are stale, resolved by the v42 deploy at 19:09 UTC. No recurrence since.
- CF Access perimeter: 12 apps covering raxx.app, api.raxx.app, console.raxx.app, vault.raxx.app, tickets.raxx.app, internal-docs.raxx.app, getraxx.com, raxx-app.pages.dev, raxx-mockups.pages.dev, and API staging. All `self_hosted` type except OIDC RP and Lambda service-token policy.
- External HTTP checks: raxx.app, getraxx.com, tickets.raxx.app, vault.raxx.app all return HTTP 200.
- AWS Lambda: `raxx-email-inbound-webhook`, `raxx-email-inbound-bridge`, `raxx-email-inbound-authorizer` all `Active / Successful`. 0 invocations, 0 errors last 24h. Both SQS DLQs at depth 0.
- Vault connectivity: Infisical auth OK, all critical secret paths readable: `/MooseQuest/cloudflare` (36 keys), `/MooseQuest/alpaca` (3 keys), `/MooseQuest/postmark` (4 keys), `/MooseQuest/sentry` (5 keys), `/MooseQuest/freescout` (15 keys), `/MooseQuest/heroku` (4 keys), `/MooseQuest/github` (3 keys).
- CF token expiries: CF_CACHE_PURGE_TOKEN expires 2026-08-17, CF_DNS_EDIT_GETRAXX_COM + CF_WAF_EDIT_RAXX_APP expire 2026-08-02. All 70+ days out.
- Sentry trademaster-antlers — 0 unresolved issues.
- Sentry raxx-queue — 0 unresolved issues.
- Daily card groomer — success today at 10:22 UTC.
- CI Digest (daily) — success today at 08:56 UTC.
- Bot token smoke (daily) — success (2 of 2 runs).
- CF token usage lint — 5/5 success.
- OWASP ZAP Baseline Scan — passing (3 recent successes including today).
- Flag drift check (B2) — 12/12 success.
- FreeScout daily backup — 2/2 success.
- Queue Docker smoke — 18/18 success.
- Drift orchestrator cron — 2/2 success.
- Deploy to Heroku — 11/11 success (staging deploys clean).
- Deploy internal docs — last run today: success.
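The "70+ days out" margin on the CF token expiries listed above can be checked with plain date arithmetic against the sweep date. A small sketch, using the expiry dates reported in this sweep; the 30-day alert threshold and the helper names are assumptions for illustration, not an existing policy or tool:

```python
from datetime import date

SWEEP_DATE = date(2026, 5, 23)

# Expiry dates as reported in this sweep.
TOKEN_EXPIRIES = {
    "CF_CACHE_PURGE_TOKEN": date(2026, 8, 17),
    "CF_DNS_EDIT_GETRAXX_COM": date(2026, 8, 2),
    "CF_WAF_EDIT_RAXX_APP": date(2026, 8, 2),
}


def days_until_expiry(expiry: date, today: date = SWEEP_DATE) -> int:
    """Whole days between the sweep date and the token expiry."""
    return (expiry - today).days


def expiring_within(expiries: dict, threshold_days: int, today: date = SWEEP_DATE) -> list:
    """Names of tokens that expire in fewer than threshold_days days."""
    return sorted(
        name
        for name, expiry in expiries.items()
        if days_until_expiry(expiry, today) < threshold_days
    )


for name, expiry in TOKEN_EXPIRIES.items():
    print(name, days_until_expiry(expiry))

# Hypothetical 30-day alert threshold: nothing is flagged on the sweep date.
print(expiring_within(TOKEN_EXPIRIES, 30))  # []
```

The nearest expiry (2026-08-02) is 71 days from the sweep date and the furthest (2026-08-17) is 86 days.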
Yellow flags (worth tracking, not urgent)
- raxx-queue-prod: no Procfile, no dynos. `heroku ps:scale` returns "No process types on raxx-queue-prod." The app has config vars set (QUEUE_SERVICE_TOKEN_*, FLAG_QUEUE_BILLING=0) and a Standard-0 Postgres with 8 active connections, which suggests the DB is being accessed from somewhere else even though prod has had neither a Procfile committed nor a container image deployed. The source of those connections (possibly heroku-postgres health checks, or staging via the RAPTOR_APP_DATABASE_URL alias) is worth confirming. If Queue prod deployment is intentionally deferred, this is the expected pre-launch state, but without a Procfile the Queue service is serving no prod traffic.
  - Recommended action: Confirm with operator whether raxx-queue-prod is intentionally undeployed (pre-launch gating) or a missed deployment. If intended, no action. If missed, deploy the Queue container.
- Deploy Antlers: 2 failures on 2026-05-22. Both failures (run IDs 26297070618 and 26297060936, branch main) fail at "Publish to Cloudflare Pages." The most recent successful Deploy Antlers run was 2026-05-21T23:55 UTC. This is the CE-skin PR wave mentioned in the brief (SRE agent a55ba0e8efa9da10d is already active on this). raxx.app serves HTTP 200 today; the issue is that the code merged on 2026-05-22 (CE-skin PRs) has not been published to CF Pages.
  - Recommended action: Do not duplicate work. Confirm with agent a55ba0e8efa9da10d whether the CF Pages publish failure is resolved. While it is still failing, the live raxx.app keeps serving the last successful deploy from 2026-05-21.
- Sentry Raptor: pre-existing unresolved issues not cleaned up. Four issues from before 2026-05-22 remain unresolved in Sentry (WebAuthn config failure, IntegrityError UNIQUE constraint, database locked, SECRET_KEY not set in staging). All of them last fired between 2026-04-24 and 2026-05-01 and are stale or staging artifacts. They should be resolved in Sentry to keep the error dashboard actionable.
  - Recommended action: File a reliability ticket to bulk-resolve stale Sentry issues last seen before 2026-05-15. No immediate operational impact.
- Nightly Security Scan: "File findings as GitHub issues" step failing 3+ consecutive runs. Issue #2708 is already open. The scanners themselves complete, but the issue-filer step fails. This is the known issue from the brief, and PR #2713 is open for it. The "Detect scan artifact gap" job also fails because it cannot check out the repo (a permissions issue on that separate job). Scan results are therefore being computed but not filed as issues, which is a detection gap per `feedback_nightly_scan_dark_is_high.md`.
  - Recommended action: Merge PR #2713 as soon as CI unblocks (see CI-main red flag below).
- WAF Synthetic Probes: 1/1 failure. The workflow has run exactly once (2026-05-22T16:27 UTC) and that run failed at the "Run WAF synthetic probes" step, so there is no successful run to compare against. The Slack DM-on-failure job succeeded, meaning the failure notification was sent. A single run cannot distinguish a flake from a systematic probe failure.
  - Recommended action: Inspect the probe script against current WAF rules. File a reliability ticket to investigate.
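The stale-issue cleanup recommended in the Sentry Raptor item above amounts to a cutoff filter on each issue's last-seen date. A sketch, assuming issues are available as records with hypothetical `title` and `last_seen` fields; the titles echo this report, but the per-issue dates are placeholders (the report gives only a range for the stale issues), so treat the sample as illustrative:

```python
from datetime import date
from typing import List, NamedTuple


class Issue(NamedTuple):
    title: str
    last_seen: date


# Cutoff named in the recommended action: last seen before 2026-05-15.
CUTOFF = date(2026, 5, 15)


def stale_issues(issues: List[Issue], cutoff: date = CUTOFF) -> List[Issue]:
    """Return issues whose last event predates the cutoff."""
    return [issue for issue in issues if issue.last_seen < cutoff]


# Placeholder dates for illustration only.
sample = [
    Issue("WebAuthn config failure", date(2026, 4, 24)),
    Issue("database locked", date(2026, 5, 1)),
    Issue("Alpaca symbol refresh", date(2026, 5, 23)),  # still active
]

for issue in stale_issues(sample):
    print(issue.title)
```

Filtering on last-seen rather than first-seen keeps a long-running but still-active issue, such as the Alpaca symbol refresh, out of the bulk-resolve set.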
Red flags (need attention today)
RED-1: CI — main blocked by gitleaks finding (4 consecutive failures, blocks all downstream)
Workflow: CI — main
Consecutive failures: 4 (since 2026-05-22T15:36 UTC)
Failing step: gitleaks full-history scan
Known issue: GitHub issue #2703 is open — [security] HIGH: gitleaks generic-api-key in console/tests/test_rbac_drift_1967.py
Cascade effect: Because CI — main fails on gitleaks (first job), all downstream jobs skip: Frontend tests, Backend tests, SAST (bandit), Dependency audits, OpenAPI drift check. The Release workflow is also failing (16 consecutive runs) because it depends on CI state or runs on the same triggers.
This is the highest-priority operational blocker today. Every merge to main fails CI. The fix PR (#2713 for nightly scan) cannot merge while CI is red. New feature work also cannot be validated.
Remediation path: The finding is in a test file (console/tests/test_rbac_drift_1967.py). Per feedback_bandit_in_tests_policy.md — test-file hardcoded credentials are test fixtures, not exploitable. The correct fix is to either: (a) add a .gitleaks.toml allowlist entry for that specific file+rule combination, or (b) rotate or scrub the test credential. This is a documented class of finding. Operator decision needed on whether to scrub/rotate or allowlist.
Operator action required: Review #2703 and either add gitleaks allowlist for the test file or scrub the test credential. SRE cannot modify test code (out-of-scope for SRE agent).
RED-2: Billing retention cron failing 5+ consecutive days
Workflow: Billing retention cron
Consecutive failures: 5+ (every run since at least 2026-05-19, all on main branch)
Failing step: Call billing-retention endpoint
Root cause: Exit code 3 from the curl step. The workflow logs show RAPTOR_INTERNAL_API_URL and ADMIN_SERVICE_TOKEN are both empty strings in the runner environment, so the GH Actions secrets are not set (or not exposed to this workflow). With the base URL empty, curl is called with the bare path /api/internal/jobs/billing-retention, which has no scheme or host. That is consistent with curl's own exit code 3 (URL malformed): the request fails before it reaches Raptor rather than returning a non-200 response.
With FLAG_QUEUE_BILLING=0 set on raxx-api-prod, the billing retention sweep is likely a no-op operationally at pre-launch. However, the cron fails on every run and the ops notification job fires each time, which is active toil.
Operator action required: Set RAPTOR_INTERNAL_API_URL and ADMIN_SERVICE_TOKEN as GH Actions repository secrets (or confirm they should be set from vault). This is a secrets-provisioning step, not a code change. File operator-action card.
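Because both values reach the step as empty strings, a preflight check that fails before any request is attempted would make this failure mode self-describing in the logs. A minimal sketch, assuming the job is driven from Python rather than the workflow's actual curl step; `build_job_url` and `MissingSecretError` are hypothetical names, and only the two variable names and the endpoint path come from this report:

```python
from typing import Mapping
from urllib.parse import urlsplit

REQUIRED_VARS = ("RAPTOR_INTERNAL_API_URL", "ADMIN_SERVICE_TOKEN")


class MissingSecretError(RuntimeError):
    """Raised when a required secret is unset, empty, or unusable."""


def build_job_url(env: Mapping[str, str], job_path: str) -> str:
    """Validate the required secrets, then join base URL and job path."""
    missing = [name for name in REQUIRED_VARS if not env.get(name, "").strip()]
    if missing:
        raise MissingSecretError("missing or empty: " + ", ".join(missing))

    base = env["RAPTOR_INTERNAL_API_URL"].strip().rstrip("/")
    parts = urlsplit(base)
    # An empty or relative base would otherwise yield a host-less URL.
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise MissingSecretError(
            "RAPTOR_INTERNAL_API_URL is not an absolute http(s) URL"
        )
    return base + job_path


if __name__ == "__main__":
    import os

    try:
        print(build_job_url(os.environ, "/api/internal/jobs/billing-retention"))
    except MissingSecretError as exc:
        raise SystemExit(f"preflight failed: {exc}")
```

Naming the missing variables in the error message separates a provisioning gap from an endpoint failure without reading the runner environment dump.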
RED-3: Trace integrity cron failing 4+ consecutive days
Workflow: Trace integrity cron
Consecutive failures: 4 (since 2026-05-20)
Failing step: Call trace-integrity-check endpoint
Root cause: Same pattern as RED-2 — RAPTOR_INTERNAL_API_URL and ADMIN_SERVICE_TOKEN empty in runner. Exit code 3.
Trace integrity is the KMS hash-chain audit per project_kms_audit_chain_approved.md (operator approved ~$2/mo). While this cron keeps failing, the audit chain is not being validated. This is a SEV-3 compliance gap for an approved security control.
Operator action required: Same fix as RED-2: provision RAPTOR_INTERNAL_API_URL and ADMIN_SERVICE_TOKEN as GH Actions secrets. Both crons likely share the same two secrets.
RED-4: Terraform email-delivery-stack failing 5+ consecutive days
Workflow: Terraform — email-delivery-stack
Consecutive failures: 5 (every run since 2026-05-19)
Failing step: Configure AWS credentials (plan role)
Root cause: "Credentials could not be loaded, please check your action inputs: Could not load credentials from any providers" — the AWS OIDC or static credential that this workflow uses to assume the Terraform plan IAM role is not configured or has expired. The workflow never reaches terraform plan or apply.
This means any Terraform drift on the email delivery stack (Lambda + SQS + API Gateway) is not being detected or applied. Given Postmark was approved out of sandbox (project_postmark_approved.md), this stack is in the active path.
Operator action required: Verify the GH Actions AWS credentials (likely an OIDC trust or AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY secret) are set correctly. If the IAM role/trust policy changed, re-configure. Escalating: this requires AWS IAM access which may need the claude-infisical-bootstrap user or operator break-glass. File operator-action card.
Sentry Raptor — demo-flow errors clarification
Four Sentry issues related to the demo-flow Redis deployment yesterday are unresolved but stale:
| Issue | Events | First seen | Last seen |
|---|---|---|---|
| demo: Redis client init failed (No module named 'redis') | 1 | 18:36 UTC 05/22 | 18:36 UTC 05/22 |
| demo.create_session: Redis client not found in app.extensions | 5 | 18:37 UTC 05/22 | 18:44 UTC 05/22 |
| demo.create_session: Redis unavailable — rate-limit check | 4 | 18:56 UTC 05/22 | 19:00 UTC 05/22 |
| Response: {"request_id": "1bd6710a... (demo endpoint 5xx) | 9 | 18:37 UTC 05/22 | 19:00 UTC 05/22 |
These correlate precisely with the Redis provisioning + SSL fix deployment window. v42 (a4b8530d, 19:09 UTC) resolved them — zero events since. The Response: issue is likely the Sentry capture of the 503s during the brief outage window. Redis is currently healthy (hit-rate=1, 0 evictions, TLS active). These four issues should be marked resolved in Sentry once operator confirms the demo flow works end-to-end today.
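The staleness claim rests on two checks: every issue's last event precedes the v42 deploy, and the event counts add up to the 19 reported in the TL;DR. A short sketch of both, using the counts and timestamps from the table above; issue titles are abbreviated and the tuple layout is an illustrative assumption, not a Sentry API shape:

```python
from datetime import datetime, timezone


def utc(hour: int, minute: int) -> datetime:
    """Timestamp on 2026-05-22 UTC, the day of the Redis deployment."""
    return datetime(2026, 5, 22, hour, minute, tzinfo=timezone.utc)


V42_DEPLOY = utc(19, 9)

# (abbreviated title, events, first seen, last seen), from the table above
REDIS_ISSUES = [
    ("demo: Redis client init failed", 1, utc(18, 36), utc(18, 36)),
    ("demo.create_session: Redis client not found", 5, utc(18, 37), utc(18, 44)),
    ("demo.create_session: Redis unavailable", 4, utc(18, 56), utc(19, 0)),
    ("Response: (demo endpoint 5xx)", 9, utc(18, 37), utc(19, 0)),
]

total_events = sum(issue[1] for issue in REDIS_ISSUES)
all_before_deploy = all(issue[3] < V42_DEPLOY for issue in REDIS_ISSUES)
window = (
    min(issue[2] for issue in REDIS_ISSUES),
    max(issue[3] for issue in REDIS_ISSUES),
)

print(total_events)       # 19
print(all_before_deploy)  # True
print(window[0].strftime("%H:%M"), window[1].strftime("%H:%M"))  # 18:36 19:00
```

Both checks only confirm the absence of events after the deploy; they do not replace the operator's end-to-end confirmation of the demo flow.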
Pending from brief context (not re-examined)
- Deploy Antlers failures (CE-skin wave): Delegated to agent a55ba0e8efa9da10d. Live raxx.app returns HTTP 200 but is serving pre-CE-skin build from 2026-05-21.
- Nightly security scan resilience PR #2713: Already open. Merge blocked by CI-main / gitleaks (RED-1 above).
Action items
| # | Action | Owner | Due | Severity |
|---|---|---|---|---|
| 1 | Resolve gitleaks finding in console/tests/test_rbac_drift_1967.py: allowlist or scrub test credential to unblock CI-main | operator + feature-dev | 2026-05-23 | RED |
| 2 | Provision RAPTOR_INTERNAL_API_URL + ADMIN_SERVICE_TOKEN as GH Actions repository secrets to fix billing-retention + trace-integrity crons | operator | 2026-05-23 | RED |
| 3 | Investigate and fix AWS credential configuration for Terraform email-delivery-stack workflow | operator | 2026-05-24 | RED |
| 4 | Confirm raxx-queue-prod no-Procfile state is intentional (pre-launch gating) or trigger Queue prod deployment | operator | 2026-05-23 | YELLOW |
| 5 | Bulk-resolve stale Sentry Raptor issues last seen before 2026-05-15 (WebAuthn, IntegrityError, db locked, SECRET_KEY) | sre-agent | 2026-05-26 | YELLOW |
| 6 | Investigate WAF Synthetic Probes failure (only 1 run, 1 failure — need diagnosis) | sre-agent | 2026-05-26 | YELLOW |
| 7 | After operator confirms demo flow works end-to-end: resolve 4 stale Redis Sentry issues from deployment window | sre-agent | today post-test | YELLOW |
| 8 | Merge PR #2713 (nightly scan resilience) after CI-main unblocks | sre-agent | 2026-05-23 | YELLOW |