Raxx · internal docs


Velvet First-Deploy Triage — 2026-05-05

Status: RESOLVED — both staging and prod deployed, migrations applied, healthz passing
Last updated: 2026-05-05 04:15 UTC
Author: sre-agent


Summary

Velvet (credential-rotation service) first deploy completed across staging and prod. Three sequential blockers required three hotfix PRs before migrations could run. Both environments are now healthy with all 3 DB migrations applied.

Related issues: #1129 #1130 #1136 #1137 #1140


Final state

Environment  App                  Release  Slug      Migrations  Healthz
Staging      raxx-velvet-staging  v15      2ef77cf3  3 applied   {"service":"velvet","status":"ok"}
Prod         raxx-velvet-prod     v9       2ef77cf3  3 applied   {"service":"velvet","status":"ok"}

Tables in both DBs: rotation_jobs, rotation_job_consumers, velvet_schema_migrations


Action queue

# Action Status Notes
1 git pull origin main (get -e . fix from #1138) DONE HEAD 9ded331
2 git subtree push --prefix velvet heroku-velvet-staging main (v14, -e . fix) DONE Staging v14, slug 86ceec1d
3 Verify staging boot / healthz DONE {"service":"velvet","status":"ok"} — 200 OK
4 heroku run python -m velvet.db.migrate (staging) BLOCKED → FIXED ModuleNotFoundError: No module named 'sqlalchemy' — fixed by PR #1141
5 Pull sqlalchemy fix (PR #1141, commit 98f2418) DONE git fetch origin main + deploy-temp branch
6 Subtree push staging with sqlalchemy fix DONE Staging v15, slug 2ef77cf3
7 Run migrations on staging DONE 3 migrations applied cleanly
8 Verify staging migration tables DONE ['rotation_job_consumers', 'rotation_jobs', 'velvet_schema_migrations']
9 Staging healthz smoke test DONE {"service":"velvet","status":"ok"} — 200
10 Subtree push to prod DONE Prod v9, slug 2ef77cf3 (same as staging)
11 Run migrations on prod DONE 3 migrations applied cleanly
12 Verify prod migration tables DONE ['rotation_job_consumers', 'rotation_jobs', 'velvet_schema_migrations']
13 Prod healthz smoke test DONE {"service":"velvet","status":"ok"} — 200

Staging deploy (final)

Migration output (verbatim):

2026-05-05 04:10:37,530 INFO __main__: Applying migration: 001_create_rotation_jobs_v2.sql
2026-05-05 04:10:37,618 INFO __main__: Applied migration: 001_create_rotation_jobs_v2.sql
2026-05-05 04:10:37,619 INFO __main__: Applying migration: 002_create_rotation_job_consumers.sql
2026-05-05 04:10:37,642 INFO __main__: Applied migration: 002_create_rotation_job_consumers.sql
2026-05-05 04:10:37,643 INFO __main__: Applying migration: 003_indexes.sql
2026-05-05 04:10:37,661 INFO __main__: Applied migration: 003_indexes.sql
2026-05-05 04:10:37,662 INFO __main__: Migrations complete.
Applied 3 migration(s): 001_create_rotation_jobs_v2.sql, 002_create_rotation_job_consumers.sql, 003_indexes.sql

Prod deploy (final)

Migration output (verbatim):

2026-05-05 04:13:13,991 INFO __main__: Applying migration: 001_create_rotation_jobs_v2.sql
2026-05-05 04:13:14,088 INFO __main__: Applied migration: 001_create_rotation_jobs_v2.sql
2026-05-05 04:13:14,089 INFO __main__: Applying migration: 002_create_rotation_job_consumers.sql
2026-05-05 04:13:14,116 INFO __main__: Applied migration: 002_create_rotation_job_consumers.sql
2026-05-05 04:13:14,117 INFO __main__: Applying migration: 003_indexes.sql
2026-05-05 04:13:14,146 INFO __main__: Applied migration: 003_indexes.sql
2026-05-05 04:13:14,146 INFO __main__: Migrations complete.
Applied 3 migration(s): 001_create_rotation_jobs_v2.sql, 002_create_rotation_job_consumers.sql, 003_indexes.sql

Blockers encountered

Blocker 1: missing velvet package registration (-e .)

PR: #1138 (closes #1136)
Error: ModuleNotFoundError: No module named 'velvet' — gunicorn crash on every boot
Root cause: The Heroku-24 buildpack runs only pip install -r requirements.txt; it does not run pip install . even when setup.cfg is present. Without -e . in requirements.txt, the velvet package was never registered in site-packages.
Fix: Added -e . as the last line of velvet/requirements.txt.
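
A minimal sketch of the change, assuming the repo layout implied above (velvet/ is the subtree root that Heroku builds):

# PR #1138 fix: append -e . so the buildpack's `pip install -r requirements.txt`
# also installs the velvet package itself.
echo "-e ." >> velvet/requirements.txt

# Local sanity check under the same assumption: editable-install the subtree, then import.
( cd velvet && pip install -e . >/dev/null && python -c "import velvet" )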

Blocker 2: missing sqlalchemy dependency

PR: #1141 (closes #1140)
Error: ModuleNotFoundError: No module named 'sqlalchemy' — migration runner crash
Root cause: velvet/db/__init__.py imports velvet.db.models at package-import time, and velvet/db/models.py imports sqlalchemy. Neither velvet/requirements.txt nor the install_requires list in velvet/setup.cfg included sqlalchemy.
Fix: Added sqlalchemy>=2.0,<3.0 to both files.
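
A quick way to confirm the pin landed in both files (the pin string is the one added above; the import check mirrors the failure path described in the root cause):

# Both dependency declarations should now list sqlalchemy.
grep -n "sqlalchemy" velvet/requirements.txt velvet/setup.cfg

# Importing velvet.db pulls in velvet.db.models and therefore sqlalchemy.
python -c "import velvet.db" && echo "velvet.db imports cleanly"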


Sequence log (all times UTC)

Time (UTC) Action Outcome
2026-05-05 01:09 PR #1128 merged — setup.cfg + psycopg2-binary First subtree push attempted
03:09 v11–v13 deploys (slug 2d03871a, before the -e . fix) Worker crash loop — No module named 'velvet'
03:43 Last crash before fix Dyno state: crashed
03:50 v14 deploy — subtree push with -e . fix (PR #1138, main@9ded331) Slug 86ceec1d — gunicorn boots, healthz 200
03:52 heroku run python -m velvet.db.migrate on staging BLOCKED — No module named 'sqlalchemy'
03:53 Escalation filed; prod deploy gated
04:06 PR #1141 merged — sqlalchemy fix main@98f2418
04:09 Subtree push to staging with sqlalchemy (temp branch → Heroku main) v15 deployed, slug 2ef77cf3
04:10 Migrations on staging 3 applied in 131ms
04:10 Healthz staging {"service":"velvet","status":"ok"} — 200
04:12 Subtree push to prod (same slug 2ef77cf3) v9 deployed
04:13 Migrations on prod 3 applied in 155ms
04:13 Healthz prod {"service":"velvet","status":"ok"} — 200
04:15 Deploy chain complete Both envs healthy

Total wall-clock time from PR #1128 merge → prod healthy: ~3h 6m (01:09 UTC merge → 04:15 UTC prod confirmed)


Issues status

Issue Action Status
#1129 CI smoke test for velvet OPEN — unblocked; test can now run against live DB
#1130 Release phase (migrate in Procfile) OPEN — hold for next sprint
#1136 -e . fix CLOSED — merged PR #1138, deployed in v14
#1137 Post-deploy healthz CI check OPEN — hold for next sprint
#1140 sqlalchemy missing from requirements CLOSED — merged PR #1141, deployed in v15/v9

Operational notes

Heroku subtree push branch behavior

git subtree push --prefix <dir> <remote> <local-branch> pushes the subtree split to a remote branch of the same name as <local-branch>. If that name is not main or master, Heroku skips the build. The reliable pattern is:

# Compute subtree split hash
SPLIT=$(git subtree split --prefix velvet HEAD)
# Push directly to Heroku's main
git push heroku-velvet-staging $SPLIT:refs/heads/main

Or use a local branch named main:

git subtree push --prefix velvet heroku-velvet-staging main

(This works when your current local branch is also named main.)
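
For completeness, a hedged sketch that combines the split-hash pattern with a post-push check (the app name and expected release are from this deploy chain; the pattern itself is generic):

# Push the velvet subtree split straight to Heroku's main branch.
SPLIT=$(git subtree split --prefix velvet HEAD)
git push heroku-velvet-staging "$SPLIT:refs/heads/main"

# Confirm a new release built (v15, slug 2ef77cf3 in this deploy chain).
heroku releases --app raxx-velvet-staging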

Migration idempotency

The velvet_schema_migrations table tracks applied migrations. Re-running python -m velvet.db.migrate is safe — already-applied migrations are skipped. Confirmed by design; not tested in this session.
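
A hedged sketch of how to confirm that when it is eventually tested (the tracking table's columns are not documented here, hence SELECT *):

# Re-running the migration runner should skip everything already applied.
heroku run --app raxx-velvet-staging --no-tty "python -m velvet.db.migrate"

# Inspect the tracking table directly.
heroku pg:psql --app raxx-velvet-staging --command "SELECT * FROM velvet_schema_migrations;"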

heroku run syntax

The correct flag syntax for non-TTY one-shot commands:

heroku run --app <appname> --no-tty "<command>"

(Not heroku run -a <appname> <cmd> — in the CLI version used here, -a was rejected with "Nonexistent flag: -a"; use the long --app form.)
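
As used in this deploy chain, the migration runs therefore look like:

heroku run --app raxx-velvet-staging --no-tty "python -m velvet.db.migrate"
heroku run --app raxx-velvet-prod --no-tty "python -m velvet.db.migrate"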


Addendum — DKIM TXT record remediation (#1144)

Time: 2026-05-05 04:27 UTC
Severity: P2 HIGH (pre-condition for Postmark sandbox exit)
Issue: https://github.com/moosequest/TradeMasterAPI/issues/1144
Author: sre-agent

Finding

Security-agent triage (2026-05-05) confirmed pm._domainkey.raxx.app returned NXDOMAIN on 8.8.8.8. The canonical Postmark DKIM selector was absent from the Cloudflare DNS zone for raxx.app, even though a date-stamped selector (20260430051323pm._domainkey.raxx.app) was already present and verified by Postmark.

Pre-state (before fix)

DNS state at 04:20 UTC:

20260430051323pm._domainkey.raxx.app  TXT  k=rsa; p=MIGfMA0GCSqGSIb3DQEBAQUAA4GNA...  (PRESENT — CF record ID 83a6227b782ce931f97fd3d889ea28f6)
pm._domainkey.raxx.app                TXT  (ABSENT — NXDOMAIN)

Postmark domain state (before fix):

DKIMVerified: True   (against date-stamped selector only)
DKIMHost: 20260430051323pm._domainkey.raxx.app
DKIMUpdateStatus: Verified

Postmark API response (DKIMTextValue — authoritative)

Fetched from GET https://api.postmarkapp.com/domains/4616861 using X-Postmark-Account-Token from vault at /MooseQuest/postmark/POSTMARK_ACCOUNT_API_KEY:

DKIMHost:      20260430051323pm._domainkey.raxx.app
DKIMTextValue: k=rsa; p=MIGfMA0GCSqGSIb3DQEBAQUAA4GNADCBiQKBgQCxX9MjFeCMegRCNnuM0DhSgLBL7WfOAISymao02MgPq20oXEQJILhSWQP9xJLz4Ie3aMJpJJXd9cKkRQb/rn6cMxTFUrgzyHIoznWTekXf5IU0orPm4tibKe9GZL0Rr+OxVwjcZttZ4modiJeCb+m1Yg2VGkdfrYSOxiCPwE4GAQIDAQAB

This is the same value used for both the date-stamped and canonical selectors (same key pair, both selectors point to the same public key).

Fix applied

Created TXT record in Cloudflare via DNS-edit token (CLOUDFLARE_EDIT_DNS):

Zone:    raxx.app (ID: f12dbb5cac57d5591a5058874498a6d1)
Name:    pm._domainkey.raxx.app
Type:    TXT
Content: k=rsa; p=MIGfMA0GCSqGSIb3DQEBAQUAA4GNADCBiQKBgQCxX9MjFeCMegRCNnuM0DhSgLBL7WfOAISymao02MgPq20oXEQJILhSWQP9xJLz4Ie3aMJpJJXd9cKkRQb/rn6cMxTFUrgzyHIoznWTekXf5IU0orPm4tibKe9GZL0Rr+OxVwjcZttZ4modiJeCb+m1Yg2VGkdfrYSOxiCPwE4GAQIDAQAB
TTL:     1 (auto)
CF Record ID: e3963b1bf40bda34e11c99274915023c
Created: 2026-05-05T04:27:26.274254Z

API: POST https://api.cloudflare.com/client/v4/zones/f12dbb5cac57d5591a5058874498a6d1/dns_records
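
A hedged sketch of that create call (zone ID, record name, and TTL are from this remediation; $CLOUDFLARE_EDIT_DNS stands in for however the DNS-edit token is exported, and the content is abbreviated — use the full DKIMTextValue shown above):

# Create the canonical Postmark DKIM selector as a TXT record in the raxx.app zone.
curl -sS -X POST \
  "https://api.cloudflare.com/client/v4/zones/f12dbb5cac57d5591a5058874498a6d1/dns_records" \
  -H "Authorization: Bearer $CLOUDFLARE_EDIT_DNS" \
  -H "Content-Type: application/json" \
  --data '{"type":"TXT","name":"pm._domainkey.raxx.app","content":"k=rsa; p=MIGf...","ttl":1}'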

DNS verification

Checked at 04:27 UTC (~15s after record creation — Cloudflare propagated immediately):

$ dig +short TXT pm._domainkey.raxx.app @1.1.1.1
"k=rsa; p=MIGfMA0GCSqGSIb3DQEBAQUAA4GNADCBiQKBgQCxX9MjFeCMegRCNnuM0DhSgLBL7WfOAISymao02MgPq20oXEQJILhSWQP9xJLz4Ie3aMJpJJXd9cKkRQb/rn6cMxTFUrgzyHIoznWTekXf5IU0orPm4tibKe9GZL0Rr+OxVwjcZttZ4modiJeCb+m1Yg2VGkdfrYSOxiCPwE4GAQIDAQAB"

$ dig +short TXT pm._domainkey.raxx.app @8.8.8.8
"k=rsa; p=MIGfMA0GCSqGSIb3DQEBAQUAA4GNADCBiQKBgQCxX9MjFeCMegRCNnuM0DhSgLBL7WfOAISymao02MgPq20oXEQJILhSWQP9xJLz4Ie3aMJpJJXd9cKkRQb/rn6cMxTFUrgzyHIoznWTekXf5IU0orPm4tibKe9GZL0Rr+OxVwjcZttZ4modiJeCb+m1Yg2VGkdfrYSOxiCPwE4GAQIDAQAB"

Both public resolvers return the correct value. Local default resolver timed out (consistent with pre-existing behavior for all raxx.app DNS queries in this environment — not a propagation issue; the same timeout occurred for the pre-existing 20260430051323pm._domainkey record).

Postmark verification status

Postmark verifies DKIM against its own date-stamped selector (20260430051323pm._domainkey.raxx.app), not the canonical pm._domainkey selector. That date-stamped record was already present and verified before this remediation. Post-fix domain state:

DKIMVerified:      True
DKIMHost:          20260430051323pm._domainkey.raxx.app
DKIMUpdateStatus:  Verified
SPFVerified:       True
ReturnPathDomainVerified: True

The pm._domainkey canonical selector is now present in DNS, satisfying the security-agent finding and unblocking sandbox-exit verification by recipient mail providers that check standard Postmark selectors.

Records untouched

SPF (v=spf1 a mx include:spf.mtasv.net ~all) and DMARC (v=DMARC1; p=quarantine; rua=mailto:kris@moosequest.net; fo=1) were not modified. pm-bounces.raxx.app CNAME (pm.mtasv.net) was not modified.

Operator action required

Kristerpher to complete Postmark sandbox-exit approval in the Postmark dashboard. Postmark's internal sandbox review does not require any further DNS changes — all three records (SPF, DKIM via date-stamped selector, Return-Path CNAME) are verified.

Issue status

Issue #1144 closed — remediation applied, canonical selector live, no recurrence risk (both selectors now exist; Postmark's own selector was never missing).


Addendum — FREESCOUT_OPERATIONS_MAILBOX_ID provisioning

Time: 2026-05-05 (approx. 17:00 UTC)
Operator action that preceded this: Kristerpher created the Operations mailbox in the FreeScout Admin UI
SOP: docs/ops/runbooks/freescout-operations-mailbox-provisioning.md
Author: sre-agent

Mailbox verification

Retrieved via GET https://tickets.raxx.app/api/mailboxes. FreeScout uses a custom auth header, X-FreeScout-API-Key, rather than Authorization: Bearer.

Response confirmed two mailboxes:

id=1  name=Raxx Support   email=support@raxx.app
id=2  name=Operations     email=ops@raxx.app

Mailbox 2 ("Operations", ops@raxx.app) matches the expected operations pattern. HTTP 200.

Note: GET /api/mailboxes/2 returns HTTP 405. Mailbox lookup must use GET /api/mailboxes (list) and filter by ID. The runbook does not call this out explicitly; noting it here for reference.
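
A minimal sketch of the list-and-filter lookup, assuming the API key is exported as $FREESCOUT_API_KEY (the header name and endpoint are as described above; the jq filter avoids assuming the exact response nesting):

# Per-ID GET returns 405, so list all mailboxes and pick the Operations one by email.
curl -sS -H "X-FreeScout-API-Key: $FREESCOUT_API_KEY" "https://tickets.raxx.app/api/mailboxes" \
  | jq '[.. | objects | select(.email? == "ops@raxx.app")]'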

Infisical vault write

Heroku env var set

Both apps set with heroku config:set FREESCOUT_OPERATIONS_MAILBOX_ID=2 (stdout silenced per policy).

Readback verification:

App heroku config:get FREESCOUT_OPERATIONS_MAILBOX_ID Result
raxx-console-staging returned 2
raxx-console-prod returned 2
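
A minimal sketch of the set-and-readback pattern used above (app names are from this provisioning; stdout of the set is silenced per policy):

for app in raxx-console-staging raxx-console-prod; do
  heroku config:set FREESCOUT_OPERATIONS_MAILBOX_ID=2 --app "$app" >/dev/null 2>&1
  printf '%s: ' "$app"
  heroku config:get FREESCOUT_OPERATIONS_MAILBOX_ID --app "$app"   # expect: 2
done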

Dyno restarts

heroku dyno:restart -a raxx-console-staging  →  "Restarting all dynos... done"
heroku dyno:restart -a raxx-console-prod     →  "Restarting all dynos... done"

Both dynos confirmed up (web.* present in heroku ps output).

Health checks (post-restart)

URL HTTP code Interpretation
https://console-staging.raxx.app/health 302 CF Access redirect — healthy
https://console.raxx.app/health 302 CF Access redirect — healthy
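
A quick reproduction of those checks (URLs from the table above; 302 is the expected CF Access redirect):

# Expect HTTP 302 (Cloudflare Access redirect) from both console health endpoints.
for url in https://console-staging.raxx.app/health https://console.raxx.app/health; do
  printf '%s -> ' "$url"
  curl -s -o /dev/null -w '%{http_code}\n' "$url"
done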

Log-quiet check

Tailed raxx-console-staging logs for ~12 seconds after restart. Grep for FREESCOUT_OPERATIONS|missing.*mailbox|mailbox.*id|WARNING|WARN|startup|error|fatal returned zero matches. No startup warnings about the new env var.
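
A hedged sketch of the same check (the grep pattern is the one quoted above; the 12-second window matches what was tailed):

# Tail staging logs briefly and look for startup noise about the new env var; no matches = quiet.
timeout 12 heroku logs --tail --app raxx-console-staging \
  | grep -Ei 'FREESCOUT_OPERATIONS|missing.*mailbox|mailbox.*id|WARNING|WARN|startup|error|fatal' \
  || echo "no matches — log-quiet"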

Flags unchanged

FLAG_CONSOLE_INVESTIGATE_FROM_STATUS and FLAG_CONSOLE_ALERTS_AUTO_TICKET remain default-off. The env var is staged idle. Operator will flip flags via /console/flags (staging first, ~24h soak, then prod) per docs/ops/runbooks/auto-ticketing-rollout.md.

Post-provisioning checklist status

Addendum — #990 CONSOLE_CROSS_ENV_READ_TOKEN provisioning

Time: 2026-05-05 11:11 UTC
Issue: https://github.com/moosequest/TradeMasterAPI/issues/990
Epic: https://github.com/moosequest/TradeMasterAPI/issues/988 (console self-deploy)
Sub-card: S2 of the self-deploy chain (#988)
Author: sre-agent

Summary

Provisioned CONSOLE_CROSS_ENV_READ_TOKEN — the service token that authenticates cross-env deploy status reads between the console apps and the CF Worker (console-deploy-shim). This is the last open sub-card of the #988 self-deploy epic. Token generation followed §7.4 of docs/architecture/console-self-deploy-web-layer.md.

What was done

Token generation

Infisical storage

Both paths updated from version 1 (placeholder) to version 2 (fresh token). Same token value in both paths — verified by SHA-256 hash comparison.

Infisical path Environment Version Status
/Console/prod/CONSOLE_CROSS_ENV_READ_TOKEN prod 2 PATCHED + verified
/Console/staging/CONSOLE_CROSS_ENV_READ_TOKEN prod 2 PATCHED + verified

Vault host: https://vault.raxx.app
Project ID: 29b77751-f761-4afa-b3fa-2c842988f95c

Heroku config vars (silenced)

App Result
raxx-console-staging config:get hash match confirmed
raxx-console-prod config:get hash match confirmed
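
A hedged sketch of the hash-match check, comparing each app's readback against the freshly generated value without printing either secret (assumes the new token is exported as $NEW_TOKEN):

# Compare SHA-256 of the expected token against each app's config var; secrets never hit stdout.
expected=$(printf '%s' "$NEW_TOKEN" | sha256sum | cut -d' ' -f1)
for app in raxx-console-staging raxx-console-prod; do
  actual=$(heroku config:get CONSOLE_CROSS_ENV_READ_TOKEN --app "$app" | tr -d '\n' | sha256sum | cut -d' ' -f1)
  [ "$expected" = "$actual" ] && echo "$app: hash match" || echo "$app: MISMATCH"
done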

Commands used (stdout silenced per policy):

heroku config:set CONSOLE_CROSS_ENV_READ_TOKEN=<value> --app raxx-console-staging >/dev/null 2>&1
heroku config:set CONSOLE_CROSS_ENV_READ_TOKEN=<value> --app raxx-console-prod >/dev/null 2>&1

CF Worker secret

Dyno restarts

heroku ps:restart --app raxx-console-staging  →  done  (web.1: up ~12s after restart)
heroku ps:restart --app raxx-console-prod     →  done  (web.1: up ~10s after restart)

Endpoint verification (staging)

FLAG_CONSOLE_DEPLOY_XENV_READ temporarily enabled on staging for verification, then unset.

Test Authorization Expected Actual
Flag off Bearer valid-token 501 501
Valid token + unknown deploy_id (flag on) Bearer valid-token 404 404 {"error":"not_found"}
No token (flag on) none 401 401 {"error":"unauthorized"}
Wrong token (flag on) Bearer wrongtoken 401 401 {"error":"unauthorized"}

Endpoint under test: GET /api/internal/deploys/<id>/xenv
Host: https://raxx-console-staging-58974c77617a.herokuapp.com
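
A hedged reproduction of the auth checks (host and path are from this verification; $TOKEN is the provisioned value, "does-not-exist" is an arbitrary unknown deploy_id, and FLAG_CONSOLE_DEPLOY_XENV_READ must be on for the 401/404 cases):

BASE=https://raxx-console-staging-58974c77617a.herokuapp.com
DEPLOY_ID=does-not-exist   # arbitrary unknown deploy_id; expect 404 with a valid token

# Valid token, unknown deploy_id -> 404
curl -s -o /dev/null -w '%{http_code}\n' -H "Authorization: Bearer $TOKEN" \
  "$BASE/api/internal/deploys/$DEPLOY_ID/xenv"

# No token -> 401; wrong token -> 401
curl -s -o /dev/null -w '%{http_code}\n' "$BASE/api/internal/deploys/$DEPLOY_ID/xenv"
curl -s -o /dev/null -w '%{http_code}\n' -H "Authorization: Bearer wrongtoken" \
  "$BASE/api/internal/deploys/$DEPLOY_ID/xenv"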

Rotation runbook

To rotate this token:

  1. openssl rand -hex 32 — generate new token.
  2. PATCH Infisical /Console/prod/CONSOLE_CROSS_ENV_READ_TOKEN (env: prod).
  3. PATCH Infisical /Console/staging/CONSOLE_CROSS_ENV_READ_TOKEN (env: prod, same value).
  4. heroku config:set CONSOLE_CROSS_ENV_READ_TOKEN=<new> --app raxx-console-staging >/dev/null 2>&1
  5. heroku config:set CONSOLE_CROSS_ENV_READ_TOKEN=<new> --app raxx-console-prod >/dev/null 2>&1
  6. printf '%s' "$NEW_TOKEN" | CLOUDFLARE_API_TOKEN="$CF_WORKER_DEPLOY" npx wrangler@4 secret put CONSOLE_CROSS_ENV_READ_TOKEN --name=console-deploy-shim
  7. CLOUDFLARE_API_TOKEN="$CF_WORKER_DEPLOY" npx wrangler@4 deploy --config infra/cf-workers/console-deploy-shim/wrangler.toml

Note: steps 4-5 trigger automatic dyno restarts (Heroku config change). Explicit ps:restart is not required.

What remains before round-trip testing (#994, #995)

Issue status

Issue #990 closed — token provisioned in all three locations (Infisical both paths, both Heroku apps, CF Worker secret). All three hold the same freshly generated token.