Trace Signing Key Rotation Runbook
Component: Raptor (raxx-api-prod / raxx-api-staging)
Feature flag: FLAG_TRACE_SYS_SIGNING
Issue: #515 (SC-12, epic #500)
Ops contact: ops@raxx.app
Overview
Every sys_* trace event is signed with an Ed25519 private key specific to
the subsystem that emitted it. Keys are stored in Infisical vault and surfaced
to the Raptor dyno as environment variables. The signing is gated by
FLAG_TRACE_SYS_SIGNING; when the flag is off, sig and sig_key_version
are null on all inserts.
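The gating rule above can be sketched as follows. This is a minimal illustration, not the code in trace_signing_service.py: the function names, the hex encoding of the signature, and the assumption that only the value "1" enables the flag are all hypothetical. The Ed25519 signer is passed in as a callable so that the sketch stays within the standard library.

```python
from typing import Callable, Mapping, Optional, Tuple


def flag_is_on(env: Mapping[str, str]) -> bool:
    # Assumption: the flag counts as enabled only when it is exactly "1",
    # matching the value this runbook sets with heroku config:set.
    return env.get("FLAG_TRACE_SYS_SIGNING") == "1"


def signature_fields(
    env: Mapping[str, str],
    payload: bytes,
    key_version: str,
    sign: Callable[[bytes], bytes],
) -> Tuple[Optional[str], Optional[str]]:
    """Return (sig, sig_key_version) for one sys_* trace insert."""
    if not flag_is_on(env):
        # Flag off: both columns stay null on the insert.
        return None, None
    # Flag on: sign the payload and record which key version was used.
    # Hex encoding of the signature is an assumption of this sketch.
    return sign(payload).hex(), key_version
```

Passing the signer as a parameter also makes the flag-off path easy to test without any key material.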
Each subsystem has its own key:
| Subsystem | Env var prefix |
|---|---|
| mq-a:scheduler | TRACE_SIGNING_PRIVKEY_MQA_SCHEDULER |
| raptor:order-router | TRACE_SIGNING_PRIVKEY_RAPTOR_ORDER_ROUTER |
| raptor:paper-gate | TRACE_SIGNING_PRIVKEY_RAPTOR_PAPER_GATE |
| raptor:trading-runtime | TRACE_SIGNING_PRIVKEY_RAPTOR_TRADING_RUNTIME |
1. Initial provisioning (first-time setup)
Before enabling FLAG_TRACE_SYS_SIGNING, complete these steps for each
subsystem.
1a. Generate an Ed25519 keypair seed
The seed is a random 32-byte value encoded as base64url (URL-safe base64, no padding). Generate it locally with Python:
import os, base64
seed = os.urandom(32)
print(base64.urlsafe_b64encode(seed).rstrip(b"=").decode())
Do NOT store this output as a plain string anywhere except the Infisical vault and the dyno env vars (§1c). Do NOT commit it to any file.
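Before storing a seed, it can be worth confirming that the encoded value round-trips to exactly 32 bytes. The sketch below uses only the standard library; the helper names encode_seed and decode_seed are hypothetical and are not part of the signing service.

```python
import base64
import os


def encode_seed(seed: bytes) -> str:
    # Same encoding as the generation snippet: URL-safe base64, padding stripped.
    return base64.urlsafe_b64encode(seed).rstrip(b"=").decode()


def decode_seed(encoded: str) -> bytes:
    # Restore the "=" padding that was stripped at generation time.
    padded = encoded + "=" * (-len(encoded) % 4)
    seed = base64.urlsafe_b64decode(padded)
    if len(seed) != 32:
        raise ValueError(f"expected a 32-byte seed, got {len(seed)} bytes")
    return seed
```

A 32-byte seed always encodes to 43 characters in this format, which gives a quick visual check without printing the value itself.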
1b. Store in Infisical
Vault path convention: /raxx/v1/{env}/raptor/TRACE_SIGNING_PRIVKEY_{SUBSYSTEM}_{VERSION}
Example for staging, mq-a scheduler, version v1:
/raxx/v1/staging/raptor/TRACE_SIGNING_PRIVKEY_MQA_SCHEDULER_V1
Store the base64url seed value under that path. Use the Infisical web UI at
vault.raxx.app or the Infisical CLI.
Repeat for every subsystem listed in the Overview table and for both environments (staging and prod).
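The path convention can be captured in a small helper so that paths are not typed by hand. This is a sketch: the function names are hypothetical, and the mapping is copied from the Overview table as an explicit dictionary because the subsystem-to-prefix rule is not uniform (mq-a becomes MQA, while order-router becomes ORDER_ROUTER).

```python
# Env var prefixes copied from the Overview table.
SUBSYSTEM_ENV_PREFIX = {
    "mq-a:scheduler": "TRACE_SIGNING_PRIVKEY_MQA_SCHEDULER",
    "raptor:order-router": "TRACE_SIGNING_PRIVKEY_RAPTOR_ORDER_ROUTER",
    "raptor:paper-gate": "TRACE_SIGNING_PRIVKEY_RAPTOR_PAPER_GATE",
    "raptor:trading-runtime": "TRACE_SIGNING_PRIVKEY_RAPTOR_TRADING_RUNTIME",
}


def env_var_name(subsystem: str, version: str) -> str:
    # Versions are written "v1" in prose and "V1" in variable names.
    return f"{SUBSYSTEM_ENV_PREFIX[subsystem]}_{version.upper()}"


def vault_path(env: str, subsystem: str, version: str) -> str:
    if env not in ("staging", "prod"):
        raise ValueError(f"unknown environment: {env!r}")
    return f"/raxx/v1/{env}/raptor/{env_var_name(subsystem, version)}"
```

For example, vault_path("staging", "mq-a:scheduler", "v1") reproduces the example path shown above.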
1c. Set dyno env vars
heroku config:set \
TRACE_SIGNING_PRIVKEY_MQA_SCHEDULER_V1="<seed>" \
TRACE_SIGNING_PRIVKEY_RAPTOR_ORDER_ROUTER_V1="<seed>" \
TRACE_SIGNING_PRIVKEY_RAPTOR_PAPER_GATE_V1="<seed>" \
TRACE_SIGNING_PRIVKEY_RAPTOR_TRADING_RUNTIME_V1="<seed>" \
-a raxx-api-staging >/dev/null 2>&1
Each <seed> is that subsystem's own value from Infisical. Repeat for -a raxx-api-prod with the prod seeds.
Do NOT echo the seeds to the terminal. Always append >/dev/null 2>&1 per
feedback_heroku_config_set_echoes_secrets.md.
1d. Verify
Enable the flag on staging first:
heroku config:set FLAG_TRACE_SYS_SIGNING=1 -a raxx-api-staging >/dev/null 2>&1
After a few minutes of traffic, run the integrity checker manually:
heroku run python -m jobs.trace_integrity_check --window-hours 1 -a raxx-api-staging
Expected output: status=pass. If you see SIGNATURE FAILURE, check that
the env vars are set correctly and the seed matches what is in Infisical.
2. Key rotation procedure
Rotation produces a new key version. Old events remain verifiable with the old key version. No service redeploy is required.
Step 1: Generate the new key seed (same as §1a above)
Step 2: Store the new seed in Infisical with the next version number
Example: rotating mq-a scheduler from v1 to v2 on prod:
/raxx/v1/prod/raptor/TRACE_SIGNING_PRIVKEY_MQA_SCHEDULER_V2
Do NOT delete the v1 entry. The v1 entry must remain for verification of all events signed before the rotation.
Step 3: Add the new env var to the dyno
heroku config:set TRACE_SIGNING_PRIVKEY_MQA_SCHEDULER_V2="<new seed>" \
-a raxx-api-prod >/dev/null 2>&1
Step 4: Switch the active version
heroku config:set TRACE_SIGNING_KEY_VERSION_MQA_SCHEDULER=v2 \
-a raxx-api-prod >/dev/null 2>&1
The running dyno picks up the new version within 5 minutes (key cache TTL). No manual restart is required; note that heroku config:set itself creates a new release and restarts the app's dynos.
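The 5-minute pickup window is what a time-based cache of the active version would produce. The following is a minimal sketch of such a cache, not the actual implementation: the class name, the loader callable, and the injectable clock are assumptions made for illustration.

```python
import time
from typing import Callable, Dict, Tuple


class ActiveVersionCache:
    """Caches the active key version per subsystem for a fixed TTL."""

    def __init__(
        self,
        load: Callable[[str], str],
        ttl_seconds: float = 300.0,  # 5 minutes, the TTL stated in this runbook
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._load = load
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, subsystem_suffix: str) -> str:
        now = self._clock()
        cached = self._entries.get(subsystem_suffix)
        if cached is not None and now - cached[0] < self._ttl:
            # Entry is still fresh: keep signing with the cached version.
            return cached[1]
        # Entry is missing or expired: reload and restart the TTL window.
        value = self._load(subsystem_suffix)
        self._entries[subsystem_suffix] = (now, value)
        return value
```

With this design a version switch becomes visible at most one TTL after it is made, which is why the verification step waits for the TTL plus a grace period.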
Step 5: Verify post-rotation
After 15 minutes (TTL + grace):
heroku run python -m jobs.trace_integrity_check --window-hours 1 -a raxx-api-prod
Expected: status=pass. If SIGNATURE FAILURE appears, check:
- Is TRACE_SIGNING_PRIVKEY_MQA_SCHEDULER_V2 set on the dyno?
- Does the seed match the one stored in Infisical?
- Is TRACE_SIGNING_KEY_VERSION_MQA_SCHEDULER set to v2?
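The three checks above can be expressed as one function over a snapshot of the dyno's config vars. This is a sketch under stated assumptions: rotation_problems is a hypothetical helper, env is a plain mapping the operator has already loaded, and vault_seed is the value read from Infisical. Neither seed is printed.

```python
import hmac
from typing import List, Mapping


def rotation_problems(
    env: Mapping[str, str],
    subsystem_suffix: str,  # e.g. "MQA_SCHEDULER"
    new_version: str,       # e.g. "v2"
    vault_seed: str,        # seed as stored in Infisical
) -> List[str]:
    key_var = f"TRACE_SIGNING_PRIVKEY_{subsystem_suffix}_{new_version.upper()}"
    version_var = f"TRACE_SIGNING_KEY_VERSION_{subsystem_suffix}"
    problems: List[str] = []

    dyno_seed = env.get(key_var)
    if dyno_seed is None:
        problems.append(f"{key_var} is not set on the dyno")
    elif not hmac.compare_digest(dyno_seed.encode(), vault_seed.encode()):
        # Constant-time comparison; the message never includes either seed.
        problems.append(f"{key_var} does not match the seed stored in Infisical")

    if env.get(version_var) != new_version:
        problems.append(f"{version_var} is not set to {new_version}")
    return problems
```

An empty list means all three checklist items pass; otherwise each entry names the variable to fix.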
3. Responding to a signature failure in the breach pipeline
A SIGNATURE_BREACH alert from the nightly checker means:
- A sys_* event row has a sig field that does not verify with the recorded sig_key_version and the public key derived from the stored seed.
This is distinct from a HASH_CHAIN_BREACH alert, which indicates a row was
tampered with or dropped from the chain (different root cause and remediation).
3a. Triage checklist
- Check recent key rotation: Was a rotation performed recently? If the new version seed was set incorrectly, events signed with the bad seed will fail. Compare TRACE_SIGNING_PRIVKEY_{SUBSYSTEM}_{VERSION} on the dyno against Infisical.
- Check for a revoked or wrong key version: If sig_key_version on the failing event references a version that no longer has a corresponding env var, the key lookup will fail (key_not_found reason). Restore the seed from Infisical.
- Check for DB tampering: If neither of the above explains the failure, treat this as a potential DB compromise. Escalate per the incident response runbook.
3b. Alert severity mapping
| Alert type | Severity | Channel |
|---|---|---|
| HASH_CHAIN_BREACH | CRITICAL | #raxx-ops-alert-sev1 (EyeTok on-call) |
| SIGNATURE_BREACH | HIGH | #raxx-ops-alert-sev2 (EyeTok on-call) |
Both alerts are also sent to ops@raxx.app.
A HASH_CHAIN_BREACH means events were tampered with or dropped — immediate
incident response required.
A SIGNATURE_BREACH could be a key misconfiguration or a genuine forgery.
Treat as HIGH until root cause is confirmed. If root cause is confirmed as
key misconfiguration, downgrade to ops action only (no incident).
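The mapping and the downgrade rule can be written down as data plus one function. This is an illustrative sketch only: the names ALERT_ROUTING and effective_severity, the "key_misconfiguration" root-cause string, and the "OPS_ACTION" label are hypothetical and do not come from the alerting code.

```python
from typing import Optional

# Severity and channel per alert type, copied from the table above.
ALERT_ROUTING = {
    "HASH_CHAIN_BREACH": ("CRITICAL", "#raxx-ops-alert-sev1"),
    "SIGNATURE_BREACH": ("HIGH", "#raxx-ops-alert-sev2"),
}


def effective_severity(alert_type: str, confirmed_root_cause: Optional[str] = None) -> str:
    severity, _channel = ALERT_ROUTING[alert_type]
    if alert_type == "SIGNATURE_BREACH" and confirmed_root_cause == "key_misconfiguration":
        # Only a confirmed misconfiguration downgrades the alert.
        return "OPS_ACTION"
    # Unconfirmed signature breaches stay HIGH; hash chain breaches stay CRITICAL.
    return severity
```

Note that the downgrade applies only to SIGNATURE_BREACH; a HASH_CHAIN_BREACH keeps its severity regardless of the suspected cause.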
3c. Isolating the breach
Query the failing event directly:
SELECT id, subsystem, sig_key_version, ts_emitted, context_json
FROM trace_events
WHERE id = '<failing_event_id>';
Compare sig_key_version to the available key versions on the dyno. Print the variable names only, so that the seeds are not echoed to the terminal:
heroku config -a raxx-api-prod | grep TRACE_SIGNING | cut -d: -f1
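The comparison can be done mechanically from the variable names alone, without reading any seed values. The sketch below assumes the names follow the TRACE_SIGNING_PRIVKEY_{SUBSYSTEM}_{VERSION} convention with versions written as V followed by digits; both function names are hypothetical.

```python
import re
from typing import Iterable, List, Optional


def available_versions(var_names: Iterable[str], subsystem_suffix: str) -> List[str]:
    """List the key versions (e.g. ["v1", "v2"]) that have a PRIVKEY env var."""
    pattern = re.compile(
        rf"^TRACE_SIGNING_PRIVKEY_{re.escape(subsystem_suffix)}_V(\d+)$"
    )
    numbers = []
    for name in var_names:
        match = pattern.match(name.strip())
        if match:
            numbers.append(int(match.group(1)))
    return [f"v{n}" for n in sorted(numbers)]


def key_lookup_reason(
    sig_key_version: str, var_names: Iterable[str], subsystem_suffix: str
) -> Optional[str]:
    # Mirrors the key_not_found case from the triage checklist: the event
    # references a version with no corresponding env var on the dyno.
    if sig_key_version in available_versions(var_names, subsystem_suffix):
        return None
    return "key_not_found"
```

A result of None means the key for that version is present, so the failure has to be explained by a wrong seed or by tampering instead.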
4. Distinguishing hash chain vs signature failures
The nightly checker (jobs/trace_integrity_check.py) raises separate alerts:
- HASH_CHAIN_BREACH: hash_prev for one or more events in a workflow chain does not match SHA-256(canonical_json(previous_event)). This indicates a row was modified, inserted, or deleted after the chain was built. Severity: CRITICAL.
- SIGNATURE_BREACH: An Ed25519 signature on a sys_* event does not verify. This indicates either key misconfiguration or a forged event row. Severity: HIGH.
These are always emitted as separate alerts with separate payloads so the ops
team can triage them independently. A single bad actor with DB write access
could trigger both; key misconfiguration triggers only SIGNATURE_BREACH.
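The hash chain check can be illustrated with a short sketch. The canonical_json used here (sorted keys, compact separators, UTF-8) is an assumption; the checker's real canonicalization may differ, and the hex encoding of hash_prev is likewise assumed. Events are modeled as plain dictionaries.

```python
import hashlib
import json
from typing import Any, Dict, List, Optional

Event = Dict[str, Any]


def canonical_json(event: Event) -> bytes:
    # Assumed canonical form: sorted keys, no extra whitespace, UTF-8.
    return json.dumps(
        event, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")


def expected_hash_prev(previous_event: Event) -> str:
    return hashlib.sha256(canonical_json(previous_event)).hexdigest()


def first_chain_break(events: List[Event]) -> Optional[int]:
    """Return the index of the first event whose hash_prev does not match
    the hash of the event before it, or None if the chain is intact."""
    for i in range(1, len(events)):
        if events[i].get("hash_prev") != expected_hash_prev(events[i - 1]):
            return i
    return None
```

Because each event's hash covers its own hash_prev field, modifying any row changes the expected value for the row after it, which is where the break is reported.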
5. Key material reference
| Item | Location |
|---|---|
| Private key seeds (prod) | Infisical /raxx/v1/prod/raptor/TRACE_SIGNING_PRIVKEY_* |
| Private key seeds (staging) | Infisical /raxx/v1/staging/raptor/TRACE_SIGNING_PRIVKEY_* |
| Active version (prod) | Heroku dyno TRACE_SIGNING_KEY_VERSION_* on raxx-api-prod |
| Active version (staging) | Heroku dyno TRACE_SIGNING_KEY_VERSION_* on raxx-api-staging |
| Signing service code | backend_v2/api/services/trace_signing_service.py |
| Integrity checker | backend_v2/jobs/trace_integrity_check.py |
6. Post-rotation audit trail
After every key rotation, add an entry to the ops attestation log:
docs/ops/attestation-log/YYYY-MM-DD-trace-key-rotation.md
Include:
- Date and time (UTC)
- Subsystem(s) rotated
- Old version → new version
- Reason for rotation
- Who performed the rotation
- Verification result (pass/fail + checker output)
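A small helper can produce the file name and a skeleton entry so that no field is forgotten. This is a sketch: the function names and the exact field layout are assumptions, and the timestamp must be a timezone-aware datetime so the conversion to UTC is unambiguous.

```python
from datetime import datetime, timezone
from typing import Sequence

LOG_DIR = "docs/ops/attestation-log"


def attestation_path(when: datetime) -> str:
    # The file name uses the UTC date of the rotation.
    utc = when.astimezone(timezone.utc)
    return f"{LOG_DIR}/{utc:%Y-%m-%d}-trace-key-rotation.md"


def attestation_entry(
    when: datetime,
    subsystems: Sequence[str],
    old_version: str,
    new_version: str,
    reason: str,
    operator: str,
    verification: str,
) -> str:
    utc = when.astimezone(timezone.utc)
    lines = [
        f"- Date and time (UTC): {utc:%Y-%m-%d %H:%M}",
        f"- Subsystem(s) rotated: {', '.join(subsystems)}",
        f"- Old version → new version: {old_version} → {new_version}",
        f"- Reason for rotation: {reason}",
        f"- Performed by: {operator}",
        f"- Verification result: {verification}",
    ]
    return "\n".join(lines) + "\n"
```

The entry should never contain seed values; only versions, names, and the checker output belong in the log.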