Raxx · internal docs

internal · gated

Trace Signing Key Rotation Runbook

Component: Raptor (raxx-api-prod / raxx-api-staging)
Feature flag: FLAG_TRACE_SYS_SIGNING
Issue: #515 (SC-12, epic #500)
Ops contact: ops@raxx.app


Overview

Every sys_* trace event is signed with an Ed25519 private key specific to the subsystem that emitted it. Keys are stored in the Infisical vault and surfaced to the Raptor dyno as environment variables. Signing is gated by FLAG_TRACE_SYS_SIGNING; when the flag is off, sig and sig_key_version are null on all inserts.

Each subsystem has its own key:

Subsystem                Env var prefix
mq-a:scheduler           TRACE_SIGNING_PRIVKEY_MQA_SCHEDULER
raptor:order-router      TRACE_SIGNING_PRIVKEY_RAPTOR_ORDER_ROUTER
raptor:paper-gate        TRACE_SIGNING_PRIVKEY_RAPTOR_PAPER_GATE
raptor:trading-runtime   TRACE_SIGNING_PRIVKEY_RAPTOR_TRADING_RUNTIME

1. Initial provisioning (first-time setup)

Before enabling FLAG_TRACE_SYS_SIGNING, complete these steps for each subsystem.

1a. Generate an Ed25519 keypair seed

The seed is a random 32-byte value encoded as base64url (URL-safe base64, no padding). Generate it locally with Python:

import os, base64
seed = os.urandom(32)
print(base64.urlsafe_b64encode(seed).rstrip(b"=").decode())

Do NOT use this output as a plain string anywhere except in Infisical vault. Do NOT commit it to any file.
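
Before storing a seed, it can help to confirm that it decodes back to exactly 32 bytes. The following is a minimal sketch, assuming the unpadded base64url encoding produced by the snippet above; decode_seed is a hypothetical helper name and is not part of the Raptor codebase.

```python
import base64
import os


def decode_seed(seed_b64url: str) -> bytes:
    """Decode an unpadded base64url seed and check that it is 32 bytes long."""
    # Restore the "=" padding that was stripped when the seed was generated.
    padded = seed_b64url + "=" * (-len(seed_b64url) % 4)
    raw = base64.urlsafe_b64decode(padded)
    if len(raw) != 32:
        raise ValueError(f"expected a 32-byte seed, got {len(raw)} bytes")
    return raw


if __name__ == "__main__":
    # Round-trip check on a throwaway seed; never print a real seed.
    throwaway = base64.urlsafe_b64encode(os.urandom(32)).rstrip(b"=").decode()
    print(len(decode_seed(throwaway)))
```

The check prints only the decoded length, so it stays consistent with the rule that the seed itself must not be echoed.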

1b. Store in Infisical

Vault path convention: /raxx/v1/{env}/raptor/TRACE_SIGNING_PRIVKEY_{SUBSYSTEM}_{VERSION}

Example for staging, mq-a scheduler, version v1:

/raxx/v1/staging/raptor/TRACE_SIGNING_PRIVKEY_MQA_SCHEDULER_V1

Store the base64url seed value under that path. Use the Infisical web UI at vault.raxx.app or the Infisical CLI.

Repeat for every subsystem in both environments (staging and prod).
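
The path convention can be sketched as a pair of small helpers. This is an illustrative sketch only: the function names and the SUBSYSTEM_SUFFIX dictionary are hypothetical, and the suffixes are copied from the subsystem table in the Overview rather than derived from the subsystem names.

```python
# Explicit mapping copied from the subsystem table; the suffixes are not
# derived mechanically because "mq-a" collapses to "MQA".
SUBSYSTEM_SUFFIX = {
    "mq-a:scheduler": "MQA_SCHEDULER",
    "raptor:order-router": "RAPTOR_ORDER_ROUTER",
    "raptor:paper-gate": "RAPTOR_PAPER_GATE",
    "raptor:trading-runtime": "RAPTOR_TRADING_RUNTIME",
}


def env_var_name(subsystem: str, version: str) -> str:
    """Return the versioned env var name, e.g. ..._MQA_SCHEDULER_V1."""
    return f"TRACE_SIGNING_PRIVKEY_{SUBSYSTEM_SUFFIX[subsystem]}_{version.upper()}"


def vault_path(env: str, subsystem: str, version: str) -> str:
    """Return the Infisical path for one seed, following the path convention."""
    if env not in ("staging", "prod"):
        raise ValueError(f"unknown environment: {env}")
    return f"/raxx/v1/{env}/raptor/{env_var_name(subsystem, version)}"
```

For example, vault_path("staging", "mq-a:scheduler", "v1") reproduces the staging example path shown above.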

1c. Set dyno env vars

heroku config:set \
  TRACE_SIGNING_PRIVKEY_MQA_SCHEDULER_V1="<seed>" \
  TRACE_SIGNING_PRIVKEY_RAPTOR_ORDER_ROUTER_V1="<seed>" \
  TRACE_SIGNING_PRIVKEY_RAPTOR_PAPER_GATE_V1="<seed>" \
  -a raxx-api-staging >/dev/null 2>&1

Repeat for -a raxx-api-prod with the prod seeds.

Do NOT echo the seeds to the terminal. Always append >/dev/null 2>&1 per feedback_heroku_config_set_echoes_secrets.md.

1d. Verify

Enable the flag on staging first:

heroku config:set FLAG_TRACE_SYS_SIGNING=1 -a raxx-api-staging >/dev/null 2>&1

After a few minutes of traffic, run the integrity checker manually:

heroku run python -m jobs.trace_integrity_check --window-hours 1 -a raxx-api-staging

Expected output: status=pass. If you see SIGNATURE FAILURE, check that the env vars are set correctly and the seed matches what is in Infisical.


2. Key rotation procedure

Rotation produces a new key version. Old events remain verifiable with the old key version. No service redeploy is required.

Step 1: Generate the new key seed (same as §1a above)

Step 2: Store the new seed in Infisical with the next version number

Example: rotating mq-a scheduler from v1 to v2 on prod:

/raxx/v1/prod/raptor/TRACE_SIGNING_PRIVKEY_MQA_SCHEDULER_V2

Do NOT delete the v1 entry. The v1 entry must remain for verification of all events signed before the rotation.

Step 3: Add the new env var to the dyno

heroku config:set TRACE_SIGNING_PRIVKEY_MQA_SCHEDULER_V2="<new seed>" \
  -a raxx-api-prod >/dev/null 2>&1

Step 4: Switch the active version

heroku config:set TRACE_SIGNING_KEY_VERSION_MQA_SCHEDULER=v2 \
  -a raxx-api-prod >/dev/null 2>&1

The running dyno picks up the new version within 5 minutes (key cache TTL). No restart required.
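
To illustrate what a 5-minute key cache TTL implies, here is a minimal sketch of a TTL cache for the active version. It is not the code in trace_signing_service.py: the class name, the injectable environ and clock parameters, and the "v1" default are all assumptions made for the example, and how the real service refreshes its configuration is not documented here.

```python
import os
import time

CACHE_TTL_SECONDS = 300  # 5 minutes, matching the key cache TTL stated above


class ActiveVersionCache:
    """Caches the active key version per subsystem and re-reads it after the TTL."""

    def __init__(self, environ=os.environ, clock=time.monotonic, ttl=CACHE_TTL_SECONDS):
        self._environ = environ
        self._clock = clock
        self._ttl = ttl
        self._cache = {}  # suffix -> (version, time_read)

    def get(self, suffix: str, default: str = "v1") -> str:
        now = self._clock()
        hit = self._cache.get(suffix)
        if hit is not None and now - hit[1] < self._ttl:
            return hit[0]  # still fresh: keep signing with the cached version
        # Entry is missing or expired: read the active version again.
        version = self._environ.get(f"TRACE_SIGNING_KEY_VERSION_{suffix}", default)
        self._cache[suffix] = (version, now)
        return version
```

Under this model a version switch becomes visible at most one TTL after it is made, which is why the verification step waits for the TTL plus a grace period.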

Step 5: Verify post-rotation

After 15 minutes (TTL + grace):

heroku run python -m jobs.trace_integrity_check --window-hours 1 -a raxx-api-prod

Expected: status=pass. If SIGNATURE FAILURE appears, check:

  - Is TRACE_SIGNING_PRIVKEY_MQA_SCHEDULER_V2 set on the dyno?
  - Does the seed match the one stored in Infisical?
  - Is TRACE_SIGNING_KEY_VERSION_MQA_SCHEDULER set to v2?
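
The three checks above can be sketched as a single function. This is a hedged illustration: rotation_problems is a hypothetical name, and the dyno config and the Infisical seed are assumed to have been fetched into memory already, without being printed.

```python
import hmac


def rotation_problems(dyno_env: dict, vault_seed: str, suffix: str, version: str) -> list:
    """Return the failed checks from the list above (empty list means all passed)."""
    problems = []
    key_var = f"TRACE_SIGNING_PRIVKEY_{suffix}_{version.upper()}"
    version_var = f"TRACE_SIGNING_KEY_VERSION_{suffix}"

    dyno_seed = dyno_env.get(key_var)
    if dyno_seed is None:
        problems.append(f"{key_var} is not set on the dyno")
    elif not hmac.compare_digest(dyno_seed.encode(), vault_seed.encode()):
        # Constant-time comparison; the seeds themselves are never printed.
        problems.append(f"{key_var} does not match the seed in Infisical")

    if dyno_env.get(version_var) != version:
        problems.append(f"{version_var} is not set to {version}")
    return problems
```

The messages name only the variables, never the seed values, so the result is safe to paste into a ticket.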


3. Responding to a signature failure in the breach pipeline

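A SIGNATURE_BREACH alert from the nightly checker means that the Ed25519 signature stored on at least one sys_* event failed verification against the key for that event's sig_key_version.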

This is distinct from a HASH_CHAIN_BREACH alert, which indicates a row was tampered with or dropped from the chain (different root cause and remediation).

3a. Triage checklist

  1. Check recent key rotation: Was a rotation performed recently? If the new version seed was set incorrectly, events signed with the bad seed will fail. Compare TRACE_SIGNING_PRIVKEY_{SUBSYSTEM}_{VERSION} on the dyno against Infisical.

  2. Check for a revoked or wrong key version: If sig_key_version on the failing event references a version that no longer has a corresponding env var, the key lookup will fail (key_not_found reason). Restore the seed from Infisical.

  3. Check for DB tampering: If neither of the above explains the failure, treat this as a potential DB compromise. Escalate per the incident response runbook.

3b. Alert severity mapping

Alert type          Severity   Channel
HASH_CHAIN_BREACH   CRITICAL   #raxx-ops-alert-sev1 (EyeTok on-call)
SIGNATURE_BREACH    HIGH       #raxx-ops-alert-sev2 (EyeTok on-call)

Both alerts are also sent to ops@raxx.app.

A HASH_CHAIN_BREACH means events were tampered with or dropped; immediate incident response is required.

A SIGNATURE_BREACH could be a key misconfiguration or a genuine forgery. Treat as HIGH until root cause is confirmed. If root cause is confirmed as key misconfiguration, downgrade to ops action only (no incident).
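
The severity mapping can be expressed as a small routing table. This is a sketch for illustration only; AlertRoute, ALERT_ROUTES, and destinations are hypothetical names, and the actual alerting configuration may be structured differently.

```python
from typing import NamedTuple


class AlertRoute(NamedTuple):
    severity: str
    channel: str


# Values copied from the severity mapping above.
ALERT_ROUTES = {
    "HASH_CHAIN_BREACH": AlertRoute("CRITICAL", "#raxx-ops-alert-sev1"),
    "SIGNATURE_BREACH": AlertRoute("HIGH", "#raxx-ops-alert-sev2"),
}
OPS_EMAIL = "ops@raxx.app"  # both alert types are also sent here


def destinations(alert_type: str) -> list:
    """Return every destination an alert of this type is delivered to."""
    route = ALERT_ROUTES[alert_type]
    return [route.channel, OPS_EMAIL]
```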

3c. Isolating the breach

Query the failing event directly:

SELECT id, subsystem, sig_key_version, ts_emitted, context_json
FROM trace_events
WHERE id = '<failing_event_id>';

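Compare sig_key_version to the key versions available on the dyno. List only the variable names so that the seed values are not printed to the terminal:

heroku config -a raxx-api-prod | grep TRACE_SIGNING | cut -d: -f1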
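
Given the variable names only (not their values), the comparison can be sketched as follows. The helper names are hypothetical, and the sketch assumes that sig_key_version is stored in the same lowercase form ("v1", "v2") used for TRACE_SIGNING_KEY_VERSION_*.

```python
import re

# Matches versioned key variables such as TRACE_SIGNING_PRIVKEY_MQA_SCHEDULER_V1.
_KEY_VAR = re.compile(r"^TRACE_SIGNING_PRIVKEY_(?P<suffix>[A-Z_]+)_(?P<version>V\d+)$")


def available_versions(var_names) -> dict:
    """Group versions by subsystem suffix, e.g. {"MQA_SCHEDULER": {"v1", "v2"}}."""
    found = {}
    for name in var_names:
        match = _KEY_VAR.match(name.strip())
        if match:
            found.setdefault(match["suffix"], set()).add(match["version"].lower())
    return found


def key_is_available(var_names, suffix: str, sig_key_version: str) -> bool:
    """True if the dyno has a key variable for this subsystem and version."""
    return sig_key_version.lower() in available_versions(var_names).get(suffix, set())
```

A False result corresponds to the key_not_found case in the triage checklist: the event references a version with no matching env var.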

4. Distinguishing hash chain vs signature failures

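The nightly checker (jobs/trace_integrity_check.py) raises separate alerts: HASH_CHAIN_BREACH when a row was tampered with or dropped from the chain, and SIGNATURE_BREACH when an event's signature fails verification.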

These are always emitted as separate alerts with separate payloads so the ops team can triage them independently. A single bad actor with DB write access could trigger both; key misconfiguration triggers only SIGNATURE_BREACH.
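
The toy model below illustrates why the two checks fail independently. It is a sketch under loud assumptions, not the logic of trace_integrity_check.py: the chain rule (SHA-256 over the previous hash plus the payload) is assumed, the row layout is invented, and HMAC-SHA256 stands in for Ed25519 so that the example needs only the standard library.

```python
import hashlib
import hmac


def row_hash(prev_hash: str, payload: str) -> str:
    # Assumed chain rule: each row hashes its payload together with the previous hash.
    return hashlib.sha256((prev_hash + payload).encode()).hexdigest()


def sign(key: bytes, payload: str) -> str:
    # HMAC-SHA256 stands in for Ed25519 so the sketch needs only the stdlib.
    return hmac.new(key, payload.encode(), hashlib.sha256).hexdigest()


def build(payloads, key):
    """Build a chained, signed list of rows from plain payload strings."""
    rows, prev = [], ""
    for payload in payloads:
        prev = row_hash(prev, payload)
        rows.append({"payload": payload, "hash": prev, "sig": sign(key, payload)})
    return rows


def check(rows, key) -> set:
    """Return the set of alert types the rows would raise."""
    alerts, prev = set(), ""
    for row in rows:
        if row_hash(prev, row["payload"]) != row["hash"]:
            alerts.add("HASH_CHAIN_BREACH")
        if not hmac.compare_digest(sign(key, row["payload"]), row["sig"]):
            alerts.add("SIGNATURE_BREACH")
        prev = row["hash"]
    return alerts
```

In this model, verifying intact rows with the wrong key yields only SIGNATURE_BREACH, while editing a stored payload yields both alerts, which mirrors the distinction drawn above.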


5. Key material reference

Item                          Location
Private key seeds (prod)      Infisical /raxx/v1/prod/raptor/TRACE_SIGNING_PRIVKEY_*
Private key seeds (staging)   Infisical /raxx/v1/staging/raptor/TRACE_SIGNING_PRIVKEY_*
Active version (prod)         Heroku dyno TRACE_SIGNING_KEY_VERSION_* on raxx-api-prod
Active version (staging)      Heroku dyno TRACE_SIGNING_KEY_VERSION_* on raxx-api-staging
Signing service code          backend_v2/api/services/trace_signing_service.py
Integrity checker             backend_v2/jobs/trace_integrity_check.py

6. Post-rotation audit trail

After every key rotation, add an entry to the ops attestation log:

docs/ops/attestation-log/YYYY-MM-DD-trace-key-rotation.md

Include:

  - Date and time (UTC)
  - Subsystem(s) rotated
  - Old version → new version
  - Reason for rotation
  - Who performed the rotation
  - Verification result (pass/fail + checker output)
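
An entry with these fields could be generated along the following lines. This is a sketch only: attestation_entry is a hypothetical helper, and the field labels and layout inside the file are assumptions, since the runbook specifies the file name and the required contents but not a template.

```python
from datetime import datetime, timezone


def attestation_entry(when, subsystems, old_version, new_version, reason, operator, result):
    """Return (path, body) for one rotation record; `when` must be timezone-aware."""
    when_utc = when.astimezone(timezone.utc)
    path = f"docs/ops/attestation-log/{when_utc:%Y-%m-%d}-trace-key-rotation.md"
    body = "\n".join([
        f"Date and time (UTC): {when_utc:%Y-%m-%d %H:%M}",
        f"Subsystem(s) rotated: {', '.join(subsystems)}",
        f"Version change: {old_version} -> {new_version}",
        f"Reason for rotation: {reason}",
        f"Performed by: {operator}",
        f"Verification result: {result}",
    ])
    return path, body
```

Passing datetime.now(timezone.utc) as `when` keeps the date in the file name and the timestamp in the body in UTC, as the log requires.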