Raxx · internal docs

Operator runbook — Shadow Analytics Pipeline

Last updated: 2026-05-17 UTC
Scope: Opt-in shadow aggregate analytics pipeline (raxx-analytics-prod / raxx-analytics-staging)
Parent issue: #279
DPIA: docs/security/dpia-shadow-analytics.md
ADRs: ADR 0017, ADR 0018


Prerequisites

All secrets are read from Infisical at runtime. Never inline secrets in files or scripts.

Secret                          Infisical path
Analytics service auth token    /raxx/analytics/ANALYTICS_AUTH_TOKEN
Analytics database URL          /raxx/analytics/ANALYTICS_DB_URL
Shadow analytics enabled flag   Heroku config var SHADOW_ANALYTICS_ENABLED

Heroku app names:

- Production: raxx-analytics-prod
- Staging: raxx-analytics-staging


1. Deploy procedure

Initial deploy

Run these steps in order. Do not flip SHADOW_ANALYTICS_ENABLED=true until the service health check passes.

Step 1 — Provision secrets in Infisical

Ensure the following paths exist in Infisical under the /raxx/analytics/ folder before writing secrets:

POST /api/v1/folders
body: { "workspaceId": "<WORKSPACE_ID>", "environment": "prod", "path": "/raxx/analytics" }

Write secrets via Infisical CLI or REST. Do not write them directly to any file.

Step 2 — Set Heroku config vars (stdout silenced)

heroku config:set \
  ANALYTICS_AUTH_TOKEN="$(infisical secrets get ANALYTICS_AUTH_TOKEN --path /raxx/analytics --env prod --plain)" \
  ANALYTICS_DB_URL="$(infisical secrets get ANALYTICS_DB_URL --path /raxx/analytics --env prod --plain)" \
  SHADOW_ANALYTICS_ENABLED=false \
  --app raxx-analytics-prod >/dev/null 2>&1

Step 3 — Deploy the analytics service

git push heroku-analytics-prod main

Or promote from staging:

heroku pipelines:promote --app raxx-analytics-staging >/dev/null 2>&1

Step 4 — Run database migrations

heroku run python manage.py db upgrade --app raxx-analytics-prod

Verify migrations completed without error before proceeding.

Step 5 — Health check

curl -sf -H "Authorization: Bearer $(heroku config:get ANALYTICS_AUTH_TOKEN --app raxx-analytics-prod 2>/dev/null)" \
  https://analytics.raxx.app/health

Expected response: {"status": "ok"} with HTTP 200.

If health check fails: do not flip SHADOW_ANALYTICS_ENABLED. Investigate logs:

heroku logs --tail --app raxx-analytics-prod
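
The gate between Step 5 and Step 6 can be written down as a small check. This is a minimal sketch, not part of the service: the helper name `health_check_passed` is hypothetical, and it assumes the health endpoint returns the JSON body documented above.

```python
import json


def health_check_passed(status_code: int, body: str) -> bool:
    """Return True only for HTTP 200 with a JSON object whose status is "ok".

    Hypothetical helper that mirrors the expected response in Step 5.
    """
    if status_code != 200:
        return False
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        # A non-JSON body (for example an HTML error page) counts as a failure.
        return False
    return isinstance(payload, dict) and payload.get("status") == "ok"


if __name__ == "__main__":
    # Only a passing check should allow SHADOW_ANALYTICS_ENABLED=true.
    print(health_check_passed(200, '{"status": "ok"}'))
    print(health_check_passed(503, '{"status": "ok"}'))
```

Treating any parse failure as "not healthy" keeps the flag off in ambiguous cases, which matches the rule that the pipeline stays disabled until the check passes.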

Step 6 — Enable the pipeline (only after health check passes)

heroku config:set SHADOW_ANALYTICS_ENABLED=true --app raxx-analytics-prod >/dev/null 2>&1

Verify the flag is set:

heroku config:get SHADOW_ANALYTICS_ENABLED --app raxx-analytics-prod

Expected output: true

Step 7 — Confirm Raptor integration

heroku config:get SHADOW_ANALYTICS_ENABLED --app raxx-api-prod

Raptor reads this flag to determine whether to activate the client-side shadow aggregator path. Confirm it is set to true in both apps.


Subsequent deploys

Standard Heroku pipeline promote workflow. No special steps unless the release includes a database schema migration.

If a migration is included:

1. Deploy to raxx-analytics-staging first.
2. Run migrations on staging: heroku run python manage.py db upgrade --app raxx-analytics-staging
3. Verify staging health check passes.
4. Promote to prod: heroku pipelines:promote --app raxx-analytics-staging >/dev/null 2>&1
5. Run migrations on prod: heroku run python manage.py db upgrade --app raxx-analytics-prod


Rollback

If a deploy causes a regression:

Step 1 — Disable the pipeline immediately (stdout silenced)

heroku config:set SHADOW_ANALYTICS_ENABLED=false --app raxx-analytics-prod >/dev/null 2>&1

Step 2 — Verify opted-in users see the expected degraded state

Opted-in users should see the privacy panel reflect "analytics unavailable" or equivalent — the consent toggle remains set (user preference is preserved), but the shadow aggregator JS goes dormant. Confirm in staging before communicating to users.

Step 3 — Rollback the analytics service release

heroku rollback --app raxx-analytics-prod

Step 4 — Verify health check passes post-rollback

curl -sf -H "Authorization: Bearer $(heroku config:get ANALYTICS_AUTH_TOKEN --app raxx-analytics-prod 2>/dev/null)" \
  https://analytics.raxx.app/health

Step 5 — Re-enable if rollback is clean

heroku config:set SHADOW_ANALYTICS_ENABLED=true --app raxx-analytics-prod >/dev/null 2>&1

Note: Rollback preserves analytics data. Data is not deleted on rollback.


2. Token rotation — ANALYTICS_AUTH_TOKEN

Rotation cadence: Every 90 days or on suspected compromise (whichever comes first).
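
As an illustration of the cadence rule, the following sketch decides whether a rotation is due. It is an assumption-laden example: the function name `rotation_due` and the idea of tracking a last-rotated timestamp are hypothetical and not described elsewhere in this runbook.

```python
from datetime import datetime, timedelta, timezone

ROTATION_INTERVAL = timedelta(days=90)  # cadence stated in this runbook


def rotation_due(last_rotated: datetime, now: datetime,
                 suspected_compromise: bool = False) -> bool:
    """Return True when the token should be rotated.

    Hypothetical helper: a suspected compromise always triggers rotation,
    otherwise rotation is due once 90 days have elapsed.
    """
    if suspected_compromise:
        return True
    return now - last_rotated >= ROTATION_INTERVAL


if __name__ == "__main__":
    last = datetime(2026, 1, 1, tzinfo=timezone.utc)
    print(rotation_due(last, datetime(2026, 5, 17, tzinfo=timezone.utc)))
```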

Step 1 — Generate a new token

Generate a cryptographically random token (minimum 32 bytes, base64url-encoded):

python3 -c "import secrets; print(secrets.token_urlsafe(48))"
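
The one-liner above can be wrapped so that the 32-byte minimum is enforced rather than remembered. This is a sketch only; `new_auth_token` and `MIN_TOKEN_BYTES` are hypothetical names.

```python
import re
import secrets

MIN_TOKEN_BYTES = 32  # minimum entropy required by this runbook
BASE64URL_PATTERN = re.compile(r"^[A-Za-z0-9_-]+$")


def new_auth_token(nbytes: int = 48) -> str:
    """Generate a base64url token from `nbytes` of cryptographic randomness."""
    if nbytes < MIN_TOKEN_BYTES:
        raise ValueError(f"token must use at least {MIN_TOKEN_BYTES} random bytes")
    return secrets.token_urlsafe(nbytes)


if __name__ == "__main__":
    token = new_auth_token()
    # Print only the length, never the token itself.
    print(len(token))
```

With the default of 48 random bytes, `secrets.token_urlsafe` yields a 64-character string, since base64 encodes every 3 bytes as 4 characters.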

Step 2 — Write new token to Infisical

infisical secrets set ANALYTICS_AUTH_TOKEN="<new_token>" --path /raxx/analytics --env prod

Do not commit the token value to any file or log.

Step 3 — Update Heroku config on analytics service (stdout silenced)

heroku config:set \
  ANALYTICS_AUTH_TOKEN="<new_token>" \
  --app raxx-analytics-prod >/dev/null 2>&1

Step 4 — Update Heroku config on Raptor (the caller) (stdout silenced)

heroku config:set \
  ANALYTICS_AUTH_TOKEN="<new_token>" \
  --app raxx-api-prod >/dev/null 2>&1

Step 5 — Verify service-to-service auth

curl -sf -H "Authorization: Bearer <new_token>" https://analytics.raxx.app/health

Expected: HTTP 200 {"status": "ok"}

Step 6 — Revoke old token in Infisical

Step 2 already overwrote the old token value in Infisical, so once Step 5 passes with the new token no further action is needed there. If the old token was also provisioned as a named credential in Infisical, explicitly mark it inactive via the Infisical dashboard.


3. Emergency stop

Use this procedure when you need to halt all shadow-analytics data collection immediately (e.g., suspected breach, unexpected data in analytics store, regulatory inquiry).

Step 1 — Disable the pipeline (stdout silenced)

heroku config:set SHADOW_ANALYTICS_ENABLED=false --app raxx-analytics-prod >/dev/null 2>&1

Step 2 — Confirm the flag is set

heroku config:get SHADOW_ANALYTICS_ENABLED --app raxx-analytics-prod

Expected output: false

Step 3 — Verify no new signals arrive

Wait 5 minutes, then query the analytics store for rows with created_at > NOW() - INTERVAL '5 minutes':

heroku pg:psql --app raxx-analytics-prod -c \
  "SELECT COUNT(*) FROM aggregate_signals WHERE created_at > NOW() - INTERVAL '5 minutes';"

Expected result: 0 (no new rows after flag flip).

If result is non-zero, the client-side aggregator may be caching signals for retry. Check analytics service logs:

heroku logs --tail --app raxx-analytics-prod

Step 4 — Impact assessment

Step 5 — File a security issue

If the emergency stop was triggered by a suspected security incident, file a GitHub issue immediately with label type:security and severity:critical (or appropriate severity). Do not rely on Slack alone.


4. Opt-out hard-delete

SLA: Hard-delete of a user's shadow data must complete within 30 days of opt-out event. Failure to meet this SLA is a potential GDPR Art. 17 violation; notify Kristerpher immediately if a delete job will miss the 30-day window.

Normal flow

When a user opts out in the privacy panel, the Queue service writes an opt-out event to consent_history and enqueues a delete job. The delete pipeline reads the user's pseudonym and removes all analytics records keyed to that pseudonym.

Debug procedure (stalled delete job)

Step 1 — Locate the opt-out event

Query the consent_history table in the Queue service database:

heroku pg:psql --app raxx-queue-prod -c \
  "SELECT user_id, event_type, created_at FROM consent_history WHERE user_id = '<USER_ID>' ORDER BY created_at DESC LIMIT 10;"

Confirm an opt_out event exists and note its timestamp (created_at). This is the start of the 30-day window.

Step 2 — Check delete pipeline job queue

heroku run python manage.py queue inspect --job-type shadow_analytics_delete --app raxx-queue-prod

Look for the job keyed to this user's pseudonym. Status should be pending or completed. If failed:

heroku run python manage.py queue retry --job-id "<JOB_ID>" --app raxx-queue-prod

Step 3 — Verify pseudonym

The pseudonym is derived client-side from the user's WebAuthn PRF output. If the user is still logged in, the client can re-derive it. If the user is not available, the pseudonym may not be recoverable; see the note below.

Note: The server does not store the pseudonym derivation input. If the pseudonym cannot be recovered, document this in the incident record. The delete job may need to be triggered manually with the user's cooperation.

Step 4 — Manual delete (last resort)

Only use this if the automated delete pipeline cannot be repaired within the 30-day window.

heroku pg:psql --app raxx-analytics-prod -c \
  "DELETE FROM aggregate_signals WHERE pseudonym = '<PSEUDONYM>';"

This command requires the I_ACCEPT_MANUAL_DELETE environment variable to be set on the analytics service. Set the guard before running the delete:

heroku config:set I_ACCEPT_MANUAL_DELETE=true --app raxx-analytics-prod >/dev/null 2>&1

After the manual delete:

- Confirm row count is 0 for the pseudonym.
- Unset the guard: heroku config:unset I_ACCEPT_MANUAL_DELETE --app raxx-analytics-prod >/dev/null 2>&1
- Document the manual delete in the consent_history table as an operator action.

Step 5 — Escalation

If the delete cannot be confirmed complete within 25 days of opt-out (5-day buffer before SLA breach), notify Kristerpher immediately via Slack DM. Do not wait until day 30.
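
The two thresholds (escalate at day 25, SLA at day 30) can be checked mechanically against the opt_out timestamp from Step 1. The sketch below is illustrative; `delete_status` and its return labels are hypothetical, not something the Queue service exposes.

```python
from datetime import datetime, timedelta, timezone

ESCALATION_AFTER = timedelta(days=25)  # notify at this point, 5 days before the SLA
SLA_LIMIT = timedelta(days=30)         # hard-delete must be complete within this window


def delete_status(opt_out_at: datetime, now: datetime, delete_confirmed: bool) -> str:
    """Classify a delete job relative to the opt-out timestamp."""
    if delete_confirmed:
        return "complete"
    elapsed = now - opt_out_at
    if elapsed > SLA_LIMIT:
        return "sla_breached"
    if elapsed >= ESCALATION_AFTER:
        return "escalate"
    return "within_window"


if __name__ == "__main__":
    opt_out = datetime(2026, 5, 1, tzinfo=timezone.utc)
    print(delete_status(opt_out, opt_out + timedelta(days=26), delete_confirmed=False))
```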


5. k-floor adjustment

Default: k=20. This is the anonymity floor below which no bucket is written to the analytics store.

Do not reduce k without Kristerpher sign-off and a DPIA re-run note. Reducing k weakens the anonymity guarantee and increases re-identification risk (DPIA Section 3, Risk 1).
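
To make the floor concrete, the following sketch shows the suppression rule applied to a set of bucket counts. It is not the pipeline's actual code: the function name and the dictionary shape are assumptions, and only the rule itself (buckets below k are never written) comes from this runbook.

```python
from typing import Dict

DEFAULT_K_FLOOR = 20


def suppress_small_buckets(bucket_counts: Dict[str, int],
                           k: int = DEFAULT_K_FLOOR) -> Dict[str, int]:
    """Keep only buckets whose contributor count meets the anonymity floor."""
    return {bucket: count for bucket, count in bucket_counts.items() if count >= k}


if __name__ == "__main__":
    counts = {"feature_a": 19, "feature_b": 20, "feature_c": 57}
    # feature_a falls below k=20 and is dropped before anything is stored.
    print(suppress_small_buckets(counts))
```

Lowering k lets smaller groups through, which is why the same data becomes more identifying as the floor drops.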

Verify current k-floor

heroku config:get ANALYTICS_K_FLOOR --app raxx-analytics-prod

Expected output: 20 (or unset, which defaults to 20).

When override may be appropriate

Only if:

- Opt-in rate is so low that no buckets are populated (all signals are suppressed), AND
- Product value from analytics is effectively zero, AND
- Kristerpher has explicitly approved a lower floor, AND
- The DPIA has been updated to reflect the new k value.

Override procedure

Step 1 — Kristerpher approval required. Document the approval (issue comment or PR comment) before proceeding.

Step 2 — Set the override guard (stdout silenced)

heroku config:set \
  I_ACCEPT_WEAKER_ANONYMITY=true \
  ANALYTICS_K_FLOOR="<new_k_value>" \
  --app raxx-analytics-prod >/dev/null 2>&1

Step 3 — Update the DPIA at docs/security/dpia-shadow-analytics.md Section 4.1 to note the new k value, the date, and the approval reference.

Step 4 — Note the re-run trigger in the DPIA Section 6 review log.
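
One plausible way for a service to interpret these two config vars is sketched below. This is an assumption about behavior, not a description of the analytics service: the runbook only states that an unset ANALYTICS_K_FLOOR defaults to 20 and that the override sets both variables together. The function name `effective_k_floor` is hypothetical.

```python
from typing import Mapping

DEFAULT_K_FLOOR = 20


def effective_k_floor(env: Mapping[str, str]) -> int:
    """Resolve the k-floor from config, refusing a weaker floor without the guard.

    Assumed behavior: a value below the default is only honored when
    I_ACCEPT_WEAKER_ANONYMITY is exactly "true".
    """
    raw = env.get("ANALYTICS_K_FLOOR")
    if raw is None or raw == "":
        return DEFAULT_K_FLOOR
    k = int(raw)  # raises ValueError for non-numeric input
    if k < 1:
        raise ValueError("k-floor must be a positive integer")
    if k < DEFAULT_K_FLOOR and env.get("I_ACCEPT_WEAKER_ANONYMITY") != "true":
        raise ValueError("lower k-floor requires I_ACCEPT_WEAKER_ANONYMITY=true")
    return k


if __name__ == "__main__":
    print(effective_k_floor({}))
    print(effective_k_floor({"ANALYTICS_K_FLOOR": "10",
                             "I_ACCEPT_WEAKER_ANONYMITY": "true"}))
```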


6. Debug flows

Analytics service returns 503

Likely causes:

1. SHADOW_ANALYTICS_ENABLED=false. Check with: heroku config:get SHADOW_ANALYTICS_ENABLED --app raxx-analytics-prod
2. Service is down. Check heroku ps --app raxx-analytics-prod; restart dynos if needed: heroku dyno:restart --app raxx-analytics-prod
3. Database connection failure. Check that the ANALYTICS_DB_URL config var is set and the Postgres add-on is healthy:

heroku addons --app raxx-analytics-prod | grep heroku-postgresql
heroku pg:info --app raxx-analytics-prod

Client-side aggregator is silent (no signals arriving)

Likely causes:

1. User has not opted in. Check consent_history for the user: SELECT event_type, created_at FROM consent_history WHERE user_id = '<USER_ID>' ORDER BY created_at DESC LIMIT 5;
2. WebAuthn PRF derivation failed for this user's authenticator. Older authenticators (pre-2023) may not support the PRF extension; check client-side error logs.
3. Analytics service auth token mismatch. Verify ANALYTICS_AUTH_TOKEN matches between Raptor and the analytics service.
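
For the token-mismatch cause, the two values can be compared without printing either one to a terminal or log. The sketch below is one way to do that, assuming the operator has already read both values into memory; `token_fingerprint` and `tokens_match` are hypothetical helper names.

```python
import hashlib
import hmac


def token_fingerprint(token: str) -> str:
    """Return a short SHA-256 prefix that is safe to show in place of the token."""
    return hashlib.sha256(token.encode("utf-8")).hexdigest()[:12]


def tokens_match(token_a: str, token_b: str) -> bool:
    """Compare two tokens in constant time."""
    return hmac.compare_digest(token_a.encode("utf-8"), token_b.encode("utf-8"))


if __name__ == "__main__":
    # Placeholder values for illustration only; never hard-code real tokens.
    raptor_token = "example-token-value"
    analytics_token = "example-token-value"
    print(tokens_match(raptor_token, analytics_token))
    print(token_fingerprint(raptor_token) == token_fingerprint(analytics_token))
```

Fingerprints are useful when two operators each hold one side of the comparison: they can exchange the 12-character prefix instead of the secret.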

Analytics store is empty (all buckets suppressed)

Likely cause: Opt-in count is below k=20 for all buckets. This is expected at launch.

Verify:

heroku pg:psql --app raxx-analytics-prod -c \
  "SELECT COUNT(*) FROM aggregate_signals;"

If result is 0 and you expect non-zero: check opt-in count in Queue service:

heroku pg:psql --app raxx-queue-prod -c \
  "SELECT COUNT(DISTINCT c.user_id) FROM consent_history c WHERE c.event_type = 'opt_in' AND NOT EXISTS (SELECT 1 FROM consent_history o WHERE o.user_id = c.user_id AND o.event_type = 'opt_out' AND o.created_at >= c.created_at);"

If opted-in count is below 20, suppression is correct behavior. No action needed.
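
The counting rule behind that query (a user is opted in when their most recent consent event is opt_in) can be sketched in plain Python. This is illustrative only: the tuple layout mirrors the consent_history columns used in this runbook, and the function names are hypothetical.

```python
from datetime import datetime, timezone
from typing import Dict, Iterable, Set, Tuple

ConsentEvent = Tuple[str, str, datetime]  # (user_id, event_type, created_at)


def currently_opted_in(events: Iterable[ConsentEvent]) -> Set[str]:
    """Return users whose latest opt_in/opt_out event is an opt_in."""
    latest: Dict[str, Tuple[datetime, str]] = {}
    for user_id, event_type, created_at in events:
        if event_type not in ("opt_in", "opt_out"):
            continue
        seen = latest.get(user_id)
        # On a timestamp tie, opt_out wins so the count stays conservative.
        if (seen is None or created_at > seen[0]
                or (created_at == seen[0] and event_type == "opt_out")):
            latest[user_id] = (created_at, event_type)
    return {user for user, (_, kind) in latest.items() if kind == "opt_in"}


def suppression_expected(events: Iterable[ConsentEvent], k: int = 20) -> bool:
    """True when fewer than k users are opted in, so empty buckets are normal."""
    return len(currently_opted_in(events)) < k


if __name__ == "__main__":
    t = datetime(2026, 5, 1, tzinfo=timezone.utc)
    print(suppression_expected([("u1", "opt_in", t)]))
```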

To get a user's full consent history (for GDPR Art. 7 proof or support query):

heroku pg:psql --app raxx-queue-prod -c \
  "SELECT user_id, event_type, created_at, metadata FROM consent_history WHERE user_id = '<USER_ID>' ORDER BY created_at ASC;"

To determine consent state at a specific timestamp:

heroku pg:psql --app raxx-queue-prod -c \
  "SELECT event_type FROM consent_history WHERE user_id = '<USER_ID>' AND created_at <= '<TIMESTAMP>' ORDER BY created_at DESC LIMIT 1;"

If result is opt_in, the user had active consent at that timestamp. If opt_out or no result, consent was not active.
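
The same point-in-time rule can be expressed outside SQL, for example when working from an exported consent history. The sketch below assumes events are available as (event_type, created_at) pairs for a single user; `consent_active_at` is a hypothetical name.

```python
from datetime import datetime, timezone
from typing import Iterable, Tuple


def consent_active_at(events: Iterable[Tuple[str, datetime]], at: datetime) -> bool:
    """Return True if the latest event at or before `at` is an opt_in.

    Mirrors the query above: no prior event, or a latest event of opt_out,
    means consent was not active.
    """
    prior = [(created_at, event_type) for event_type, created_at in events
             if created_at <= at]
    if not prior:
        return False
    _, latest_type = max(prior, key=lambda item: item[0])
    return latest_type == "opt_in"


if __name__ == "__main__":
    history = [
        ("opt_in", datetime(2026, 3, 1, tzinfo=timezone.utc)),
        ("opt_out", datetime(2026, 4, 1, tzinfo=timezone.utc)),
    ]
    print(consent_active_at(history, datetime(2026, 3, 15, tzinfo=timezone.utc)))
```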


7. Incident response

Scenario A: Re-identification risk discovered

A re-identification risk means someone has reason to believe that individual users can be identified from the analytics store data (e.g., a security researcher files a report, an internal review finds a schema change that introduced an identifying column, or a bucket is found to have k < 20 in production due to a guard bypass).

Immediate response:

  1. Disable the pipeline now (stdout silenced):

     heroku config:set SHADOW_ANALYTICS_ENABLED=false --app raxx-analytics-prod >/dev/null 2>&1

  2. Preserve the analytics store as-is. Do not delete any data before legal review. The data may be needed as evidence for the DPIA re-run or regulatory inquiry.

  3. Notify Kristerpher immediately via Slack DM. Do not wait to investigate first.

  4. File a GitHub issue with labels type:security and severity:critical. Title format: [security] CRITICAL: shadow-analytics re-identification risk — <short description>

  5. Do not re-enable the pipeline until:
     - The risk is assessed by Kristerpher and (for any risk that could constitute a GDPR breach) reviewed by counsel
     - The DPIA is updated
     - The architectural fix is deployed and verified

Scenario B: Breach suspected (analytics store accessed without authorization)

Signs: unexpected access patterns in Heroku audit log, alerts from Cloudflare WAF, unusual query volume in Postgres logs, report from external researcher.

Immediate response:

  1. Disable the pipeline (stdout silenced):

     heroku config:set SHADOW_ANALYTICS_ENABLED=false --app raxx-analytics-prod >/dev/null 2>&1

  2. Rotate ANALYTICS_AUTH_TOKEN immediately (follow Runbook §2). This revokes any stolen token.

  3. Notify Kristerpher immediately via Slack DM.

  4. File a GitHub issue with labels type:security and severity:critical.

  5. Preserve logs. Do not rotate or clear Heroku logs or Postgres logs before they are captured:

     heroku logs -n 1500 --app raxx-analytics-prod > /tmp/analytics-breach-$(date -u +%Y%m%dT%H%M%SZ).log

     Store the log file securely (private Google Drive or operator-controlled storage).

  6. GDPR Art. 33 notification assessment: A personal data breach involving the analytics store must be reported to the relevant supervisory authority within 72 hours of becoming aware, if the breach is likely to result in a risk to individuals' rights and freedoms. Given the analytics store contains only aggregate bucketed data with k >= 20, a breach of this store is unlikely to constitute a high-risk breach, but this assessment must be made with counsel, not unilaterally by the operator. Contact counsel immediately if a breach is confirmed.

  7. Do not re-enable the pipeline until the breach scope is understood and Kristerpher + counsel have approved re-activation.
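
Because the 72-hour clock in item 6 starts when the breach becomes known, it helps to record that moment and compute the deadline right away. The sketch below does only the arithmetic; the function names are hypothetical, and whether notification is required at all remains a decision for counsel.

```python
from datetime import datetime, timedelta, timezone

ART_33_WINDOW = timedelta(hours=72)


def art33_deadline(aware_at: datetime) -> datetime:
    """Latest time to notify the supervisory authority, if notification applies."""
    return aware_at + ART_33_WINDOW


def hours_remaining(aware_at: datetime, now: datetime) -> float:
    """Hours left before the deadline; negative once it has passed."""
    return (art33_deadline(aware_at) - now).total_seconds() / 3600


if __name__ == "__main__":
    aware = datetime(2026, 5, 17, 9, 0, tzinfo=timezone.utc)
    print(art33_deadline(aware).isoformat())
```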


References