Burr v1 to v2 Migration Runbook

Status: Ready for operator execution (requires all Burr v2 Terraform cards complete)
Card: #1886
ADR refs: ADR-0084 §9 (OIDC client migration path), §16 (rollout plan)
Operator decision Q4: 14-day decommission overlap (extended from ADR-0084 §9 default of 7 days)


Overview

Burr v1 (CF Access SaaS OIDC application) is replaced by Burr v2 (self-owned Lambda OIDC issuer, dual-region). This runbook covers the cutover for the first downstream app: Infisical.

Burr v2 must be fully deployed and its R53 health checks green for 48+ hours before this runbook begins. Infisical interactive sessions continue working throughout the migration window — operators are re-prompted to authenticate via Burr v2 on their next session boundary.

Overlap window: Burr v1 infrastructure remains live for 14 days post-migration. No Burr v1 resources are decommissioned before Day 14. During this window Burr v1 is idle (Infisical is pointed at v2 only); it exists solely as a rollback path.


Terraform root references

| Surface | Root |
|---|---|
| Burr v2 per-region Lambda + ALB | terraform/burr/regions/<region>/ |
| R53 hosted zone + latency records | terraform/burr/dns/ |
| Burr v1 CF Access SaaS app (Infisical) | terraform/modules/sso-oidc-gateway/ |

SSM parameter map

| Parameter | Description |
|---|---|
| /raxx/burr/<region>/kms_signing_key_arn | KMS MRK replica ARN for the region |
| /raxx/burr/<region>/alb_dns | ALB DNS name for the region |
| /raxx/burr/clients/infisical/client_id | Burr v2 client ID for Infisical |
| /raxx/burr/clients/infisical/client_secret | Burr v2 client secret (SecureString) |
| /raxx/cf-access/infisical_oidc_client_id | Burr v1 client ID (read for rollback) |
| /raxx/cf-access/infisical_oidc_client_secret | Burr v1 client secret (read for rollback) |
| /raxx/cf-access/account_id | CF account ID (read for rollback issuer URL) |

Pre-work checklist

Complete all items before beginning Day 0.


Day 0: Shadow deploy (v2 dark, no traffic)

Pre-condition: All pre-work checklist items complete.

Purpose: Confirm both regional Lambda functions are running and /health returns 200 ok via the per-region ALB hostnames. No public DNS change yet.

Steps

  1. Confirm /health on both regions via per-region ALB:

curl -sf https://us-west-2.burr.raxx.app/health | python3 -m json.tool
curl -sf https://us-east-1.burr.raxx.app/health | python3 -m json.tool

     Expected response from each:

{
  "status": "ok",
  "kms": "ok",
  "ssm": "ok",
  "upstream_google_oidc": "ok",
  "region": "<region>"
}

  2. Confirm JWKS endpoint returns a valid RSA public key:

curl -sf https://us-west-2.burr.raxx.app/oidc/.well-known/jwks.json | python3 -c "
import sys, json
d = json.load(sys.stdin)
assert d.get('keys'), 'no keys'
k = d['keys'][0]
assert all(f in k for f in ['kty','kid','use','n','e']), f'missing fields: {k.keys()}'
print('JWKS ok, kid=' + k['kid'])
"

  3. Confirm both regions return the same kid value (multi-region key consistency):

KID_W=$(curl -sf https://us-west-2.burr.raxx.app/oidc/.well-known/jwks.json | python3 -c "import sys,json; print(json.load(sys.stdin)['keys'][0]['kid'])")
KID_E=$(curl -sf https://us-east-1.burr.raxx.app/oidc/.well-known/jwks.json | python3 -c "import sys,json; print(json.load(sys.stdin)['keys'][0]['kid'])")
echo "us-west-2 kid: $KID_W"
echo "us-east-1 kid: $KID_E"
# Require a non-empty kid so that two failed fetches do not compare as equal
[ -n "$KID_W" ] && [ "$KID_W" = "$KID_E" ] && echo "KID MATCH: ok" || echo "KID MISMATCH: STOP, investigate before proceeding"

  4. Log Day 0 start time (UTC) in the ops channel.

Verification: The curl checks in steps 1 to 3 all succeed. KID values match. /health returns "status": "ok" on both regions.

Rollback (Day 0): Nothing to roll back. No DNS or config changes have been made. Investigate Lambda or KMS issues using CloudWatch Logs:

aws logs tail /raxx/burr/us-west-2/audit --follow
aws logs tail /raxx/burr/us-east-1/audit --follow
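
The Day 0 checks can also be written as assertions over the parsed JSON, which is convenient when the responses are saved alongside the ops-channel log. The following is a minimal Python sketch, not part of the Burr tooling: the function names are hypothetical, and it assumes the /health and JWKS payloads have the shapes shown in steps 1 and 2.

```python
REQUIRED_JWK_FIELDS = ("kty", "kid", "use", "n", "e")
HEALTH_COMPONENTS = ("status", "kms", "ssm", "upstream_google_oidc")


def health_problems(payload: dict, region: str) -> list:
    """Return human-readable problems; an empty list means the region is healthy."""
    problems = [
        f"{name}={payload.get(name)!r}"
        for name in HEALTH_COMPONENTS
        if payload.get(name) != "ok"
    ]
    if payload.get("region") != region:
        problems.append(f"region={payload.get('region')!r}, expected {region!r}")
    return problems


def first_kid(jwks: dict) -> str:
    """Return the kid of the first JWKS key after checking the required fields."""
    keys = jwks.get("keys") or []
    if not keys:
        raise ValueError("no keys in JWKS")
    missing = [field for field in REQUIRED_JWK_FIELDS if field not in keys[0]]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return keys[0]["kid"]


def day0_ok(health_by_region: dict, jwks_by_region: dict) -> bool:
    """True when every region is healthy and all regions publish the same kid."""
    healthy = all(
        not health_problems(payload, region)
        for region, payload in health_by_region.items()
    )
    kids = {first_kid(jwks) for jwks in jwks_by_region.values()}
    return healthy and len(kids) == 1
```

Feed it the dictionaries produced by json.load on each saved response, keyed by region name. A False result corresponds to the STOP condition in step 3.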

Day 1–7: Synthetic probe verification

Pre-condition: Day 0 verification passed.

Purpose: Monitor all 10 R53 health checks continuously. No traffic routing to v2 yet. Detect any flapping, cold-start issues, or KMS latency before committing to the NS delegation cutover.

Steps

  1. Confirm all 10 health checks remain green throughout this window. Query daily:

aws route53 list-health-checks \
  --query "HealthChecks[?contains(HealthCheckConfig.FullyQualifiedDomainName, 'burr.raxx.app')].Id" \
  --output text | tr '\t' '\n' | while read id; do
  status=$(aws route53 get-health-check-status --health-check-id "$id" \
    --query "HealthCheckObservations[0].StatusReport.Status" --output text)
  echo "$id: $status"
done

     All must show Success.

  2. Check CloudWatch for any state-transition audit events (any event here means an unexpected KMS or SSM degradation):

aws logs filter-log-events \
  --log-group-name /raxx/burr/us-west-2/audit \
  --filter-pattern '{ $.event_type = "health_state_transition" }' \
  --start-time $(date -u -d '24 hours ago' +%s000 2>/dev/null || date -u -v-24H +%s000) \
  --query "events[].message" --output text

     Repeat for us-east-1. Any results must be investigated before proceeding to Day 7.

  3. CloudWatch metric check: confirm health check invocation counts are consistent with the 10-second R53 poll interval (~8,640/day per region for a single checker, from 86,400 s / 10 s; Route 53 probes each health check from multiple checker locations, so expect a multiple of this figure):

aws cloudwatch get-metric-statistics \
  --namespace AWS/Lambda \
  --metric-name Invocations \
  --dimensions Name=FunctionName,Value=burr-oidc-handler-us-west-2 \
  --start-time $(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%SZ 2>/dev/null || date -u -v-24H +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 86400 \
  --statistics Sum

Verification: 7 days of green health checks with no state-transition audit events.

Rollback (Day 1–7): No DNS changes have been made. Nothing to revert. If health checks are flapping, diagnose with CloudWatch Logs before proceeding.
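
The invocation baseline in step 3 is simple arithmetic, sketched below so the expected figure can be recomputed if the poll interval or the number of probing sources changes. This is an illustrative Python sketch: the function names, the health_checks and checkers multipliers, and the 10% tolerance are assumptions, not values taken from the Burr configuration.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def expected_daily_requests(interval_s: int, health_checks: int = 1, checkers: int = 1) -> int:
    """Requests per day for a given poll interval.

    health_checks and checkers are multipliers for the number of health checks
    targeting the function and the number of probing locations (both assumed).
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return (SECONDS_PER_DAY // interval_s) * health_checks * checkers


def within_tolerance(observed: float, expected: float, tolerance: float = 0.10) -> bool:
    """True when observed is within +/- tolerance (a fraction) of expected."""
    return abs(observed - expected) <= tolerance * expected


# One checker polling every 10 seconds gives the runbook's 8,640/day figure.
baseline = expected_daily_requests(10)
```

Compare the Sum returned by get-metric-statistics against the baseline with within_tolerance. A large shortfall suggests missed polls, and a large excess suggests additional callers or more probe sources than assumed.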


Day 7: NS delegation cutover and canary (1% traffic to v2)

Pre-condition: 7 days of clean synthetic probe verification.

Purpose: Add the burr.raxx.app NS delegation to the Cloudflare raxx.app zone so that public DNS resolves burr.raxx.app via R53. Confirm resolution from two independent resolvers. At this point 100% of new DNS resolutions for burr.raxx.app route to R53 latency records (which target Burr v2 ALBs). Infisical is still pointed at Burr v1 — no Infisical traffic reaches v2 yet.

Steps

  1. Retrieve the R53 hosted zone NS records:

ZONE_ID=$(aws route53 list-hosted-zones-by-name \
  --dns-name burr.raxx.app \
  --query "HostedZones[0].Id" --output text | sed 's|/hostedzone/||')
echo "Zone ID: $ZONE_ID"
aws route53 list-resource-record-sets \
  --hosted-zone-id "$ZONE_ID" \
  --query "ResourceRecordSets[?Type=='NS'].ResourceRecords[].Value" \
  --output text

     You will get four NS hostnames (e.g. ns-1234.awsdns-12.com., etc.).

  2. Apply the NS delegation in Terraform (Cloudflare raxx.app zone):

cd terraform/burr/dns
terraform plan -var="enable_ns_delegation=true"
# Review: confirm only the NS record for burr.raxx.app is being added
terraform apply -var="enable_ns_delegation=true"

     This step is REVERSIBLE: removing the NS record from the CF zone reverts DNS resolution to CF-only (Burr v2 ALBs remain up but unreachable via burr.raxx.app).

  3. Wait 120 seconds for CF zone propagation, then verify from two independent resolvers:

dig NS burr.raxx.app @1.1.1.1 +short
dig NS burr.raxx.app @8.8.8.8 +short

     Both must return the R53 NS hostnames retrieved in step 1.

  4. Verify burr.raxx.app latency record resolves to an ALB IP:

dig A burr.raxx.app @1.1.1.1 +short

     Must return one or more IP addresses (ALB node IPs).

  5. Confirm the public /health endpoint is reachable via the apex hostname:

curl -sf https://burr.raxx.app/health | python3 -m json.tool

     Must return "status": "ok".

  6. Confirm OIDC discovery document is reachable via apex:

curl -sf https://burr.raxx.app/oidc/.well-known/openid-configuration | python3 -m json.tool | grep issuer

     Must return "issuer": "https://burr.raxx.app/oidc".

Verification: NS delegation confirmed from two resolvers. Apex /health and discovery doc reachable.

Rollback (Day 7 — DNS step):

cd terraform/burr/dns
terraform apply -var="enable_ns_delegation=false"

Confirm NS records removed:

dig NS burr.raxx.app @1.1.1.1 +short
# Should return empty or CF-managed NS — not R53 NS
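
Comparing dig output against the R53 NS list by eye is error-prone, because resolvers may return the hostnames in a different order, case, or with and without the trailing dot. The sketch below shows one way to normalise and compare the sets in Python. The function names are hypothetical, and the inputs are assumed to be the hostname lines from step 1 and from each `dig ... +short` call.

```python
def normalize_ns(lines) -> set:
    """Lower-case each hostname, strip whitespace and the trailing dot, drop blanks."""
    return {line.strip().rstrip(".").lower() for line in lines if line.strip()}


def delegation_confirmed(expected, *resolver_outputs) -> bool:
    """True when every resolver returned exactly the expected NS set.

    An empty expected set is treated as a failure, so a failed step 1
    lookup cannot be mistaken for a successful delegation.
    """
    want = normalize_ns(expected)
    if not want or not resolver_outputs:
        return False
    return all(normalize_ns(output) == want for output in resolver_outputs)
```

For the rollback check, the inverse condition applies: after removing the delegation, normalize_ns of each resolver's output should share no hostnames with the R53 set.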

Day 7–14: Traffic ramp

Pre-condition: NS delegation confirmed. Burr v2 /health and discovery doc reachable via apex.

This phase moves Infisical to Burr v2. Each step is operator-executed. Infisical is re-configured once, to Burr v2; there is no partial traffic split at the Infisical level. The "ramp" here refers to operator confidence, not a weighted DNS split.

Step 1 — Retrieve Burr v2 client credentials

V2_CLIENT_ID=$(aws ssm get-parameter \
  --name /raxx/burr/clients/infisical/client_id \
  --query Parameter.Value --output text)

V2_CLIENT_SECRET=$(aws ssm get-parameter \
  --name /raxx/burr/clients/infisical/client_secret \
  --with-decryption \
  --query Parameter.Value --output text)

echo "client_id: $V2_CLIENT_ID"
# Do not echo client_secret to the terminal; use it directly in the Infisical config step

Step 2 — Update Infisical SSO config

In the Infisical admin panel (vault.raxx.app → Organization Settings → SSO):

| Field | New value |
|---|---|
| Issuer URL | https://burr.raxx.app/oidc |
| Client ID | (value of $V2_CLIENT_ID) |
| Client Secret | (value of $V2_CLIENT_SECRET) |
| JWKS URI | https://burr.raxx.app/oidc/.well-known/jwks.json |
| Authorization Endpoint | https://burr.raxx.app/oidc/authorize |
| Token Endpoint | https://burr.raxx.app/oidc/token |
| Userinfo Endpoint | https://burr.raxx.app/oidc/userinfo |

Save the configuration.

This step is REVERSIBLE — see Rollback procedure below.
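
Every endpoint in the SSO form is derived from the issuer URL, so the values can be cross-checked against the discovery document before saving. The following Python sketch illustrates this. The function names are hypothetical, and it assumes the Burr v2 discovery document uses the standard OpenID Connect Discovery metadata keys (issuer, jwks_uri, authorization_endpoint, token_endpoint, userinfo_endpoint).

```python
BURR_V2_ISSUER = "https://burr.raxx.app/oidc"

# Infisical form field -> standard OIDC discovery metadata key
DISCOVERY_KEYS = {
    "Issuer URL": "issuer",
    "JWKS URI": "jwks_uri",
    "Authorization Endpoint": "authorization_endpoint",
    "Token Endpoint": "token_endpoint",
    "Userinfo Endpoint": "userinfo_endpoint",
}


def burr_v2_sso_fields(issuer: str = BURR_V2_ISSUER) -> dict:
    """Build the endpoint values for the Infisical SSO form from the issuer URL."""
    base = issuer.rstrip("/")
    return {
        "Issuer URL": base,
        "JWKS URI": f"{base}/.well-known/jwks.json",
        "Authorization Endpoint": f"{base}/authorize",
        "Token Endpoint": f"{base}/token",
        "Userinfo Endpoint": f"{base}/userinfo",
    }


def mismatched_fields(discovery: dict, issuer: str = BURR_V2_ISSUER) -> list:
    """Return the form fields whose expected value differs from the discovery doc."""
    expected = burr_v2_sso_fields(issuer)
    return [
        field
        for field, key in DISCOVERY_KEYS.items()
        if discovery.get(key) != expected[field]
    ]
```

Pass the parsed JSON from https://burr.raxx.app/oidc/.well-known/openid-configuration as discovery. An empty list means the form values and the issuer's published metadata agree.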

Step 3 — Validate interactive login

  1. Open an incognito/private browser window.
  2. Navigate to vault.raxx.app and click "Login with SSO".
  3. Complete Google Workspace authentication via Burr v2.
  4. Confirm successful redirect back to Infisical and session established.
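  5. Verify the JWT iss claim is https://burr.raxx.app/oidc:

# In the browser developer tools, locate the Infisical session token (cookie or localStorage).
# Decode the JWT payload (the middle segment). JWT segments are unpadded base64url,
# which plain base64 -d can reject, so restore the padding and decode with Python:
echo "<token_payload_segment>" | python3 -c "
import sys, base64, json
seg = sys.stdin.read().strip()
print(json.dumps(json.loads(base64.urlsafe_b64decode(seg + '=' * (-len(seg) % 4))), indent=4))
" | grep iss

     Must return "iss": "https://burr.raxx.app/oidc".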

  6. Confirm that no health state-transition audit events were emitted in CloudWatch during the login test:

aws logs filter-log-events \
  --log-group-name /raxx/burr/us-west-2/audit \
  --filter-pattern '{ $.event_type = "health_state_transition" }' \
  --start-time $(date -u -d '10 minutes ago' +%s000 2>/dev/null || date -u -v-10M +%s000) \
  --query "events[].message" --output text

     No state-transition events expected (health remains ok).

Verification at Day 7: Interactive Infisical login succeeds. JWT iss is https://burr.raxx.app/oidc. No audit state-transition events.

Burr v1 status: Still alive, idle. Do not decommission.
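
The iss check in Step 3 depends on decoding the JWT payload correctly. JWT segments are base64url-encoded with the padding stripped, so the padding has to be restored before decoding. A minimal Python sketch follows. The function names are hypothetical, and it only inspects the claim: it does not verify the signature, so it is suitable for this manual check and not for authentication decisions.

```python
import base64
import json

EXPECTED_ISSUER = "https://burr.raxx.app/oidc"


def decode_jwt_payload(token: str) -> dict:
    """Decode the payload (middle segment) of a compact JWT without verifying it."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected three dot-separated segments")
    segment = parts[1]
    padded = segment + "=" * (-len(segment) % 4)  # restore stripped padding
    return json.loads(base64.urlsafe_b64decode(padded))


def issuer_is(token: str, expected: str = EXPECTED_ISSUER) -> bool:
    """True when the token's iss claim equals the expected issuer."""
    return decode_jwt_payload(token).get("iss") == expected
```

Paste the full token string into issuer_is. A False result after the SSO config change means the session was established through a different issuer and the login test should be repeated in a fresh private window.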


Day 14: Decommission v1

Pre-condition: 14-day overlap window elapsed from the Day 7 Infisical config update. Infisical sessions have naturally expired and been re-established via Burr v2. No operator has needed to roll back.

Decommission checklist

  1. Confirm Infisical is still working via Burr v2 (repeat the interactive login test from Day 7 Step 3).

  2. Confirm no Infisical sessions with Burr v1 iss claim exist (all sessions re-established via v2 within 24h of migration; by Day 14 all v1 tokens have long expired).

  3. Disable the Burr v1 CF Access SaaS application for Infisical via Terraform:

cd terraform/modules/sso-oidc-gateway
terraform plan -var="infisical_app_enabled=false"
# Review: confirm only the Infisical SaaS app is being disabled
terraform apply -var="infisical_app_enabled=false"

     This step is NOT easily reversible in-place: re-enabling will generate new client credentials, requiring an Infisical SSO config update with the new v1 values. Only proceed if Infisical login via Burr v2 is confirmed working.

  4. Schedule the Burr v1 Infisical SSM secrets for deletion (90-day retention per audit requirements):

```
aws ssm add-tags-to-resource \
  --resource-type Parameter \
  --resource-id /raxx/cf-access/infisical_oidc_client_id \
  --tags Key=scheduled_deletion,Value=$(date -u -d '90 days' +%Y-%m-%d 2>/dev/null || date -u -v+90d +%Y-%m-%d)

aws ssm add-tags-to-resource \
  --resource-type Parameter \
  --resource-id /raxx/cf-access/infisical_oidc_client_secret \
  --tags Key=scheduled_deletion,Value=$(date -u -d '90 days' +%Y-%m-%d 2>/dev/null || date -u -v+90d +%Y-%m-%d)
```

  5. Log decommission completion in the ops channel with the timestamp (UTC).

Verification: Infisical login via Burr v2 confirmed post-decommission. Burr v1 Terraform apply shows no active SaaS app.
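
Both dates in this section are plain offsets: the earliest decommission time is the Infisical config update plus the 14-day overlap, and the scheduled_deletion tag is the decommission date plus 90 days. The Python sketch below computes them without relying on GNU or BSD date flags. The function names are hypothetical and the example timestamp is illustrative only.

```python
from datetime import datetime, timedelta, timezone


def earliest_decommission(config_update_utc: datetime, overlap_days: int = 14) -> datetime:
    """Earliest moment Burr v1 may be decommissioned after the config update."""
    return config_update_utc + timedelta(days=overlap_days)


def scheduled_deletion_tag(decommission_utc: datetime, retention_days: int = 90) -> str:
    """Value for the scheduled_deletion tag, formatted as YYYY-MM-DD."""
    return (decommission_utc + timedelta(days=retention_days)).strftime("%Y-%m-%d")


# Illustrative timestamps only; substitute the UTC time logged in the ops channel.
update = datetime(2025, 1, 1, 9, 0, tzinfo=timezone.utc)
cutoff = earliest_decommission(update)
tag = scheduled_deletion_tag(datetime(2025, 1, 1, tzinfo=timezone.utc))
```

Recording the computed cutoff in the ops channel at config-update time gives the operator running the decommission an unambiguous pre-condition to check.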


Rollback procedure (any phase)

Rollback: Infisical SSO config → Burr v1

This rollback applies if Infisical authentication is failing via Burr v2 at any point before Day 14. Burr v1 infrastructure is still alive during the 14-day overlap window.

  1. Retrieve Burr v1 credentials (from SSM or your secure scratch location from pre-work):

```
V1_CF_ACCOUNT=$(aws ssm get-parameter \
  --name /raxx/cf-access/account_id \
  --query Parameter.Value --output text)

V1_CLIENT_ID=$(aws ssm get-parameter \
  --name /raxx/cf-access/infisical_oidc_client_id \
  --query Parameter.Value --output text)

V1_CLIENT_SECRET=$(aws ssm get-parameter \
  --name /raxx/cf-access/infisical_oidc_client_secret \
  --with-decryption \
  --query Parameter.Value --output text)

V1_ISSUER="https://${V1_CF_ACCOUNT}.cloudflareaccess.com"
echo "v1 issuer: $V1_ISSUER"
echo "v1 client_id: $V1_CLIENT_ID"
```

  2. Revert Infisical SSO config in the Infisical admin panel:

| Field | Rollback value |
|---|---|
| Issuer URL | https://<CF_ACCOUNT_ID>.cloudflareaccess.com |
| Client ID | (value of $V1_CLIENT_ID) |
| Client Secret | (value of $V1_CLIENT_SECRET) |
| JWKS URI | https://<CF_ACCOUNT_ID>.cloudflareaccess.com/cdn-cgi/access/certs |
| Authorization Endpoint | (CF Access SaaS app authorization endpoint) |
| Token Endpoint | (CF Access SaaS app token endpoint) |
| Userinfo Endpoint | (CF Access SaaS app userinfo endpoint) |

  3. Verify rollback with curl:

# Confirm Infisical OIDC discovery doc issuer reverted to CF
curl -sf https://vault.raxx.app/.well-known/openid-configuration | python3 -m json.tool | grep issuer
# Expected: "issuer": "https://<CF_ACCOUNT_ID>.cloudflareaccess.com"

  4. Test interactive login via Infisical.

  5. Active sessions using Burr v2 tokens: Active sessions established via Burr v2 continue working until their token expires (up to 24h max per CF Access session policy). Operators are re-prompted to authenticate via Burr v1 on next session boundary. If immediate full rollback is required (e.g. security incident), force-invalidate all Infisical sessions in the Infisical admin panel (Sessions → Revoke All).
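
During a rollback it helps to generate the issuer and JWKS values from the account ID in one place, so they are not retyped under pressure. The Python sketch below builds only the two rollback fields whose URL pattern this runbook states. The function name is hypothetical, and the authorization, token, and userinfo endpoints are deliberately left out because their values are not given here.

```python
def burr_v1_rollback_fields(cf_account_id: str, client_id: str) -> dict:
    """Build the rollback values that follow directly from the CF account ID."""
    account = cf_account_id.strip()
    if not account or any(ch in account for ch in "/ .:"):
        raise ValueError("cf_account_id must be a bare identifier")
    issuer = f"https://{account}.cloudflareaccess.com"
    return {
        "Issuer URL": issuer,
        "Client ID": client_id,
        "JWKS URI": f"{issuer}/cdn-cgi/access/certs",
    }
```

Call it with the values read from /raxx/cf-access/account_id and /raxx/cf-access/infisical_oidc_client_id. The client secret is intentionally not passed through, so it never appears in printed output.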

Rollback: NS delegation removal

If the NS delegation must be removed (e.g. Burr v2 ALBs are down and rollback to CF-only DNS is needed):

cd terraform/burr/dns
terraform apply -var="enable_ns_delegation=false"

Verify:

dig NS burr.raxx.app @1.1.1.1 +short
# Should return empty or no R53 NS records
dig NS burr.raxx.app @8.8.8.8 +short

Note: removing the NS delegation does not affect Infisical if it is still configured to Burr v1 issuer. If Infisical is already on Burr v2 when this rollback is needed, follow the Infisical SSO rollback steps above first.


Future downstream app migration template

Additional apps (e.g. Grafana) follow the same sequence. Per-app steps:

  1. Register the new client in Burr v2:

# Via the Burr v2 client-registration API or Terraform
# SSM paths: /raxx/burr/clients/<app_name>/client_id
#            /raxx/burr/clients/<app_name>/client_secret

  2. Configure the app to use Burr v2 OIDC endpoints (same values as Infisical migration, Step 2 above; only client_id and client_secret differ per app).

  3. Validate interactive login for the app using the same JWT iss claim check.

  4. 14-day overlap — keep the corresponding Burr v1 CF Access SaaS application alive for 14 days post-migration.

  5. Decommission the Burr v1 CF Access SaaS app for this specific downstream app after the overlap window closes:

cd terraform/modules/sso-oidc-gateway
terraform plan -var="<app_name>_app_enabled=false"
terraform apply -var="<app_name>_app_enabled=false"

The terraform/modules/sso-oidc-gateway/ module manages the Burr v1 CF Access SaaS applications. Check terraform output for the exact resource paths before applying.
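
The per-app names in this template all follow from the app name, which the Python sketch below makes explicit. The function names are hypothetical, and the app-name pattern (lower-case letters, digits, underscores) is an assumption chosen so the name is safe in both an SSM path and a Terraform variable name. Confirm the actual variable name in the module before applying.

```python
import re

APP_NAME_PATTERN = re.compile(r"^[a-z][a-z0-9_]*$")  # assumed naming convention


def _check(app_name: str) -> str:
    if not APP_NAME_PATTERN.match(app_name):
        raise ValueError(f"unexpected app name: {app_name!r}")
    return app_name


def client_ssm_paths(app_name: str) -> dict:
    """SSM parameter paths for a downstream app's Burr v2 client credentials."""
    base = f"/raxx/burr/clients/{_check(app_name)}"
    return {"client_id": f"{base}/client_id", "client_secret": f"{base}/client_secret"}


def v1_disable_var(app_name: str) -> str:
    """Terraform -var argument that disables the app's Burr v1 SaaS application."""
    return f"{_check(app_name)}_app_enabled=false"
```

For Infisical these reproduce the paths in the SSM parameter map and the infisical_app_enabled=false variable used in the Day 14 decommission step.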


Monitoring during migration window

Health-check failures go to the ops digest (not per-event alerts) per pre-launch digest notification policy. Check the daily digest for:

For immediate investigation of a /health failure during the migration window:

aws logs tail /raxx/burr/us-west-2/audit --follow
aws logs tail /raxx/burr/us-east-1/audit --follow