ADR-0111 — Burr v2 Migration: Self-Owned Multi-Region OIDC Gateway
Status: Proposed
Date: 2026-05-30 UTC
Deciders: Kristerpher (operator)
Scope: Post-launch migration of Infisical SSO from Burr v1 (CF Access OIDC SaaS) to Burr v2 (self-owned Lambda OIDC issuer at burr.raxx.app)
Parent epic: #1867
Refs: ADR-0083 (Infisical Google OIDC SSO via CF Access — Burr v1 baseline), ADR-0084 (Burr v2 multi-region OIDC gateway design), ADR-0082 (Terraform pipeline pattern), #1879–#1888
1. Context
What Burr v1 is
Burr v1 (ADR-0083, PR #1866) delegates OIDC issuer authority to Cloudflare Access. A CF Zero Trust SaaS application (type = saas, auth_type = oidc) exposes standard OIDC endpoints backed by Google Workspace. Downstream apps (Infisical first) trust CF Access as their OIDC issuer and never need their own Google OAuth clients.
The v1 Terraform module (terraform/modules/sso-oidc-gateway/) is the reusable pattern. It solved the bootstrapping problem: get Infisical SSO working pre-launch without building self-owned OIDC infrastructure.
What Burr v2 changes — and why
ADR-0084 (accepted 2026-05-12) documents the full architectural design. This ADR is the migration counterpart: it records the decision to execute the migration, maps the 10 implementation cards to migration phases, and surfaces the operator-action prerequisites.
The architectural reason for each cluster of cards:
- CF Access as OIDC issuer = single zone dependency. A CF Access zone outage takes all downstream SSO with it. v2 moves the issuer to operator-controlled Lambda in two AWS regions with R53 latency routing and health-check-driven failover (#1879, #1881, #1882).
- No operator-controlled key rotation. CF manages v1 signing keys; the operator cannot rotate independently. v2 uses a KMS multi-region key (primary us-west-2, replica us-east-1). KMS Sign is the only operation; private key material never leaves KMS (#1885).
- No structured DNS failover. v1 has no health checks and no tertiary fallback. v2 adds 10 R53 health checks, a latency-routed sub-zone, and a CloudFront + S3 Auth Down distribution (#1880, #1882, #1883, #1884, #1887).
- No migration runbook. Moving Infisical's SSO config from one OIDC issuer to another without a 14-day rollback window and documented rollback procedure is high-risk. Card #1886 fills that gap before any live traffic moves.
Why this is operator-AWS-heavy
Every layer of v2 lives in AWS: R53 hosted zone and health checks, API Gateway HTTP API (chosen over ALB in post-ADR-0084 operator decision Q1, locked 2026-05-12), Lambda functions, KMS multi-region key, ACM certificates, CloudFront distribution, CloudWatch alarms, SNS, WAF. None of this can be provisioned without AWS console access + Terraform credentials. Regions are us-west-2 (primary) and us-east-1 (replica/failover), matching operator's existing AWS footprint.
Why now
The operator's post-launch-unlock directive ("anything deferred to post-launch we can start working on") removes the defer:post-launch gate. Actual deployment still requires operator AWS engagement; Terraform code and runbook can be authored and reviewed in parallel.
2. Invariants
The following platform invariants apply to this migration:
- No stored credentials. OIDC signing keys reside in KMS; the Lambda calls kms:Sign and never holds private key material. Client secrets for downstream app registrations live in SSM SecureString. Nothing sensitive in Terraform source, env vars, or Lambda code.
- Audit trail for every state change. Every OIDC token issuance, health-check state transition, failover event, and JWKS rotation emits an audit event to CloudWatch Logs (/raxx/burr/<region>/audit), 90-day retention. Access logs on API Gateway stages: 90-day retention.
- Secrets in infra, not code. KMS ARNs, client IDs, and client secrets live in SSM at /raxx/burr/... paths. Zero secrets in Terraform source or Lambda deployment packages.
- GDPR by default. Burr v2 processes identity assertions (email, sub claim). Tokens are not persisted beyond their lifetime. No PII is written to durable storage. This is consistent with the session-scoped retention policy in ADR-0003.
- Passkeys / WebAuthn constraint is not altered. Burr v2 governs internal tooling SSO (operator access to Infisical), not end-user login. The passkey-only constraint on the Raxx product remains unchanged.
3. Decision
Execute the Burr v2 migration across 14 days of parallel running (v1 + v2 co-exist; v1 idle, v2 live) after all infrastructure cards pass staging validation. The migration is operator-executed per the runbook (#1886). v1 CF Access SaaS application for Infisical is decommissioned at the close of the 14-day overlap window.
4. Language choice rationale
Service: Burr v2 Lambda (OIDC token issuer)
Language tier: Tier 2 (Python, v1 default)
Rationale: The Burr Lambda is a low-traffic internal SSO service (operator-scale, not end-user scale). Traffic is measured in hundreds of daily token exchanges, not millions. Cold-start characteristics are mitigated by provisioned concurrency. The p99 latency budget for an interactive SSO flow is seconds (human-in-the-loop browser redirect), not sub-millisecond. Tier 1 criteria — p99 < 5ms, memory-safety-critical, billing PII, or operator designation — are not met. Python is the correct v1 choice.
API contract portability: Yes. The OIDC protocol contract (endpoint paths, request/response shapes, JWT format, HTTP status codes) is defined by the OIDC Core 1.0 spec. A future Rust port would implement the same HTTP routes against the same KMS and SSM dependencies. No portability risk.
5. Topology
Internet
|
+--- Cloudflare DNS: raxx.app zone
| burr.raxx.app NS → R53 (4 NS records delegated)
|
+--- Route53: burr.raxx.app hosted zone
| burr.raxx.app LATENCY → us-west-2 API GW regional domain [HC-5 us-west-2]
| burr.raxx.app LATENCY → us-east-1 API GW regional domain [HC-5 us-east-1]
| burr.raxx.app FAILOVER SECONDARY → CloudFront Auth Down
| us-west-2.burr.raxx.app ALIAS → us-west-2 API GW domain
| us-east-1.burr.raxx.app ALIAS → us-east-1 API GW domain
| auth-down.burr.raxx.app ALIAS → CloudFront Auth Down
|
+--- API Gateway HTTP API: us-west-2 (WAF + CloudWatch access logs)
| Lambda: burr-oidc-handler-us-west-2
| GET /oidc/.well-known/openid-configuration
| GET /oidc/.well-known/jwks.json
| GET /oidc/authorize
| POST /oidc/token → KMS Sign (mrk-* primary, us-west-2)
| GET /oidc/userinfo
| GET /health → KMS DescribeKey + SSM GetParameter
|
+--- API Gateway HTTP API: us-east-1 (WAF + CloudWatch access logs)
| Lambda: burr-oidc-handler-us-east-1
| (same routes) → KMS Sign (mrk-* replica, us-east-1)
|
+--- CloudFront Auth Down (tertiary failover)
| S3: raxx-auth-down-static
| /auth-down/index.html (static HTML, no JS, no cookies)
| Cache-Control: no-store
|
+--- R53 Health Checks (10 total)
HC-1 per region: OIDC discovery doc (HTTPS, string match "issuer")
HC-2 per region: JWKS endpoint (HTTPS, string match "keys")
HC-3 per region: token endpoint reachable (no 5xx)
HC-4 per region: /health endpoint (HTTPS, string match "kms":"ok")
HC-5 per region: CALCULATED (AND of HC-1..HC-4)
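HC-4 matches the literal string "kms":"ok" in the /health body, so the JSON serialisation matters. The following is a minimal sketch of a /health response builder under that constraint. The function name and the two check callables are hypothetical; in a deployment they would presumably wrap kms:DescribeKey and ssm:GetParameter, and the 503 status on failure is an assumption rather than something the cards specify.

```python
import json


def build_health_response(check_kms, check_ssm):
    """Return (status_code, body) for GET /health.

    check_kms and check_ssm are hypothetical zero-argument callables that
    raise an exception when the dependency is unreachable.
    """
    results = {}
    for name, check in (("kms", check_kms), ("ssm", check_ssm)):
        try:
            check()
            results[name] = "ok"
        except Exception:
            results[name] = "error"

    healthy = all(value == "ok" for value in results.values())
    # Compact separators are required here: the default json.dumps output is
    # '"kms": "ok"' (with a space), which does not contain the HC-4 string.
    body = json.dumps(results, separators=(",", ":"))
    return (200 if healthy else 503), body
```

Using compact separators keeps the response body and the health check search string in agreement without relying on a particular JSON library default.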
Data flow — happy path:
- Client (Infisical) resolves burr.raxx.app via R53 → gets lowest-latency regional API Gateway domain.
- Client fetches OIDC discovery doc. Sends user to /oidc/authorize.
- Burr redirects to CF Access (cloudflareaccess.com) → Google Workspace login.
- CF Access redirects back to Burr /oidc/token with authorization code.
- Burr calls kms:Sign against the regional MRK (primary in us-west-2, replica in us-east-1) → returns signed JWT (RS256).
- Infisical validates JWT against burr.raxx.app/oidc/.well-known/jwks.json.
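The signing step in the flow above can be sketched as follows. This is an illustration, not the Burr handler: build_signed_jwt and the sign callable are hypothetical names. The signer is injected so the example runs without AWS access; in a deployment it would presumably wrap the KMS Sign API with the RSASSA_PKCS1_V1_5_SHA_256 algorithm, which corresponds to RS256.

```python
import base64
import json


def b64url(data: bytes) -> str:
    """Base64url without padding, as JWS requires."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def build_signed_jwt(claims: dict, kid: str, sign) -> str:
    """Assemble a compact RS256 JWT.

    sign is a hypothetical callable taking the signing input (bytes) and
    returning the raw signature bytes. Private key material never enters
    this function; only the signature comes back.
    """
    header = {"alg": "RS256", "typ": "JWT", "kid": kid}
    signing_input = ".".join(
        b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    signature = sign(signing_input.encode("ascii"))
    return signing_input + "." + b64url(signature)
```

Passing the signer in as a callable also makes the handler testable with a stub, since the only KMS-specific code is the wrapper around the Sign call.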
Failover — single region degraded: HC-5 for the degraded region flips unhealthy within ~30s. R53 stops returning that region's alias. Surviving region absorbs 100% of traffic. No operator action.
Failover — both regions degraded: Both HC-5 calculated checks unhealthy. R53 serves the SECONDARY failover record → CloudFront Auth Down. Infisical receives the static HTML page and surfaces an error. Operator checks status.raxx.app. CloudWatch composite alarm fires to #ops-alerts with CRITICAL label.
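The two failover cases reduce to a small piece of boolean logic, sketched below as a model of the routing outcome. It is not R53 itself: the function names are hypothetical, and latency-based selection among healthy regions is deliberately left out.

```python
def calculated_check(child_results):
    """HC-5: healthy only when every child check (HC-1..HC-4) is healthy."""
    return all(child_results)


def resolve_targets(region_children):
    """Return the targets eligible to answer for burr.raxx.app.

    region_children maps a region name to its list of HC-1..HC-4 booleans.
    When no region passes its calculated check, the failover SECONDARY
    record (CloudFront Auth Down) is the only answer left.
    """
    healthy = [
        region
        for region, children in region_children.items()
        if calculated_check(children)
    ]
    return healthy if healthy else ["auth-down"]
```

For example, a single failing HC-2 in us-east-1 removes only that region, while a failing check in each region leaves just the Auth Down target.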
6. Migration sequence
14-day overlap per operator decision Q4 (ADR-0084 §9, bumped from 7-day default).
| Day | Action | Cards |
|---|---|---|
| 0 (pre-work) | Author runbook; Terraform code committed; no AWS changes | #1886 |
| 0 | Deploy CloudFront + S3 Auth Down (no DNS yet; test via raw CF domain) | #1884 |
| 0–1 | Provision R53 sub-zone + CF NS delegation | #1880 |
| 1 | Provision API Gateway HTTP API per region (ACM certs in each region) | #1879 |
| 1–2 | Implement /health endpoint in Lambda code | #1885 |
| 2 | Create 10 R53 health checks; confirm all HEALTHY via per-region subdomains | #1882 |
| 2 | Add R53 latency-based routing records (health-check-linked) | #1881 |
| 2 | Add R53 failover record → CloudFront Auth Down | #1883 |
| 2–7 | Monitor: all 10 HCs HEALTHY for 48 consecutive hours before proceeding | — |
| 2–7 | Add CloudWatch alarms + Slack alerts (can happen any time after #1879 + #1885; must be applied before cutover, since #1888 depends on all other cards) | #1887 |
| 7 | Cutover: update Infisical SSO config to Burr v2 OIDC endpoints | #1888 |
| 7–21 | 14-day overlap: Burr v1 CF Access SaaS application kept alive (idle) | — |
| 21 | Decommission Burr v1 (per decommission checklist in runbook) | #1886 |
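The "48 consecutive hours" gate before cutover can be checked mechanically. The sketch below assumes a hypothetical input shape: periodic samples of how many of the 10 health checks were HEALTHY at each timestamp, sorted oldest first. It also assumes the most recent sample still holds at the time of evaluation.

```python
from datetime import datetime, timedelta, timezone


def green_streak_ok(samples, now, required=timedelta(hours=48), expected_checks=10):
    """Return True when every check has been HEALTHY for `required` up to `now`.

    samples: list of (timestamp, healthy_count) tuples, oldest first.
    Any sample below expected_checks resets the streak.
    """
    streak_start = None
    for timestamp, healthy_count in samples:
        if healthy_count == expected_checks:
            if streak_start is None:
                streak_start = timestamp
        else:
            streak_start = None
    return streak_start is not None and now - streak_start >= required
```

A single degraded sample restarts the clock, which matches the intent of requiring consecutive rather than cumulative healthy hours.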
Card-to-phase mapping (all 10 cards):
| Card | Title | Phase |
|---|---|---|
| #1886 | Migration runbook | Pre-work (Day 0) — foundational, docs only |
| #1884 | CloudFront + S3 Auth Down | Day 0 — provision before DNS cuts |
| #1880 | R53 sub-zone + CF NS delegation | Day 0–1 — all other cards depend on this zone |
| #1879 | API Gateway HTTP API per region | Day 1 — compute layer, depends on #1880 for ACM validation |
| #1885 | /health endpoint implementation | Day 1–2 — Lambda code, depends on #1879 for routing |
| #1882 | 10 R53 health checks | Day 2 — depends on #1879 + #1885 for endpoints to probe |
| #1881 | R53 latency-based routing records | Day 2 — depends on #1882 for health check IDs |
| #1883 | R53 failover record → CloudFront | Day 2 — depends on #1880 + #1884 |
| #1887 | CloudWatch alarms + Slack alerts | Day 2–7 — depends on #1879 + #1885; can trail slightly |
| #1888 | Infisical SSO migration (operator) | Day 7 — operator action after 48h HC green |
7. Dependency graph
#1880 (R53 sub-zone)
├── #1879 (API Gateway) — ACM cert DNS validation requires zone
│   ├── #1885 (/health endpoint) — Lambda behind API GW
│   │   └── #1882 (R53 health checks) — probes API GW endpoints
│   │       └── #1881 (latency routing) — needs HC IDs
│   │           └── #1883 (failover record) — SECONDARY after latency records
│   └── #1887 (CloudWatch alarms) — alarms on HC-5 metrics + Lambda metrics
└── #1883 (failover record) — also needs the zone
#1884 (CloudFront Auth Down) — independent; #1883 consumes its output domain
#1886 (runbook) — no technical dependency; can be authored in parallel with any card
#1888 (Infisical migration) — depends on all 9 other cards deployed + 48h HC green
Hard blockers (cannot start until dependency is applied in AWS):
- Nothing can be ACM-validated until #1880 is applied and NS delegation propagates.
- #1882 cannot produce passing health checks until #1885 is deployed (Lambda returns valid /health response).
- #1881 cannot be applied without health check IDs from #1882.
- #1883 cannot be applied without CloudFront domain from #1884.
- #1888 cannot proceed until all prior cards are applied and all 10 health checks have been HEALTHY for 48h.
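The dependency graph and hard blockers above amount to a partial order, so a valid apply sequence can be derived with a topological sort. The sketch below transcribes the edges from this section; the dictionary layout and function name are illustrative only, and graphlib requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Card -> set of cards that must be applied first, transcribed from the
# dependency graph and the card-to-phase table.
DEPENDS_ON = {
    "#1880": set(),
    "#1884": set(),
    "#1886": set(),
    "#1879": {"#1880"},
    "#1885": {"#1879"},
    "#1882": {"#1879", "#1885"},
    "#1881": {"#1882"},
    "#1883": {"#1880", "#1884", "#1881"},
    "#1887": {"#1879", "#1885"},
}
# The Infisical cutover waits on all nine other cards.
DEPENDS_ON["#1888"] = set(DEPENDS_ON)


def apply_order():
    """Return one valid apply order; raises CycleError if the graph loops."""
    return list(TopologicalSorter(DEPENDS_ON).static_order())
```

Several orders are valid (for instance, #1884 and #1886 can move freely), so the result is one acceptable sequence rather than the required one.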
8. What ships in code vs operator-action
| Card | Deliverable | Classification |
|---|---|---|
| #1879 | Terraform: aws_apigatewayv2_*, ACM certs, WAF WebACL | 🟡 Mixed — Terraform code in repo + operator applies in AWS |
| #1880 | Terraform: R53 zone + CF NS delegation records | 🟡 Mixed — Terraform code in repo + operator applies (touches live CF zone) |
| #1881 | Terraform: R53 latency ALIAS records | 🟡 Mixed — Terraform code + operator applies after #1882 |
| #1882 | Terraform: 10 aws_route53_health_check resources | 🟡 Mixed — Terraform code + operator applies |
| #1883 | Terraform: R53 failover SECONDARY record | 🟡 Mixed — Terraform code + operator applies after #1884 |
| #1884 | Terraform: S3 bucket, OAC, CloudFront distribution + placeholder HTML | 🟡 Mixed — S3 HTML content codable; CloudFront provisioning operator-action |
| #1885 | Lambda code: /health handler implementation | ✏️ Codable — pure Lambda code; no AWS provisioning needed beyond deploying the function |
| #1886 | Runbook doc: docs/runbooks/burr-v2-migration.md | ✏️ Codable — docs only, no AWS |
| #1887 | Terraform: CloudWatch alarms, SNS, composite alarm, dashboard | 🟡 Mixed — Terraform code + operator applies; note R53 metrics publish only to us-east-1 |
| #1888 | Operator action: update Infisical SSO config | 🔴 AWS/Infisical-only — pure operator execution per runbook |
Summary: 2 ✏️ Codable, 7 🟡 Mixed, 1 🔴 AWS-only
The two purely-codable cards (#1885, #1886) are the right first claims for feature-developer — they unblock the rest and require no AWS credentials.
9. Operator-AWS prerequisites
Regions: us-west-2 (primary) and us-east-1 (replica + R53 metrics + CloudFront ACM). Both regions must be enabled in the operator's AWS account.
IAM roles required (new):
| Role | Purpose | Permissions |
|---|---|---|
| burr-lambda-execution-us-west-2 | Lambda function role | kms:Sign, kms:DescribeKey on MRK primary; ssm:GetParameter on /raxx/burr/us-west-2/*; CloudWatch Logs write |
| burr-lambda-execution-us-east-1 | Lambda function role | kms:Sign, kms:DescribeKey on MRK replica; ssm:GetParameter on /raxx/burr/us-east-1/*; CloudWatch Logs write |
| burr-terraform-deploy | CI/CD Terraform apply role | Scoped to terraform/burr-v2/ resources only per ADR-0082 |
KMS multi-region key: Operator creates primary in us-west-2, adds replica in us-east-1. Key policy restricts kms:Sign to the two Lambda execution roles only.
Cost at v2 steady-state (operator-scale traffic, per ADR-0084 §10):
| Service | Monthly |
|---|---|
| API Gateway HTTP API (replaces ALB) | ~$1.00 |
| Lambda provisioned concurrency (2 instances × 2 regions) | ~$3.50 |
| R53 health checks (10 × $0.50) | $5.00 |
| KMS MRK primary + replica | $2.00 |
| CloudFront + S3 Auth Down | <$1.00 |
| CloudWatch alarms + dashboard | <$1.00 |
| Total | ~$13/month |
Note: API Gateway replaces the ALB ($32/month) from the ADR-0084 §10 estimate. Operator decision Q1 (locked 2026-05-12) confirmed API Gateway HTTP API. This brings the steady-state cost down from ~$43/month to ~$13/month.
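The arithmetic behind the ~$13 and ~$43 figures can be reproduced from the table. In this sketch the two "<$1.00" rows are treated as a range from 0 to 1 (an assumption made only to bound the total), and the names are illustrative.

```python
# (low, high) monthly cost in USD, taken from the cost table above.
LINE_ITEMS = {
    "api_gateway_http": (1.00, 1.00),
    "lambda_provisioned_concurrency": (3.50, 3.50),
    "r53_health_checks": (10 * 0.50, 10 * 0.50),
    "kms_mrk": (2.00, 2.00),
    "cloudfront_s3_auth_down": (0.00, 1.00),  # "<$1.00"
    "cloudwatch": (0.00, 1.00),               # "<$1.00"
}
ALB_MONTHLY = 32.00  # ~$16/month/region across two regions


def total_range(items):
    """Return (low, high) bounds on the monthly total."""
    low = sum(lo for lo, _ in items.values())
    high = sum(hi for _, hi in items.values())
    return low, high


def total_range_with_alb(items):
    """Same total with the ALB swapped in for the API Gateway row."""
    swapped = dict(items)
    swapped["api_gateway_http"] = (ALB_MONTHLY, ALB_MONTHLY)
    return total_range(swapped)
```

Under these assumptions the API Gateway total falls between $11.50 and $13.50 and the ALB variant between $42.50 and $44.50, consistent with the rounded ~$13 and ~$43 quoted above.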
v1 resources decommissioned after 14-day overlap:
- CF Access SaaS application for Infisical (Terraform resource in terraform/modules/sso-oidc-gateway/): terraform destroy -target cloudflare_access_application.infisical_oidc
- SSM parameters at /raxx/cf-access/infisical_oidc_client_secret and /raxx/cf-access/infisical_oidc_client_id: scheduled for deletion (90-day archive window per runbook)
- No R53, Lambda, KMS, or CloudFront resources to destroy (those are v2 resources)
- No IAM roles to revoke in v1 (CF Access SaaS apps have no Lambda execution role)
10. Decommission v1
After day 21 (14-day overlap cleared), decommission checklist:
Step 1 — Confirm v2 health (rollback gate):
- Verify all 10 R53 health checks HEALTHY.
- Verify curl https://burr.raxx.app/oidc/.well-known/openid-configuration returns HTTP 200 with "issuer":"https://burr.raxx.app/oidc".
- Verify Infisical interactive login completes via Burr v2.
- Rollback: if any check fails, do NOT proceed. Revert Infisical SSO config to v1 values from SSM backup.
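The discovery-document check in Step 1 can be scripted so the rollback gate is not judged by eye. A minimal sketch follows: the function name is hypothetical, it operates on an already-fetched status code and body (fetching is left to curl or any HTTP client), and the jwks_uri presence check is an added assumption beyond the issuer match described above.

```python
import json

EXPECTED_ISSUER = "https://burr.raxx.app/oidc"


def discovery_problems(status_code, body_text):
    """Return a list of problems; an empty list means the gate passes."""
    if status_code != 200:
        return [f"unexpected HTTP status {status_code}"]
    try:
        doc = json.loads(body_text)
    except ValueError:
        return ["body is not valid JSON"]
    if not isinstance(doc, dict):
        return ["body is not a JSON object"]

    problems = []
    if doc.get("issuer") != EXPECTED_ISSUER:
        problems.append(f"issuer is {doc.get('issuer')!r}, expected {EXPECTED_ISSUER!r}")
    if not doc.get("jwks_uri"):
        problems.append("jwks_uri missing")
    return problems
```

A static Auth Down page served during a dual-region outage would fail the JSON parse here, so the check also distinguishes "v2 is down" from "v2 is misconfigured".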
Step 2 — Disable v1 CF Access SaaS application:
terraform -chdir=terraform/modules/sso-oidc-gateway plan -target cloudflare_access_application.infisical_oidc -var 'enabled=false'
# Review plan: confirm only the one app resource is modified
# Pass the same -var to apply; otherwise apply does not use the value that was planned
terraform -chdir=terraform/modules/sso-oidc-gateway apply -target cloudflare_access_application.infisical_oidc -var 'enabled=false'
Rollback: re-enable the CF Access app (set enabled=true) and revert Infisical SSO config.
Step 3 — Archive v1 SSM secrets (90-day deletion schedule):
- Tag /raxx/cf-access/infisical_oidc_client_secret and /raxx/cf-access/infisical_oidc_client_id with ScheduledDeletion: <date+90days>.
- Do not delete immediately — the 90-day archive provides last-resort rollback evidence.
- Rollback window: restore tag + update Infisical SSO config (v1 credentials still in SSM until scheduled deletion date).
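The ScheduledDeletion value in Step 3 is a date 90 days after decommission. The sketch below computes it and builds a tag payload per parameter. The ISO date format and the function name are assumptions; the Key/Value list shape is the one the SSM tagging API accepts, but the call itself is omitted so the example runs without AWS access.

```python
from datetime import date, timedelta

V1_PARAMETERS = (
    "/raxx/cf-access/infisical_oidc_client_secret",
    "/raxx/cf-access/infisical_oidc_client_id",
)


def scheduled_deletion_tags(decommission_day: date, archive_days: int = 90):
    """Map each v1 SSM parameter name to its ScheduledDeletion tag list."""
    delete_on = decommission_day + timedelta(days=archive_days)
    tag = {"Key": "ScheduledDeletion", "Value": delete_on.isoformat()}
    return {name: [dict(tag)] for name in V1_PARAMETERS}
```

Computing the date in code avoids off-by-a-month mistakes when the 90-day window crosses February or a year boundary.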
Step 4 — Confirm Infisical sessions still working:
- 24 hours post-decommission, verify no Infisical SSO errors in CloudWatch logs or ops@ inbox.
- Rollback: if Infisical sessions fail, check whether the v1 CF Access app was the active issuer somehow (it should not be, but confirm via Infisical admin UI).
11. Security / GDPR checklist
- PII collected: Identity assertions (email, sub claim) in JWT payloads. These flow through Lambda and are not persisted. Access logs contain no PII fields (request path only, no Authorization header logging). Audit log for token issuance records client_id and timestamp; no email or sub in the audit log.
- Retention period: JWT lifetime = 1 hour (access token), 24 hours (session). CloudWatch access logs = 90 days. Audit logs = 90 days. SSM parameters: indefinite until decommission (archived with 90-day deletion schedule post-v2 migration).
- Deletion on DSR: Burr v2 holds no durable PII. DSR for a specific operator identity requires only confirming the 90-day log window; CloudWatch log purge is the mechanism. No separate deletion procedure needed.
- Audit trail: Token issuances, health-check state transitions (ok→error only), and JWKS key rotation events are written to /raxx/burr/<region>/audit. Steady-state healthy /health poll responses are not logged (log volume management). Every API Gateway request is access-logged (structured JSON, no PII fields).
- Stored credentials: None. KMS holds private key material and never exposes it. Client secrets are in SSM SecureString. Lambda has no persistent storage.
- Breach notification path: A KMS key compromise or SSM credential exposure triggers: (1) rotate KMS key (creates new kid; old kid removed from JWKS on next deploy), (2) rotate all client secrets in SSM, (3) notify ops@raxx.app within 24 hours per GDPR breach notification posture (ADR-0003). Infisical sessions issued with the compromised key are invalidated when the JWKS endpoint stops returning the old kid.
- Secrets location + rotation: KMS MRK: operator-triggered manual rotation. AWS KMS automatic rotation applies only to symmetric encryption keys, so the asymmetric RS256 signing key cannot use it; an annual rotation has to be scheduled as a manual procedure. SSM SecureString client secrets: rotatable without Lambda redeploy (Lambda reads at cold start; forced cold start via version bump). Rotation does not require a Terraform plan.
- Kill-switch: Setting Lambda reserved concurrency to 0 in either region immediately stops that region from serving OIDC requests. HC-4 trips within 30 seconds and R53 fails over. For a full shutdown: both regions concurrency = 0 → traffic falls to CloudFront Auth Down. Operator can execute via AWS console or CLI without a Terraform change.
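The "no email or sub in the audit log" rule is easiest to keep if audit records are built from an allow-list rather than by removing known PII fields. A minimal sketch, with hypothetical names and an assumed "event" field alongside the client_id and timestamp named above:

```python
import json

# Only these fields may ever reach the audit log.
AUDIT_FIELDS = ("event", "client_id", "timestamp")


def build_audit_line(raw: dict) -> str:
    """Serialise an audit record, silently dropping anything not allow-listed."""
    record = {name: raw[name] for name in AUDIT_FIELDS if name in raw}
    return json.dumps(record, separators=(",", ":"), sort_keys=True)
```

With an allow-list, a new claim added to the token later cannot leak into the audit log by default, whereas a deny-list would need updating each time.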
12. Alternatives considered
Alternative A: Keep Burr v1 permanently (CF Access as OIDC issuer)
Rejected because: CF Access zone availability becomes a single point of failure for all internal tooling SSO. Operator cannot control signing key rotation or perform independent health monitoring. Acceptable for one app at v1 scale; not acceptable post-launch with Grafana and additional internal tools joining.
Alternative B: Migrate to a third-party OIDC SaaS (Auth0, Okta)
Rejected because: Adds a paid third-party dependency for an internal-tooling-only use case. Operator-scale traffic (tens of sessions per day) does not justify the cost or vendor lock-in. Building on Lambda + KMS keeps the attack surface under operator control and avoids SaaS IdP pricing risk.
Alternative C: ALB instead of API Gateway
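Considered in ADR-0084 §10 open question Q1. Rejected post-ADR by operator decision (2026-05-12): API Gateway HTTP API costs ~$1/month vs ALB's ~$16/month/region at operator scale. API Gateway HTTP APIs support custom domain names and CloudWatch access logging as this use case requires. One point to verify before #1879 is applied: AWS WAF web ACLs associate directly with ALBs and with API Gateway REST API stages, but not with HTTP APIs, so the WAF WebACL in #1879 may need a different attachment point.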
13. Consequences
Positive
- Operator-controlled OIDC issuer; no CF Access dependency for internal tooling SSO.
- Independent key rotation without downstream app changes (JWKS endpoint serves both active and previous kid during rotation window).
- Health-check-driven automatic regional failover within ~30 seconds.
- Static Auth Down page prevents silent failures during dual-region outage.
- Steady-state cost ~$13/month (vs ~$43/month if ALB had been chosen).
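The rotation-window behaviour in the second bullet can be sketched as a JWKS document that lists the active key first and the previous key alongside it. The function names and input shape are hypothetical; n and e stand for the base64url-encoded RSA public key parameters that a deployment would derive from the KMS public key.

```python
from typing import Optional


def public_jwk(key: dict) -> dict:
    """Project a key record onto the public JWK fields for an RS256 key."""
    return {
        "kty": "RSA",
        "use": "sig",
        "alg": "RS256",
        "kid": key["kid"],
        "n": key["n"],
        "e": key["e"],
    }


def build_jwks(active: dict, previous: Optional[dict] = None) -> dict:
    """Return the JWKS body; previous is included only during rotation."""
    keys = [public_jwk(active)]
    if previous is not None:
        keys.append(public_jwk(previous))
    return {"keys": keys}
```

Tokens signed before the rotation still verify while the previous kid is listed, and dropping it from the document is what ends their validity.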
Negative / risks
- AWS operational surface: 10 R53 health checks, 2 API Gateways, 2 Lambda functions, 1 KMS MRK + replica, 1 CloudFront distribution, 2 CloudWatch alarm sets. More components = more potential failure modes to monitor.
- During the 14-day overlap window, infrastructure for two OIDC issuers exists at once. Burr v1 is idle (not serving traffic) while Burr v2 is live, but v1 remains a CF Access SaaS application and carries minor overhead until decommissioned.
- The kms:Sign call on every token issuance adds ~5–15ms latency vs signing in-process. Acceptable for interactive SSO flows.
Neutral
- Lambda provisioned concurrency cost ($3.50/month) is incurred even at zero traffic. This is the trade-off for eliminating cold-start false-positive health check failures.
- Terraform root placement (terraform/burr-v2/) is implementer's decision per card #1880 scope note.
14. Open questions (require operator decision before cards can be claimed)
- AWS region selection confirmed? ADR-0084 and the 10 cards assume us-west-2 (primary) + us-east-1 (replica). If operator wants a different region pair (e.g., us-east-1 primary + eu-west-1 for GDPR-sensitive tooling future), the KMS MRK provisioning and ACM certificate configuration change. Confirm before #1879 is claimed.
- CloudFront price class? Card #1884 hardcodes PriceClass_100 (US + Europe only, cheapest). If operator wants global PoP coverage (PriceClass_All), update before #1884 is claimed. At internal-tooling scale PriceClass_100 is appropriate; surfacing for explicit confirmation.
- Monthly cost ceiling / auto-shutdown? No budget alarm is specified in the 10 cards. At ~$13/month there is no cost risk, but confirming whether operator wants a CloudWatch billing alarm (e.g., alert at $25/month) before #1887 is claimed would allow it to be included in that card's scope.
- Migration kickoff date? All 9 Terraform + code cards can be claimed immediately. The Infisical cutover (#1888) is the date-sensitive step. Operator should confirm the maintenance window (low-activity period) before #1888 moves to in-progress, per the active-session note in #1886.
15. References
- ADR-0083 — Infisical Google OIDC SSO via Cloudflare Access (Burr v1 baseline)
- ADR-0084 — Burr v2 multi-region OIDC gateway (full architecture design; this ADR is the migration decision record)
- ADR-0082 — Terraform pipeline pattern (per-root workflow, OIDC credentials)
- ADR-0003 — Platform auth + GDPR retention posture
- ADR-0077 — WAF layered defense (pattern reused for API Gateway WAF WebACL)
- #1867 — Burr v2 parent epic
- #1879–#1888 — 10 implementation sub-cards
- terraform/modules/sso-oidc-gateway/ — Burr v1 module (retained for non-Infisical CF-Access-as-issuer use cases)
- docs/architecture/language-tier-policy.md
- project_burr_sso_gateway — codename and architectural role
- project_infisical_sso_not_pursued — vault SSO deferred; Burr v2 is the post-launch migration target
Revisit when
- Queue Phase 2 ships any operator-identity changes that affect how Infisical authentication integrates with internal RBAC.
- A second downstream app (Grafana) is ready to migrate to Burr v2; at that point the runbook's "future app migration template" section (#1886) should be exercised and this ADR updated with the second migration record.
- Operator designates Burr v2 as Tier 1 (Rust rewrite) — this ADR's language rationale section would then be superseded by the Tier 1 election ADR.