Raxx · internal docs

ADR-0111 — Burr v2 Migration: Self-Owned Multi-Region OIDC Gateway

Status: Proposed
Date: 2026-05-30 UTC
Deciders: Kristerpher (operator)
Scope: Post-launch migration of Infisical SSO from Burr v1 (CF Access OIDC SaaS) to Burr v2 (self-owned Lambda OIDC issuer at burr.raxx.app)
Parent epic: #1867
Refs: ADR-0083 (Infisical Google OIDC SSO via CF Access; Burr v1 baseline), ADR-0084 (Burr v2 multi-region OIDC gateway design), ADR-0082 (Terraform pipeline pattern), #1879–#1888


1. Context

What Burr v1 is

Burr v1 (ADR-0083, PR #1866) delegates OIDC issuer authority to Cloudflare Access. A CF Zero Trust SaaS application (type = saas, auth_type = oidc) exposes standard OIDC endpoints backed by Google Workspace. Downstream apps (Infisical first) trust CF Access as their OIDC issuer and never need their own Google OAuth clients.

The v1 Terraform module (terraform/modules/sso-oidc-gateway/) is the reusable pattern. It solved the bootstrapping problem: get Infisical SSO working pre-launch without building self-owned OIDC infrastructure.

What Burr v2 changes — and why

ADR-0084 (accepted 2026-05-12) documents the full architectural design. This ADR is the migration counterpart: it records the decision to execute the migration, maps the 10 implementation cards to migration phases, and surfaces the operator-action prerequisites.

The architectural reason for each cluster of cards:

Why this is operator-AWS-heavy

Every layer of v2 lives in AWS: R53 hosted zone and health checks, API Gateway HTTP API (chosen over ALB in post-ADR-0084 operator decision Q1, locked 2026-05-12), Lambda functions, KMS multi-region key, ACM certificates, CloudFront distribution, CloudWatch alarms, SNS, WAF. None of this can be provisioned without AWS console access + Terraform credentials. Regions are us-west-2 (primary) and us-east-1 (replica/failover), matching operator's existing AWS footprint.

Why now

The operator's post-launch-unlock directive ("anything deferred to post-launch we can start working on") removes the defer:post-launch gate. Actual deployment still requires operator AWS engagement; Terraform code and runbook can be authored and reviewed in parallel.


2. Invariants

The following platform invariants apply to this migration:


3. Decision

Execute the Burr v2 migration across 14 days of parallel running (v1 + v2 co-exist; v1 idle, v2 live) after all infrastructure cards pass staging validation. The migration is operator-executed per the runbook (#1886). v1 CF Access SaaS application for Infisical is decommissioned at the close of the 14-day overlap window.


4. Language choice rationale

Service: Burr v2 Lambda (OIDC token issuer)

Language tier: Tier 2 (Python, v1 default)

Rationale: The Burr Lambda is a low-traffic internal SSO service (operator-scale, not end-user scale). Traffic is measured in hundreds of daily token exchanges, not millions. Cold-start characteristics are mitigated by provisioned concurrency. The p99 latency budget for an interactive SSO flow is seconds (human-in-the-loop browser redirect), not sub-millisecond. Tier 1 criteria — p99 < 5ms, memory-safety-critical, billing PII, or operator designation — are not met. Python is the correct v1 choice.

API contract portability: Yes. The OIDC protocol contract (endpoint paths, request/response shapes, JWT format, HTTP status codes) is defined by the OIDC Core 1.0 spec. A future Rust port would implement the same HTTP routes against the same KMS and SSM dependencies. No portability risk.


5. Topology

Internet
  |
  +--- Cloudflare DNS: raxx.app zone
  |       burr.raxx.app NS → R53 (4 NS records delegated)
  |
  +--- Route53: burr.raxx.app hosted zone
  |       burr.raxx.app  LATENCY  → us-west-2 API GW regional domain  [HC-5 us-west-2]
  |       burr.raxx.app  LATENCY  → us-east-1 API GW regional domain  [HC-5 us-east-1]
  |       burr.raxx.app  FAILOVER SECONDARY  → CloudFront Auth Down
  |       us-west-2.burr.raxx.app  ALIAS  → us-west-2 API GW domain
  |       us-east-1.burr.raxx.app  ALIAS  → us-east-1 API GW domain
  |       auth-down.burr.raxx.app  ALIAS  → CloudFront Auth Down
  |
  +--- API Gateway HTTP API: us-west-2        (WAF + CloudWatch access logs)
  |       Lambda: burr-oidc-handler-us-west-2
  |         GET  /oidc/.well-known/openid-configuration
  |         GET  /oidc/.well-known/jwks.json
  |         GET  /oidc/authorize
  |         POST /oidc/token  →  KMS Sign (mrk-* primary, us-west-2)
  |         GET  /oidc/userinfo
  |         GET  /health      →  KMS DescribeKey + SSM GetParameter
  |
  +--- API Gateway HTTP API: us-east-1        (WAF + CloudWatch access logs)
  |       Lambda: burr-oidc-handler-us-east-1
  |         (same routes)    →  KMS Sign (mrk-* replica, us-east-1)
  |
  +--- CloudFront Auth Down (tertiary failover)
  |       S3: raxx-auth-down-static
  |         /auth-down/index.html  (static HTML, no JS, no cookies)
  |         Cache-Control: no-store
  |
  +--- R53 Health Checks (10 total)
          HC-1 per region: OIDC discovery doc (HTTPS, string match "issuer")
          HC-2 per region: JWKS endpoint (HTTPS, string match "keys")
          HC-3 per region: token endpoint reachable (no 5xx)
          HC-4 per region: /health endpoint (HTTPS, string match "kms":"ok")
          HC-5 per region: CALCULATED (AND of HC-1..HC-4)
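
A minimal sketch of the /health route that HC-4 probes, assuming the handler receives boto3-style KMS and SSM clients. The function name, the "ssm" field, and the 503 status on failure are illustrative assumptions, not the implementation tracked in #1885. The part worth noting is the compact JSON encoding: the HC-4 string match looks for "kms":"ok" with no space after the colon, which the default json.dumps output would not produce.

```python
import json


def build_health_response(kms_client, ssm_client, key_id, param_name):
    """Probe KMS and SSM, then return an API Gateway HTTP API style response.

    kms_client / ssm_client are expected to behave like boto3 clients
    (describe_key and get_parameter are real boto3 operations).
    key_id and param_name are hypothetical inputs supplied by the caller.
    """
    checks = {}
    try:
        kms_client.describe_key(KeyId=key_id)
        checks["kms"] = "ok"
    except Exception:
        checks["kms"] = "error"
    try:
        ssm_client.get_parameter(Name=param_name)
        checks["ssm"] = "ok"
    except Exception:
        checks["ssm"] = "error"

    healthy = all(value == "ok" for value in checks.values())
    return {
        "statusCode": 200 if healthy else 503,  # assumption: 503 when degraded
        "headers": {"content-type": "application/json", "cache-control": "no-store"},
        # Compact separators: default json.dumps emits '"kms": "ok"' (with a
        # space), which a string match on "kms":"ok" would miss.
        "body": json.dumps(checks, separators=(",", ":")),
    }
```

Passing the clients in as arguments keeps the probe logic testable without AWS credentials; in the Lambda itself they would presumably be created once at module load.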

Data flow — happy path:

  1. Client (Infisical) resolves burr.raxx.app via R53 → gets lowest-latency regional API Gateway domain.
  2. Client fetches OIDC discovery doc. Sends user to /oidc/authorize.
  3. Burr redirects to CF Access (cloudflareaccess.com) → Google Workspace login.
  4. CF Access redirects back to Burr /oidc/token with authorization code.
  5. Burr calls kms:Sign against the MRK in the serving region (primary in us-west-2, replica in us-east-1) → returns signed JWT (RS256).
  6. Infisical validates JWT against burr.raxx.app/oidc/.well-known/jwks.json.
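
Step 5 can be sketched as follows, assuming the token is assembled by hand and only the signature comes from KMS. The helper names (b64url, sign_jwt, kms_signer) and the kid value are hypothetical. The kms.sign parameters shown are the standard boto3 ones, and RSASSA_PKCS1_V1_5_SHA_256 is the KMS algorithm that corresponds to RS256.

```python
import base64
import json


def b64url(data: bytes) -> str:
    """Base64url without padding, as JWTs require."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def sign_jwt(claims: dict, kid: str, sign) -> str:
    """Build header.payload.signature; `sign` maps message bytes to signature bytes."""
    header = {"alg": "RS256", "typ": "JWT", "kid": kid}
    signing_input = (
        b64url(json.dumps(header, separators=(",", ":")).encode("utf-8"))
        + "."
        + b64url(json.dumps(claims, separators=(",", ":")).encode("utf-8"))
    )
    signature = sign(signing_input.encode("ascii"))
    return signing_input + "." + b64url(signature)


def kms_signer(kms_client, key_id: str):
    """Wrap a boto3-style KMS client as a `sign` callable for sign_jwt."""
    def _sign(message: bytes) -> bytes:
        response = kms_client.sign(
            KeyId=key_id,
            Message=message,
            MessageType="RAW",  # KMS hashes the message itself
            SigningAlgorithm="RSASSA_PKCS1_V1_5_SHA_256",
        )
        return response["Signature"]
    return _sign
```

Keeping the signer as a plain callable means the same token-building code works against either regional key; only the client and key ID passed to kms_signer differ per region.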

Failover — single region degraded: HC-5 for the degraded region flips unhealthy within ~30s. R53 stops returning that region's alias. Surviving region absorbs 100% of traffic. No operator action.

Failover — both regions degraded: Both HC-5 calculated checks unhealthy. R53 serves the SECONDARY failover record → CloudFront Auth Down. Infisical receives the static HTML page and surfaces an error. Operator checks status.raxx.app. CloudWatch composite alarm fires to #ops-alerts with CRITICAL label.
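
The two failover cases reduce to a small decision rule. The sketch below models only the outcome described above (it is not how Route53 is configured or queried), and the function names and the "auth-down" label are illustrative.

```python
def calculated_hc5(hc1: bool, hc2: bool, hc3: bool, hc4: bool) -> bool:
    """HC-5 is the logical AND of the four endpoint checks for one region."""
    return hc1 and hc2 and hc3 and hc4


def served_targets(region_checks: dict) -> list:
    """Return the regions eligible for latency routing, or the Auth Down fallback.

    region_checks maps a region name to its (HC-1, HC-2, HC-3, HC-4) results.
    """
    healthy = [
        region
        for region, checks in region_checks.items()
        if calculated_hc5(*checks)
    ]
    # With no healthy region, the SECONDARY failover record takes over.
    return healthy if healthy else ["auth-down"]


# Single region degraded: the surviving region absorbs all traffic.
print(served_targets({
    "us-west-2": (True, True, True, False),
    "us-east-1": (True, True, True, True),
}))
```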


6. Migration sequence

14-day overlap per operator decision Q4 (ADR-0084 §9, bumped from 7-day default).

| Day | Action | Cards |
| --- | --- | --- |
| 0 (pre-work) | Author runbook; Terraform code committed; no AWS changes | #1886 |
| 0 | Deploy CloudFront + S3 Auth Down (no DNS yet; test via raw CloudFront domain) | #1884 |
| 0–1 | Provision R53 sub-zone + CF NS delegation | #1880 |
| 1 | Provision API Gateway HTTP API per region (ACM certs in each region) | #1879 |
| 1–2 | Implement /health endpoint in Lambda code | #1885 |
| 2 | Create 10 R53 health checks; confirm all HEALTHY via per-region subdomains | #1882 |
| 2 | Add R53 latency-based routing records (health-check-linked) | #1881 |
| 2 | Add R53 failover record → CloudFront Auth Down | #1883 |
| 2–7 | Monitor: all 10 HCs HEALTHY for 48 consecutive hours before proceeding | |
| 2–7 | Add CloudWatch alarms + Slack alerts (any time after #1879 + #1885, but before cutover) | #1887 |
| 7 | Cutover: update Infisical SSO config to Burr v2 OIDC endpoints | #1888 |
| 7–21 | 14-day overlap: Burr v1 CF Access SaaS application kept alive (idle) | |
| 21 | Decommission Burr v1 (per decommission checklist in runbook) | #1886 |
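
The "48 consecutive hours" gate before cutover can be checked mechanically. This is a sketch under the assumption that health-check status is sampled periodically into (timestamp, all-ten-healthy) pairs; the sampling source, the function name, and the example start date are hypothetical.

```python
from datetime import datetime, timedelta


def green_long_enough(samples, required=timedelta(hours=48)):
    """Return True if the latest unbroken all-healthy streak spans `required`.

    `samples` is a list of (timestamp, all_ten_healthy) tuples sorted by time.
    """
    streak_start = None
    for ts, all_healthy in samples:
        if not all_healthy:
            streak_start = None  # any red sample resets the clock
        elif streak_start is None:
            streak_start = ts    # first green sample of a new streak
    if streak_start is None:
        return False
    return samples[-1][0] - streak_start >= required


# Hourly samples over 49 points cover exactly 48 hours of green.
start = datetime(2026, 6, 1)
hourly = [(start + timedelta(hours=h), True) for h in range(49)]
print(green_long_enough(hourly))
```

A single unhealthy sample restarts the window, which matches the "consecutive" wording of the gate.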

Card-to-phase mapping (all 10 cards):

| Card | Title | Phase |
| --- | --- | --- |
| #1886 | Migration runbook | Pre-work (Day 0); foundational, docs only |
| #1884 | CloudFront + S3 Auth Down | Day 0; provision before DNS cuts |
| #1880 | R53 sub-zone + CF NS delegation | Day 0–1; all other cards depend on this zone |
| #1879 | API Gateway HTTP API per region | Day 1; compute layer, depends on #1880 for ACM validation |
| #1885 | /health endpoint implementation | Day 1–2; Lambda code, depends on #1879 for routing |
| #1882 | 10 R53 health checks | Day 2; depends on #1879 + #1885 for endpoints to probe |
| #1881 | R53 latency-based routing records | Day 2; depends on #1882 for health check IDs |
| #1883 | R53 failover record → CloudFront | Day 2; depends on #1880 + #1884 |
| #1887 | CloudWatch alarms + Slack alerts | Day 2–7; depends on #1879 + #1885; can trail slightly |
| #1888 | Infisical SSO migration (operator) | Day 7; operator action after 48h HC green |

7. Dependency graph

#1880 (R53 sub-zone)
  ├── #1879 (API Gateway) — ACM cert DNS validation requires zone
  │     ├── #1885 (/health endpoint) — Lambda behind API GW
  │     │     └── #1882 (R53 health checks) — probes API GW endpoints
  │     │           └── #1881 (latency routing) — needs HC IDs
  │     │                 └── #1883 (failover record) — SECONDARY after latency records
  │     └── #1887 (CloudWatch alarms) — alarms on HC-5 metrics + Lambda metrics
  └── #1883 (failover record) — also needs the zone

#1884 (CloudFront Auth Down) — independent; #1883 consumes its output domain

#1886 (runbook) — no technical dependency; can be authored in parallel with any card
#1888 (Infisical migration) — depends on all 9 other cards deployed + 48h HC green
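
The graph above can be turned into a valid apply order with a topological sort. The edges below are transcribed from the graph and the card-to-phase mapping. The DEPENDS_ON name is illustrative, and #1886 is listed with no prerequisites because it has no technical dependency.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# card -> set of cards that must be applied first
DEPENDS_ON = {
    1880: set(),
    1884: set(),
    1886: set(),
    1879: {1880},
    1885: {1879},
    1882: {1879, 1885},
    1881: {1882},
    1883: {1880, 1881, 1884},
    1887: {1879, 1885},
}
# The Infisical cutover waits on all nine other cards.
DEPENDS_ON[1888] = set(DEPENDS_ON)

order = list(TopologicalSorter(DEPENDS_ON).static_order())
print(order)
```

Any order the sorter returns respects every edge; cards with no path between them (for example #1884 and #1880) may appear in either order, which reflects work that can proceed in parallel.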

Hard blockers (cannot start until dependency is applied in AWS):


8. What ships in code vs operator-action

| Card | Deliverable | Classification |
| --- | --- | --- |
| #1879 | Terraform: aws_apigatewayv2_*, ACM certs, WAF WebACL | 🟡 Mixed: Terraform code in repo + operator applies in AWS |
| #1880 | Terraform: R53 zone + CF NS delegation records | 🟡 Mixed: Terraform code in repo + operator applies (touches live CF zone) |
| #1881 | Terraform: R53 latency ALIAS records | 🟡 Mixed: Terraform code + operator applies after #1882 |
| #1882 | Terraform: 10 aws_route53_health_check resources | 🟡 Mixed: Terraform code + operator applies |
| #1883 | Terraform: R53 failover SECONDARY record | 🟡 Mixed: Terraform code + operator applies after #1884 |
| #1884 | Terraform: S3 bucket, OAC, CloudFront distribution + placeholder HTML | 🟡 Mixed: S3 HTML content codable; CloudFront provisioning operator-action |
| #1885 | Lambda code: /health handler implementation | ✏️ Codable: pure Lambda code; no AWS provisioning needed beyond deploying the function |
| #1886 | Runbook doc: docs/runbooks/burr-v2-migration.md | ✏️ Codable: docs only, no AWS |
| #1887 | Terraform: CloudWatch alarms, SNS, composite alarm, dashboard | 🟡 Mixed: Terraform code + operator applies; note R53 metrics publish only to us-east-1 |
| #1888 | Operator action: update Infisical SSO config | 🔴 AWS/Infisical-only: pure operator execution per runbook |

Summary: 2 ✏️ Codable, 7 🟡 Mixed, 1 🔴 AWS-only

The two purely-codable cards (#1885, #1886) are the right first claims for feature-developer — they unblock the rest and require no AWS credentials.


9. Operator-AWS prerequisites

Regions: us-west-2 (primary) and us-east-1 (replica + R53 metrics + CloudFront ACM). Both regions must be enabled in the operator's AWS account.

IAM roles required (new):

| Role | Purpose | Permissions |
| --- | --- | --- |
| burr-lambda-execution-us-west-2 | Lambda function role | kms:Sign, kms:DescribeKey on MRK primary; ssm:GetParameter on /raxx/burr/us-west-2/*; CloudWatch Logs write |
| burr-lambda-execution-us-east-1 | Lambda function role | kms:Sign, kms:DescribeKey on MRK replica; ssm:GetParameter on /raxx/burr/us-east-1/*; CloudWatch Logs write |
| burr-terraform-deploy | CI/CD Terraform apply role | Scoped to terraform/burr-v2/ resources only per ADR-0082 |

KMS multi-region key: Operator creates primary in us-west-2, adds replica in us-east-1. Key policy restricts kms:Sign to the two Lambda execution roles only.

Cost at v2 steady-state (operator-scale traffic, per ADR-0084 §10):

| Service | Monthly |
| --- | --- |
| API Gateway HTTP API (replaces ALB) | ~$1.00 |
| Lambda provisioned concurrency (2 instances × 2 regions) | ~$3.50 |
| R53 health checks (10 × $0.50) | $5.00 |
| KMS MRK primary + replica | $2.00 |
| CloudFront + S3 Auth Down | <$1.00 |
| CloudWatch alarms + dashboard | <$1.00 |
| Total | ~$13/month |

Note: API Gateway replaces the ALB ($32/month) from the ADR-0084 §10 estimate. Operator decision Q1 (locked 2026-05-12) confirmed API Gateway HTTP API. This brings the steady-state cost down from ~$43/month to ~$13/month.
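
As a quick check of the arithmetic, the line items can be summed directly. This sketch treats the two "<$1.00" rows as upper bounds of $1.00 each, which is an assumption; the dictionary keys are illustrative labels only.

```python
from decimal import Decimal

# Line items with a stated monthly figure
stated = {
    "api_gateway_http_api": Decimal("1.00"),
    "lambda_provisioned_concurrency": Decimal("3.50"),
    "r53_health_checks": 10 * Decimal("0.50"),
    "kms_mrk_primary_and_replica": Decimal("2.00"),
}
# Items listed as "<$1.00", counted at their upper bound
upper_bounds = {
    "cloudfront_s3_auth_down": Decimal("1.00"),
    "cloudwatch_alarms_dashboard": Decimal("1.00"),
}

stated_total = sum(stated.values())
ceiling = stated_total + sum(upper_bounds.values())
alb_two_regions = 2 * Decimal("16.00")  # ~$16/month/region, per Alternative C

print(stated_total, ceiling, alb_two_regions)
```

The stated items come to $11.50 and the ceiling to $13.50, consistent with the ~$13/month total; two regional ALBs at ~$16 each account for the $32/month figure in the note.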

v1 resources decommissioned after 14-day overlap:


10. Decommission v1

After day 21 (14-day overlap cleared), decommission checklist:

Step 1 — Confirm v2 health (rollback gate):
- Verify all 10 R53 health checks HEALTHY.
- Verify curl https://burr.raxx.app/oidc/.well-known/openid-configuration returns HTTP 200 with "issuer":"https://burr.raxx.app/oidc".
- Verify Infisical interactive login completes via Burr v2.
- Rollback: if any check fails, do NOT proceed. Revert Infisical SSO config to v1 values from SSM backup.
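
The discovery-document check in Step 1 could be scripted roughly as below. The function names are hypothetical, and the expectation that the advertised endpoints live under the issuer URL is an assumption drawn from the route list in the topology section, not something the OIDC spec requires.

```python
import json
import urllib.request

EXPECTED_ISSUER = "https://burr.raxx.app/oidc"


def check_discovery(doc: dict, expected_issuer: str = EXPECTED_ISSUER) -> list:
    """Return a list of problems found in a parsed discovery document."""
    problems = []
    if doc.get("issuer") != expected_issuer:
        problems.append(f"issuer is {doc.get('issuer')!r}, expected {expected_issuer!r}")
    # Assumption: v2 serves these endpoints under the issuer path.
    for field in ("authorization_endpoint", "token_endpoint", "jwks_uri"):
        value = doc.get(field, "")
        if not value.startswith(expected_issuer + "/"):
            problems.append(f"{field} is {value!r}, not under {expected_issuer}/")
    return problems


def fetch_discovery(issuer: str = EXPECTED_ISSUER, timeout: float = 5.0) -> dict:
    """Fetch and parse the discovery document; urlopen raises on HTTP errors."""
    url = issuer + "/.well-known/openid-configuration"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

Calling check_discovery(fetch_discovery()) and treating a non-empty result as a failed gate would give the same signal as the curl check, with the endpoint comparison as an extra guard.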

Step 2 — Disable v1 CF Access SaaS application:

terraform -chdir=terraform/modules/sso-oidc-gateway plan -target=cloudflare_access_application.infisical_oidc -var 'enabled=false'
# Review the plan: it should modify only the one app resource
# Pass the same -var to apply; without it, apply re-plans without the change
terraform -chdir=terraform/modules/sso-oidc-gateway apply -target=cloudflare_access_application.infisical_oidc -var 'enabled=false'

Rollback: re-enable the CF Access app (set enabled=true) and revert Infisical SSO config.

Step 3 — Archive v1 SSM secrets (90-day deletion schedule):
- Tag /raxx/cf-access/infisical_oidc_client_secret and /raxx/cf-access/infisical_oidc_client_id with ScheduledDeletion: <date+90days>.
- Do not delete immediately; the 90-day archive provides last-resort rollback evidence.
- Rollback window: restore tag + update Infisical SSO config (v1 credentials still in SSM until scheduled deletion date).
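
A sketch of the tagging in Step 3, assuming a boto3-style SSM client is passed in. The function names and the ISO date format for the tag value are assumptions; add_tags_to_resource with ResourceType="Parameter" is the standard SSM tagging call.

```python
from datetime import date, timedelta

V1_PARAMETERS = (
    "/raxx/cf-access/infisical_oidc_client_secret",
    "/raxx/cf-access/infisical_oidc_client_id",
)


def scheduled_deletion_tag(today: date, retention_days: int = 90) -> dict:
    """Build the ScheduledDeletion tag, dated retention_days after today."""
    delete_on = today + timedelta(days=retention_days)
    return {"Key": "ScheduledDeletion", "Value": delete_on.isoformat()}


def tag_v1_parameters(ssm_client, today: date) -> dict:
    """Tag both v1 parameters; nothing is deleted here."""
    tag = scheduled_deletion_tag(today)
    for name in V1_PARAMETERS:
        ssm_client.add_tags_to_resource(
            ResourceType="Parameter",
            ResourceId=name,
            Tags=[tag],
        )
    return tag
```

The tag only records intent; the actual deletion on the scheduled date remains a separate step.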

Step 4 — Confirm Infisical sessions still working:
- 24 hours post-decommission, verify no Infisical SSO errors in CloudWatch logs or the ops@ inbox.
- Rollback: if Infisical sessions fail, check in the Infisical admin UI whether the v1 CF Access app was still the active issuer (it should not be).


11. Security / GDPR checklist


12. Alternatives considered

Alternative A: Keep Burr v1 permanently (CF Access as OIDC issuer)

Rejected because: CF Access zone availability becomes a single point of failure for all internal tooling SSO. Operator cannot control signing key rotation or perform independent health monitoring. Acceptable for one app at v1 scale; not acceptable post-launch with Grafana and additional internal tools joining.

Alternative B: Migrate to a third-party OIDC SaaS (Auth0, Okta)

Rejected because: Adds a paid third-party dependency for an internal-tooling-only use case. Operator-scale traffic (tens of sessions per day) does not justify the cost or vendor lock-in. Building on Lambda + KMS keeps the attack surface under operator control and avoids SaaS IdP pricing risk.

Alternative C: ALB instead of API Gateway

Considered in ADR-0084 §10 open question Q1. Rejected post-ADR by operator decision (2026-05-12): API Gateway HTTP API costs ~$1/month vs ALB's ~$16/month/region at operator scale. API Gateway supports WAF attachment, custom domain names, and CloudWatch access logging identically to ALB for this use case.


13. Consequences

Positive

Negative / risks

Neutral


14. Open questions (require operator decision before cards can be claimed)

  1. AWS region selection confirmed? ADR-0084 and the 10 cards assume us-west-2 (primary) + us-east-1 (replica). If operator wants a different region pair (e.g., us-east-1 primary + eu-west-1 for future GDPR-sensitive tooling), the KMS MRK provisioning and ACM certificate configuration change. Confirm before #1879 is claimed.

  2. CloudFront price class? Card #1884 hardcodes PriceClass_100 (US + Europe only, cheapest). If operator wants global PoP coverage (PriceClass_All), update before #1884 is claimed. At internal-tooling scale PriceClass_100 is appropriate; surfacing for explicit confirmation.

  3. Monthly cost ceiling / auto-shutdown? No budget alarm is specified in the 10 cards. At ~$13/month there is no cost risk, but confirming whether operator wants a CloudWatch billing alarm (e.g., alert at $25/month) before #1887 is claimed would allow it to be included in that card's scope.

  4. Migration kickoff date? All 9 Terraform + code cards can be claimed immediately. The Infisical cutover (#1888) is the date-sensitive step. Operator should confirm the maintenance window (low-activity period) before #1888 moves to in-progress, per the active-session note in #1886.


15. References


Revisit when