Raxx · internal docs


ADR-0084 — Burr v2: Multi-Region OIDC Gateway with R53 Latency Routing + Auth Down Failover

Status: Accepted
Date: 2026-05-12 UTC
Deciders: Kristerpher (operator)
Scope: Post-launch Burr v2 architecture (target: post 2026-05-23)
Parent card: #1867
Refs: ADR-0082 (Terraform pipeline pattern), ADR-0083 (Infisical Google OIDC SSO via CF Access), #1859, #1864, #1866, #1868 (Auth Down UX), #1869 (security audit), terraform/modules/sso-oidc-gateway/


1. Context

What Burr v1 is

Burr v1 (PR #1866, ADR-0083) turns Cloudflare Access into an OIDC provider for downstream internal apps. CF Access, backed by Google Workspace, issues signed JWTs. Downstream apps — Infisical first, Grafana and future internal tools next — receive identity assertions without needing their own Google OAuth client registrations.

The v1 module (terraform/modules/sso-oidc-gateway/) is a Terraform-managed CF Zero Trust SaaS application (type = saas, auth_type = oidc) that exposes CF-standard OIDC endpoints:

issuer:                 https://<account_id>.cloudflareaccess.com
authorization_endpoint: .../sso/oidc/<client_id>/authorization
token_endpoint:         .../sso/oidc/<client_id>/token
userinfo_endpoint:      .../sso/oidc/<client_id>/userinfo
jwks_uri:               .../cdn-cgi/access/certs

What v1 does not give us

Burr v1 sits entirely inside Cloudflare's global network. It does not have a regional deployment; there is no operator-controlled compute that can be replicated. A CF Access zone outage — partial or full — takes down all downstream SSO. For a single Infisical consumer, that is an acceptable v1 trade-off. Post-launch, as the number of downstream apps grows, the single zone becomes a hard dependency on CF's reliability posture.

The specific failure modes v1 does not address:

Why now (design only — not yet implementation)

v1 ships 2026-05-12 as part of the pre-launch sequence. v2 is post-launch work (post-2026-05-23). The design is authored now so implementation sub-cards are ready to claim after launch. The ADR establishes the architecture contract; implementation sub-cards will reference it.


2. Invariants

The following invariants from the platform's non-negotiable constraints apply to this design:


3. Architecture decision summary

Deploy Burr v2 as a self-owned OIDC issuer running on AWS Lambda (Node.js or Python, choice deferred to implementer) in both us-west-2 and us-east-1. Each regional deployment is an independent compute unit with its own ALB. KMS multi-region keys provide OIDC signing key continuity across regions with a single logical key identity. Route53 latency-based routing with per-region health checks provides active-active serving when both regions are healthy and automatic failover to the surviving region when one degrades. A CloudFront + S3 static Auth Down page is the tertiary failover target when both regions are simultaneously unhealthy.

CF Access continues to serve as the upstream identity provider (Google Workspace IdP) — Burr v2 replaces CF Access as the OIDC issuer seen by downstream apps, while still delegating the upstream human authentication step to CF Access and Google.


4. Per-region stack

Compute: Lambda behind ALB (not Fargate)

Lambda is chosen over Fargate for the following reasons:

Regional stack per region (us-west-2, us-east-1):

Internet Gateway / ALB (HTTPS:443, TLS terminated)
  |
  +-- Lambda: burr-oidc-handler
        |
        +-- /oidc/.well-known/openid-configuration  (GET, unauthenticated)
        +-- /oidc/.well-known/jwks.json             (GET, unauthenticated)
        +-- /oidc/authorize                         (GET, initiates CF Access redirect)
        +-- /oidc/token                             (POST, exchanges CF Access code for Burr JWT)
        +-- /oidc/userinfo                          (GET, bearer token required)
        +-- /health                                 (GET, used by R53 health check)

The ALB listener rules map path prefixes to the Lambda function. The Lambda uses provisioned concurrency (minimum 1 instance) so the health-check endpoint never cold-starts.
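A minimal sketch of how a single Lambda might dispatch the paths listed above, assuming the standard ALB-to-Lambda event shape (httpMethod, path) and response shape (statusCode, headers, body, isBase64Encoded). The handler names, the route table, and the placeholder /health body are hypothetical; only two routes are filled in, and the issuer value is taken from the migration section (§9).

```python
import json

ISSUER = "https://burr.raxx.app/oidc"  # issuer value from section 9


def _json(status, payload):
    """Build a response in the shape an ALB expects from a Lambda target."""
    return {
        "statusCode": status,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(payload),
        "isBase64Encoded": False,
    }


def discovery(event):
    """Discovery document; endpoint paths mirror the route list above."""
    return _json(200, {
        "issuer": ISSUER,
        "authorization_endpoint": f"{ISSUER}/authorize",
        "token_endpoint": f"{ISSUER}/token",
        "userinfo_endpoint": f"{ISSUER}/userinfo",
        "jwks_uri": f"{ISSUER}/.well-known/jwks.json",
    })


def health(event):
    """Placeholder: the real endpoint would run its dependency probes here."""
    return _json(200, {"status": "ok"})


# (method, path) -> handler. The remaining OIDC routes are omitted.
ROUTES = {
    ("GET", "/oidc/.well-known/openid-configuration"): discovery,
    ("GET", "/health"): health,
}


def handler(event, context=None):
    """Lambda entry point: look up the route, return 404 when unknown."""
    route = ROUTES.get((event.get("httpMethod"), event.get("path")))
    if route is None:
        return _json(404, {"error": "not_found"})
    return route(event)
```

Keeping the route table as data makes it straightforward to check that every path in the ALB listener rules has a matching handler.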

ALB configuration

KMS multi-region key strategy

OIDC JWTs must be signed with a stable key whose public component is discoverable via the JWKS endpoint. The key identifier (kid) in the JWKS must be consistent across both regions so that tokens issued by either region can be verified against either JWKS endpoint.

Decision: KMS multi-region key (primary in us-west-2, replica in us-east-1).

A KMS multi-region key is a single logical key (mrk-*) whose key material is replicated by KMS across specified regions. Both regional Lambda instances call kms:Sign against their region-local replica. Both replicas share the same key material and therefore produce JWTs that verify against the same public key. The JWKS endpoint in both regions returns the same public key material with the same kid.
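The following sketch illustrates why a shared kid and shared key material make the issuing region irrelevant to verifiers. It assembles a compact JWT around an injected signing callable. The RS256 algorithm, the function names, and the example kid are assumptions for illustration; the ADR does not fix a key spec. In a real deployment the callable would presumably wrap the region-local boto3 call kms.sign(KeyId=..., Message=..., MessageType="RAW", SigningAlgorithm="RSASSA_PKCS1_V1_5_SHA_256") and return its Signature bytes.

```python
import base64
import hashlib
import json


def b64url(data: bytes) -> str:
    """Base64url without padding, as used by JWS (RFC 7515)."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def signing_input(kid: str, claims: dict) -> bytes:
    """Header and payload segments joined by '.', ready to be signed."""
    header = {"alg": "RS256", "typ": "JWT", "kid": kid}  # alg is an assumption
    segments = [
        b64url(json.dumps(header, separators=(",", ":"), sort_keys=True).encode()),
        b64url(json.dumps(claims, separators=(",", ":"), sort_keys=True).encode()),
    ]
    return ".".join(segments).encode("ascii")


def issue_jwt(kid: str, claims: dict, sign) -> str:
    """`sign` takes the signing input bytes and returns raw signature bytes.

    Each region passes a callable bound to its local key replica; the
    token layout and kid are identical in both regions.
    """
    message = signing_input(kid, claims)
    return message.decode("ascii") + "." + b64url(sign(message))


# Stand-in signer for local experimentation only (not a real signature).
def fake_sign(message: bytes) -> bytes:
    return hashlib.sha256(message).digest()
```

Injecting the signer keeps the JWT assembly testable without AWS credentials and isolates the only region-specific piece, the KMS client, behind one callable.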

Key lifecycle:

Secret distribution:


5. R53 hosted zone design

Decision: delegated Route53 zone for burr.raxx.app, not records inside the Cloudflare-managed raxx.app zone

Rationale:

raxx.app DNS is managed by Cloudflare (nameservers delegated to CF). Creating a separate delegated zone for burr.raxx.app would require a Route53 hosted zone whose NS records are added to the raxx.app CF zone as a delegation. This is achievable but adds a layer: any CF zone incident could also affect the NS delegation lookup.

The alternative — keeping records inside the Cloudflare-managed raxx.app zone — means CF is in the DNS resolution path for burr.raxx.app. If CF has a zone outage, DNS for the OIDC endpoints fails even though the ALBs are healthy.

Chosen approach: Create a Route53 public hosted zone for burr.raxx.app and add NS delegation records in the Cloudflare raxx.app zone pointing to R53. This decouples the OIDC endpoint DNS from CF's DNS availability. During a CF zone outage, R53 continues to serve the burr.raxx.app zone independently, and resolvers that already hold the NS delegation in cache keep resolving burr.raxx.app. Only resolvers whose cached delegation has expired are exposed to the residual delegation-lookup risk noted above.

The raxx.app apex and all other subdomains remain on Cloudflare. Only the burr.raxx.app subtree delegates to R53.

Zone structure:

raxx.app (CF-managed)
  burr.raxx.app   NS → R53 hosted zone (4 NS records from R53)

R53 hosted zone: burr.raxx.app
  burr.raxx.app                  LATENCY   → ALB alias per region (active-active)
  us-west-2.burr.raxx.app        A/ALIAS   → us-west-2 ALB
  us-east-1.burr.raxx.app        A/ALIAS   → us-east-1 ALB
  burr.raxx.app (failover)       ALIAS     → CloudFront Auth Down distribution

- TTL: All latency-based records use TTL=60. The failover record TTL is irrelevant (it is an R53 Alias record and inherits the target's TTL behavior).
- Health check poll interval: 10 seconds.
- Failure threshold: 3 consecutive failures → region marked unhealthy.
- Expected failover time: 30–60 seconds from regional degradation to R53 stopping traffic to that region.
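As a rough worked example of these timing parameters: detection takes the poll interval multiplied by the failure threshold, and a client that cached a DNS answer just before detection may keep using it for up to one further TTL. This is a simplified model, assuming failures are observed on consecutive polls and ignoring probe timeouts and health-check propagation delay; the function names are hypothetical.

```python
HC_INTERVAL_S = 10      # health check poll interval
FAILURE_THRESHOLD = 3   # consecutive failures before a region is unhealthy
RECORD_TTL_S = 60       # TTL on the latency-based records


def detection_seconds(interval_s=HC_INTERVAL_S, threshold=FAILURE_THRESHOLD):
    """Time for R53 to mark a region unhealthy in this simplified model."""
    return interval_s * threshold


def worst_case_client_seconds(interval_s=HC_INTERVAL_S,
                              threshold=FAILURE_THRESHOLD,
                              ttl_s=RECORD_TTL_S):
    """Upper bound for a client that cached the old answer just before detection."""
    return detection_seconds(interval_s, threshold) + ttl_s


print(detection_seconds())          # 30
print(worst_case_client_seconds())  # 90
```

Under these assumptions R53 stops answering with the degraded region after about 30 seconds, while individual clients may take up to roughly 90 seconds to move, depending on when their resolver last refreshed the record.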


6. Health checks

R53 health checks are defined per region. Each region requires all of the following checks to pass to be considered healthy. R53 health checks are composed using a calculated health check (AND logic) so that a single endpoint failure marks the region unhealthy.

HC-1: OIDC discovery document

HC-2: JWKS endpoint

HC-3: Token endpoint rejects malformed input without 5xx

HC-4: Lambda /health endpoint

HC-5: Calculated health check (the R53 gate)

R53 calculated health check (AND logic) across HC-1, HC-2, HC-3, HC-4. Only when all four pass is the region considered healthy by R53. A single sub-check failure routes traffic to the other region.

Duplicate the above four health checks for us-east-1.burr.raxx.app. Total: 8 health checks + 2 calculated checks = 10 health check resources.
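The AND gate and the resource count can be modelled in a few lines. This is an illustrative sketch, not the Route53 API: the names are hypothetical, and treating a missing sub-check result as failing is an assumption. In Route53 itself, AND semantics on a calculated health check are typically obtained by setting its health threshold equal to the number of child checks.

```python
SUB_CHECKS = ("HC-1", "HC-2", "HC-3", "HC-4")
REGIONS = ("us-west-2", "us-east-1")


def region_healthy(results: dict) -> bool:
    """AND across the four sub-checks; a missing result counts as failing."""
    return all(results.get(name, False) for name in SUB_CHECKS)


def resource_count() -> int:
    """Endpoint checks per region plus one calculated check per region."""
    return len(REGIONS) * len(SUB_CHECKS) + len(REGIONS)


all_green = {name: True for name in SUB_CHECKS}
one_red = dict(all_green, **{"HC-3": False})

print(region_healthy(all_green))  # True
print(region_healthy(one_red))    # False
print(resource_count())           # 10
```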


7. Failover logic

Scenario A: one region unhealthy

R53 latency-based routing stops returning the unhealthy region's ALB alias in new DNS responses. The surviving region absorbs 100% of new requests. Authorization flows in flight at the failed region are lost; because the OIDC endpoints are stateless, clients simply re-authenticate via the surviving region. No operator action required.

Scenario B: both regions unhealthy (calculated health checks both failing)

R53 routes to the tertiary failover record: a CloudFront distribution serving the static Auth Down page. The failover record is an R53 Alias pointing to the CloudFront distribution, with routing policy = failover, record type = secondary. The primary records (latency-based, one per region) are the primary failover group. When both primaries are unhealthy, R53 falls back to the secondary.

Downstream apps (Infisical, etc.) will receive a non-OIDC response (HTML static page) and will surface an error to the operator. This is the intended degraded-state UX — the static page instructs the operator to check status.raxx.app.
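Scenarios A and B can be summarised as one selection rule: answer with the lowest-latency healthy region, and fall back to Auth Down only when no region is healthy. The sketch below models that rule in plain Python. It is an illustration of the intended behaviour, not of how Route53 evaluates records internally; the function name, the "auth-down" label, and the latency figures are hypothetical.

```python
AUTH_DOWN = "auth-down"  # stands in for the CloudFront failover alias


def resolve_target(healthy: dict, latency_ms: dict) -> str:
    """Pick the lowest-latency healthy region, else the Auth Down target.

    healthy:    region -> result of that region's calculated health check
    latency_ms: region -> latency from the querying resolver's vantage point
    """
    candidates = [region for region, ok in healthy.items() if ok]
    if not candidates:
        return AUTH_DOWN  # Scenario B: both regions unhealthy
    return min(candidates, key=lambda region: latency_ms[region])


latency = {"us-west-2": 20, "us-east-1": 70}  # made-up numbers

print(resolve_target({"us-west-2": True, "us-east-1": True}, latency))    # us-west-2
print(resolve_target({"us-west-2": False, "us-east-1": True}, latency))   # us-east-1
print(resolve_target({"us-west-2": False, "us-east-1": False}, latency))  # auth-down
```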

Scenario C: CF Access zone outage (not a regional Burr outage)

Burr v2's Lambda and ALB remain healthy. R53 health checks pass. However, the authorization step of the OIDC flow depends on CF Access (cloudflareaccess.com), which is unavailable. Burr's /oidc/authorize endpoint returns a 302 to CF Access; with CF Access down, the browser receives a timeout or CF error page, no authorization code is issued, and the token exchange never completes.

Mitigation: Burr v2 can detect CF Access unreachability during the /health check's Google OIDC upstream probe (since CF Access proxies through Google). The health check marks the region unhealthy if the upstream probe fails. This causes R53 to route to the other region — but if CF Access is globally down, both regions will fail HC-4 and traffic falls to Auth Down. This is the correct behavior: if the upstream identity provider is unavailable, Burr cannot issue tokens regardless of compute health.


8. Auth Down integration

Static page delivery

The Auth Down page is a CloudFront distribution backed by an S3 bucket (raxx-auth-down-static). No React, no JavaScript framework, no server-side rendering. The page must be servable when the entire backend stack is down.

S3 bucket policy: read access is granted only for the specific object paths (/auth-down/index.html, /auth-down/styles.css), and only to the CloudFront distribution via Origin Access Control (OAC). The bucket itself has no public access.

CloudFront configuration:

Page content (copy):

Raxx — Authentication temporarily unavailable

We're working on it.

Current status and updates: status.raxx.app

If you need immediate assistance: support@raxx.app

No session state, no cookies, no JavaScript. Static HTML + minimal inline CSS only.

Cache-Control header rationale: without no-store, the CloudFront default TTL would serve a cached Auth Down page to returning users even after recovery. Setting no-store prevents all caching layers from retaining the page, so the first request after recovery, once the client's DNS answer has refreshed, goes through the live R53 record and resolves to the healthy Burr region.


9. OIDC client migration path

Current state (Burr v1)

Infisical is configured with:

- issuer: https://<CF_ACCOUNT_ID>.cloudflareaccess.com
- client_id: CF Access SaaS application client ID (output from terraform/modules/sso-oidc-gateway/)
- client_secret: CF Access SaaS application client secret (in SSM at /raxx/cf-access/infisical_oidc_client_secret)
- jwks_uri: https://<CF_ACCOUNT_ID>.cloudflareaccess.com/cdn-cgi/access/certs

Migration to Burr v2

When Burr v2 is deployed:

  1. Burr v2 is deployed and its health checks pass in both regions before any client migration.
  2. A new OIDC client registration for Infisical is created in Burr v2 (SSM: /raxx/burr/clients/infisical/client_secret). Burr v2 issues a new client_id and client_secret.
  3. Infisical's SSO config is updated with:
     - issuer: https://burr.raxx.app/oidc
     - client_id: the new Burr v2 client ID
     - client_secret: from Burr v2 registration
     - jwks_uri: https://burr.raxx.app/oidc/.well-known/jwks.json
  4. The old CF Access SaaS application (Burr v1 Infisical instance) is kept alive for 7 days as a rollback path, then decommissioned.
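Before step 3, it may be worth checking that the discovery document served by Burr v2 agrees with the values about to be entered into Infisical. The sketch below is one possible pre-migration check. The function name is hypothetical, and the rule that every endpoint must live under the issuer URL reflects this design's URL layout rather than an OIDC requirement.

```python
ENDPOINT_FIELDS = (
    "authorization_endpoint",
    "token_endpoint",
    "userinfo_endpoint",
    "jwks_uri",
)


def discovery_problems(doc: dict, expected_issuer: str) -> list:
    """Return human-readable problems; an empty list means the doc looks consistent."""
    problems = []
    if doc.get("issuer") != expected_issuer:
        problems.append("issuer mismatch")
    for field in ENDPOINT_FIELDS:
        value = doc.get(field, "")
        # Layout assumption: all Burr v2 endpoints sit under the issuer path.
        if not value.startswith(expected_issuer + "/"):
            problems.append(f"{field} is not under the issuer")
    return problems


issuer = "https://burr.raxx.app/oidc"
doc = {
    "issuer": issuer,
    "authorization_endpoint": issuer + "/authorize",
    "token_endpoint": issuer + "/token",
    "userinfo_endpoint": issuer + "/userinfo",
    "jwks_uri": issuer + "/.well-known/jwks.json",
}
print(discovery_problems(doc, issuer))  # []
```

The document would be fetched from /oidc/.well-known/openid-configuration on each per-region hostname, so that both regions are confirmed to advertise the same issuer before any client is pointed at them.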

Rollback if Burr v2 fails post-migration

  1. Revert Infisical SSO config to Burr v1 values (v1 issuer, client_id, client_secret from SSM).
  2. All active Infisical sessions using Burr v2 tokens will expire and require re-authentication. Sessions are short-lived (24h max per CF Access policy); worst case the operator waits up to 24h for full rollback without forced session invalidation, or triggers manual session invalidation in Infisical.
  3. Burr v2 can be left running in parallel (zero downstream clients) while the issue is diagnosed.

Migration for future downstream apps

Each new downstream app (Grafana, future internal tools) follows the same pattern: register a new client in Burr v2 first, configure the app to use Burr v2 OIDC, validate, then decommission the corresponding Burr v1 CF Access SaaS application.


10. Cost estimate

Assumptions: Lambda provisioned concurrency = 1 per region. Lambda duration: 50ms average per request. Memory: 256 MB.

v1 scale: one OIDC client (Infisical only)

10x scale: 10 downstream OIDC clients

Note: The ALB cost ($32/month) dominates at small scale. If cost is a concern, API Gateway (HTTP API) can replace ALB at ~$1/month at this traffic level. This is flagged as an open question (§13) for implementer decision.


11. Risks and mitigations

| Risk | Blast radius | Mitigation |
|---|---|---|
| KMS multi-region key replication lag | Token signing fails in replica region during replication event | KMS replication is synchronous for Sign operations; replica key is always available once provisioned. Health check HC-4 catches KMS unavailability before R53 routes to the region. |
| Lambda cold start on health check | HC-3 (token POST) gets a 5xx during cold start, triggering false-positive failover | Provisioned concurrency eliminates cold starts on the health-check path. |
| R53 health check DNS resolution failure | Health check probes cannot reach regional ALB DNS | Health checks use IP-address-based checks against the ALB, not DNS lookups, wherever R53 supports it. ALB IPs are stable (ALB nodes). |
| Auth Down page served stale after recovery | Clients see Auth Down for minutes after Burr recovers | Cache-Control: no-store on CloudFront distribution; no CDN caching of the failover page. R53 TTL=60 means clients refresh DNS within 60 seconds of recovery. |
| NS delegation in CF zone becomes stale | CF zone change removes or corrupts the burr.raxx.app NS delegation | NS delegation is Terraform-managed (cf-access root or a new burr-dns root). Protected by branch protection + plan-before-apply pipeline (ADR-0082). |
| OIDC client secret rotation disrupts downstream app | Client secret rotation invalidates all issued tokens mid-session | Rotation follows a mint-new → configure-downstream → revoke-old pattern (same model as Velvet ADR-0037). Old secret remains valid for 7 days during rotation window. |
| Burr v2 issues token with longer lifetime than v1 | Downstream app accepts overly-long-lived tokens, reducing revocability | Token lifetime is a config parameter per client registration. Default: 1 hour access token, 24 hour session. Mirroring CF Access session_duration from v1. |
| Single AWS account means no account-level isolation | A compromise of one region's IAM role could affect both | IAM roles are scoped per-region and per-function. KMS key policy restricts Sign permission to the Lambda execution role only. |

12. Out of scope for this ADR


13. Open questions

  1. ALB vs API Gateway. ALB at $16/month/region ($32 total) is the dominant cost at small scale. API Gateway HTTP API would cost ~$0.50/month at this traffic level. Trade-off: API Gateway puts a different managed AWS service in the request path; ALB gives lower latency and more granular WAF integration. Implementer should confirm cost posture with operator before provisioning ALBs.

  2. VPC or not. Lambda can run in a VPC (for KMS + SSM private endpoint access) or without a VPC (uses public KMS/SSM endpoints, faster cold starts). Running in a VPC adds NAT Gateway cost (~$32/month/region if NAT GW is needed for outbound). Running without a VPC is simpler and has the same security properties if security groups and endpoint policies are correct. Decision needed before implementation.

  3. R53 health check IP vs hostname. R53 HTTPS health checks against an ALB hostname vs IP address have different behavior. IP-based checks bypass DNS but require the ALB's TLS certificate to match the hostname in the Host header. Confirm R53 supports hostname-in-header override for HTTPS health checks (it does; host field on the health check resource), and that the ALB certificate covers the per-region subdomain.

  4. CF Access upstream probe in HC-4. The /health endpoint's Google OIDC discovery check adds latency to the health check response. If Google returns a slow response, the R53 health check may time out and trigger false-positive failover. Consider caching the upstream probe result in Lambda memory (5-minute TTL) and only failing HC-4 if the cached result indicates Google has been unreachable for >5 minutes. Implementer decision.

  5. Burr v1 decommission timeline. After v2 migration, the CF Access SaaS application (Burr v1) should be decommissioned. A 7-day overlap window is proposed above. Confirm with operator whether a longer overlap is preferred for the initial Infisical migration.


14. Sequence diagram

sequenceDiagram
    participant Client as Downstream App<br/>(Infisical)
    participant R53 as Route53<br/>burr.raxx.app
    participant BurrW as Burr Lambda<br/>us-west-2
    participant BurrE as Burr Lambda<br/>us-east-1
    participant AuthDown as CloudFront<br/>Auth Down
    participant CF as Cloudflare Access<br/>(upstream IdP)
    participant KMS as KMS<br/>(MRK Sign)

    Note over R53: Healthy: both regions passing HCs
    Client->>R53: DNS lookup burr.raxx.app
    R53-->>Client: ALB alias (latency: us-west-2 wins)
    Client->>BurrW: GET /oidc/.well-known/openid-configuration
    BurrW-->>Client: 200 + discovery doc

    Client->>BurrW: GET /oidc/authorize?...
    BurrW-->>Client: 302 → CF Access authorization
    Client->>CF: Google Workspace login
    CF-->>Client: 302 → BurrW /oidc/token?code=...
    Client->>BurrW: POST /oidc/token (code exchange)
    BurrW->>KMS: Sign(JWT payload, mrk-key)
    KMS-->>BurrW: Signed JWT
    BurrW-->>Client: access_token + id_token

    Note over R53,BurrW: us-west-2 HC fails
    R53->>R53: Stop routing to us-west-2
    Client->>R53: DNS lookup burr.raxx.app
    R53-->>Client: ALB alias (us-east-1 only)
    Client->>BurrE: POST /oidc/token
    BurrE->>KMS: Sign(JWT payload, mrk-key replica)
    KMS-->>BurrE: Signed JWT (same key material)
    BurrE-->>Client: access_token + id_token

    Note over R53,BurrE: Both regions HCs fail
    R53->>R53: Route to failover record
    Client->>R53: DNS lookup burr.raxx.app
    R53-->>Client: CloudFront alias (Auth Down)
    Client->>AuthDown: GET /
    AuthDown-->>Client: 200 Static Auth Down page

15. Cross-references


16. Rollout plan

| Phase | Gate | Description |
|---|---|---|
| Dark | Post-2026-05-23 launch | Implementation sub-cards filed and groomed; no infra changes |
| Provision | Sub-card gate | KMS MRK, ALBs, Lambda functions deployed in both regions; no DNS changes; internal smoke test via per-region hostnames (us-west-2.burr.raxx.app, us-east-1.burr.raxx.app) |
| Health-check validation | All 10 HCs green for 48h | R53 health check resources created; monitor but do not route public traffic yet |
| Soft cutover | Operator confirmation | NS delegation added to CF raxx.app zone; burr.raxx.app begins serving from R53; Burr v1 CF Access OIDC app kept alive |
| Infisical migration | Operator action | Update Infisical SSO config to Burr v2 endpoints; validate interactive login |
| v1 decommission | 7-day overlap | Remove Burr v1 CF Access SaaS application for Infisical; keep module for other CF-Access-as-issuer use cases |