Raxx · internal docs

ADR-0068 — Queue: Fail-Closed on Outage (No Credential Cache in Raptor)

Date: 2026-05-09 UTC
Status: Accepted
Deciders: software-architect
Context: docs/architecture/queue/design.md §Failure Modes (Invariant I-9)


Context

If Queue is unavailable (process crash, DB outage), what should Raptor do for requests carrying a valid in-flight JWT?

Option A (fail-closed): While Queue is unreachable, Raptor returns 503 for every request that cannot be authenticated without Queue (no valid JWT, new login, session refresh). In-flight JWTs are RS256-signed and can be verified offline, so they continue to be honored until they expire (15-minute TTL). No new sessions can be created.

Option B (fail-open, emergency credential cache): Raptor maintains a local credential cache (last-known permission set per customer_id) and falls back to it when Queue is down. New requests proceed with the cached permissions.


Decision

Option A — Fail-closed.

During a Queue outage:

1. Existing in-flight JWTs (RS256, verifiable offline) continue to work for up to 15 minutes. This gives customers who are already logged in a brief grace window.
2. New login attempts (POST /api/v1/auth/webauthn/login/complete) fail with 503.
3. Session refresh calls fail with 503.
4. A JWT cannot be renewed once its 15-minute TTL elapses, so within 15 minutes of the outage starting every authenticated session has expired. Customers see a login prompt.
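The outage behavior above can be sketched as a single decision function. This is a minimal illustration, not Raptor's actual code: `Kind`, `Jwt`, and `decide_status` are hypothetical names, RS256 verification is assumed to have happened elsewhere and is reduced to a boolean, and the 401 returned when Queue is reachable but no usable JWT is present is an assumption the ADR does not state.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    API_CALL = "api_call"  # ordinary request to a protected endpoint
    LOGIN = "login"        # e.g. POST /api/v1/auth/webauthn/login/complete
    REFRESH = "refresh"    # session refresh


@dataclass(frozen=True)
class Jwt:
    signature_ok: bool  # outcome of offline RS256 verification (assumed done elsewhere)
    expires_at: float   # Unix timestamp taken from the `exp` claim


def decide_status(kind, token, queue_reachable, now):
    """Return the HTTP status a fail-closed Raptor would answer with."""
    # Logins and refreshes need Queue, so they fail closed without it.
    # 200 here stands in for "handed to Queue"; the real outcome is Queue's.
    if kind in (Kind.LOGIN, Kind.REFRESH):
        return 200 if queue_reachable else 503
    # Ordinary calls: a correctly signed, unexpired JWT is honored offline.
    if token is not None and token.signature_ok and now < token.expires_at:
        return 200
    # No usable JWT. Re-authentication needs Queue, so an outage yields 503.
    # The 401 for the healthy case is an assumption, not part of this ADR.
    return 401 if queue_reachable else 503
```

With Queue down, a request carrying a JWT that expires at t = 1000 is served at t = 999 and rejected with 503 from t = 1000 onward, which is the point at which the customer sees a login prompt.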

Raptor does not cache permission sets or credential data locally beyond what is embedded in the JWT.


Consequences

Positive:

- No stale-credential risk from a local cache. A revoked session still held in a cache would keep granting access after the revocation, which is unacceptable for a platform that handles financial data.
- Simple to reason about. "Queue is the authority; no authority = no access" is an invariant that cannot be accidentally violated by a caching bug.
- The 15-minute JWT TTL provides a soft grace window for customers already logged in during a brief outage (dyno restart, deploy).

Negative:

- A Queue outage lasting more than 15 minutes locks out all customers. This is a hard availability dependency.
- Mitigated by co-location in Phase 1 (Queue goes down only if the entire Raptor dyno crashes), and by Queue's independence in Phase 4 (its own dyno with its own health checks).
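The grace window is per token, not per outage: a customer whose JWT was issued shortly before the outage gets nearly the full 15 minutes, while one whose JWT is almost expired gets very little. A small worked sketch, assuming JWTs are issued before the outage starts and cannot be refreshed during it (the function names are hypothetical):

```python
JWT_TTL_SECONDS = 15 * 60  # 15-minute TTL stated in this ADR


def remaining_grace(issued_at, outage_start):
    """Seconds of access a JWT still has at the moment the outage begins."""
    age = outage_start - issued_at
    return max(0.0, JWT_TTL_SECONDS - age)


def locked_out(issued_at, outage_start, outage_duration):
    """True if the JWT expires before Queue is reachable again."""
    return outage_duration > remaining_grace(issued_at, outage_start)
```

For a JWT issued 10 minutes before the outage, `remaining_grace(0, 600)` is 300 seconds, so a 4-minute outage goes unnoticed by that customer while a 6-minute outage ends in a login prompt that cannot succeed until Queue returns. Only an outage longer than the full 900 seconds is guaranteed to lock out everyone.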


Alternatives Considered

Option B (fail-open with cache): Rejected on security grounds. A permission cache in Raptor would grant access based on stale data. If a user's account is suspended or a session is force-revoked (e.g., account compromise), the cache could serve the revoked session for the cache TTL. The risk of unauthorized access to financial data outweighs the availability benefit of a cached fallback.
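The stale-data risk can be made concrete with a toy model. This sketch is illustrative only: the `Queue` class, the dict-based cache, and both `authorize_*` functions are hypothetical stand-ins, it ignores JWTs and cache TTLs entirely, and it models a suspension as Queue clearing the customer's permission set.

```python
class Queue:
    """Stand-in for the authority; holds the live permission sets."""

    def __init__(self):
        self.up = True
        self.permissions = {}

    def lookup(self, customer_id):
        if not self.up:
            raise ConnectionError("Queue unreachable")
        return self.permissions.get(customer_id, frozenset())


def authorize_fail_open(queue, cache, customer_id, action):
    """Rejected Option B: fall back to last-known permissions."""
    try:
        perms = queue.lookup(customer_id)
        cache[customer_id] = perms  # refresh the local copy
    except ConnectionError:
        perms = cache.get(customer_id, frozenset())  # possibly stale
    return action in perms


def authorize_fail_closed(queue, customer_id, action):
    """Option A for requests that need Queue: no authority, no access."""
    try:
        return action in queue.lookup(customer_id)
    except ConnectionError:
        return False
```

If a customer is granted `"transfer"`, makes one request (which populates the cache), is then suspended in Queue, and Queue goes down before Raptor's next lookup, `authorize_fail_open` still returns `True` for `"transfer"` while `authorize_fail_closed` returns `False`. That gap is the unauthorized-access window this ADR declines to accept.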