ADR-0068 — Queue: Fail-Closed on Outage (No Credential Cache in Raptor)
Date: 2026-05-09 UTC
Status: Accepted
Deciders: software-architect
Context: docs/architecture/queue/design.md §Failure Modes (Invariant I-9)
Context
If Queue is unavailable (process crash, DB outage), how should Raptor handle incoming requests, both those carrying a valid in-flight JWT and those that need a new or refreshed session?
Option A — Fail-closed: Raptor rejects all unauthenticated requests with 503 when Queue is unreachable. In-flight JWTs that can be verified offline (RS256) continue to be honored until they expire (15 minutes). No new sessions can be created.
Option B — Fail-open (emergency credential cache): Raptor maintains a local credential cache (last-known permission set per customer_id) and falls back to the cache when Queue is down. New requests proceed with cached permissions.
Decision
Option A — Fail-closed.
During a Queue outage:
1. Existing in-flight JWTs (RS256, verifiable offline) continue to work until they expire, at most 15 minutes after issuance. This provides a brief grace window for customers already logged in.
2. New login attempts (POST /api/v1/auth/webauthn/login/complete) fail with 503.
3. Session refresh calls fail with 503.
4. Once the outage has lasted 15 minutes, every JWT issued before it has expired, so all authenticated sessions have ended. Customers see a login prompt.
Raptor does not cache permission sets or credential data locally beyond what is embedded in the JWT.
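The four rules above can be condensed into a single decision function. The following is a minimal sketch, not Raptor's actual code: the names `RequestKind`, `Decision`, `Request`, and `decide` are hypothetical, and it assumes the RS256 signature has already been verified locally so that only the expiry time remains to be checked.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class RequestKind(Enum):
    LOGIN = "login"      # POST /api/v1/auth/webauthn/login/complete
    REFRESH = "refresh"  # session refresh call
    API = "api"          # any other request carrying a JWT


class Decision(Enum):
    FORWARD_TO_QUEUE = "forward_to_queue"  # Queue is up; let it decide
    REJECT_503 = "reject_503"              # Queue is down; fail closed
    ALLOW_OFFLINE = "allow_offline"        # unexpired JWT, verified without Queue
    REQUIRE_LOGIN = "require_login"        # no usable JWT; customer must log in


@dataclass(frozen=True)
class Request:
    kind: RequestKind
    # Expiry (Unix seconds) of a JWT whose signature was already verified.
    # None means the request carries no usable JWT.
    jwt_expires_at: Optional[float] = None


def decide(request: Request, queue_reachable: bool, now: float) -> Decision:
    """Apply the fail-closed rule; there is deliberately no cache fallback."""
    if request.kind in (RequestKind.LOGIN, RequestKind.REFRESH):
        # Logins and refreshes need Queue as the authority.
        if queue_reachable:
            return Decision.FORWARD_TO_QUEUE
        return Decision.REJECT_503
    # Offline path: honor the JWT only until its own expiry.
    if request.jwt_expires_at is not None and now < request.jwt_expires_at:
        return Decision.ALLOW_OFFLINE
    return Decision.REQUIRE_LOGIN
```

For example, during an outage `decide(Request(RequestKind.API, jwt_expires_at=1600.0), False, 1060.0)` returns `Decision.ALLOW_OFFLINE`, while the same request at `now=1601.0` returns `Decision.REQUIRE_LOGIN`. The function has no branch that reads stored permissions, which is the point of the decision.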
Consequences
Positive:
- No stale-credential risk. A revoked session that was in a local cache would grant access after the revocation, which is unacceptable for a platform that handles financial data.
- Simple to reason about. "Queue is the authority; no authority = no access" is an invariant that cannot be accidentally violated by a caching bug.
- The 15-minute JWT TTL provides a soft grace window for customers already logged in during a brief outage (dyno restart, deploy).
Negative:
- A Queue outage lasting more than 15 minutes locks out all customers. This is a hard availability dependency.
- Mitigated by co-location in Phase 1 (Queue goes down only if the entire Raptor dyno crashes), and by Queue's independence in Phase 4 (its own dyno with its own health checks).
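The trade-off between the grace window and the lockout can be made concrete with a little arithmetic. This is an illustrative sketch only: `lockout_seconds` is a hypothetical helper, all times are in seconds, and it assumes the token was issued before the outage began and that no refresh succeeds while Queue is down.

```python
JWT_TTL_SECONDS = 15 * 60  # 15-minute TTL stated in this ADR


def lockout_seconds(token_issued_at: float, outage_start: float, outage_end: float) -> float:
    """Seconds of the outage during which this customer has no valid session."""
    token_expires_at = token_issued_at + JWT_TTL_SECONDS
    # Access is lost once the token has expired AND Queue is down,
    # so the lockout begins at the later of those two moments.
    locked_out_from = max(outage_start, token_expires_at)
    return max(0.0, outage_end - locked_out_from)


# Token issued 5 minutes before a 10-minute outage: it outlives the outage.
short_outage = lockout_seconds(-300.0, 0.0, 600.0)
# Same token, 30-minute outage: locked out once the token's remaining 10 minutes run out.
long_outage = lockout_seconds(-300.0, 0.0, 1800.0)
```

Here `short_outage` is `0.0` and `long_outage` is `1200.0` (20 minutes). A customer whose token is about to expire when the outage begins gets almost no grace, which is why the window is described as soft.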
Alternatives Considered
Option B (fail-open with cache): Rejected on security grounds. A permission cache in Raptor would grant access based on stale data. If a user's account is suspended or a session is force-revoked (e.g., account compromise), the cache could serve the revoked session for the cache TTL. The risk of unauthorized access to financial data outweighs the availability benefit of a cached fallback.
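The exposure that led to rejecting Option B can be sketched as a simple window calculation. This is a hypothetical illustration: no such cache exists in Raptor, `stale_access_seconds` is an invented helper, and the cache TTL used below is an arbitrary example value, not a figure from the design.

```python
def stale_access_seconds(cached_at: float, cache_ttl: float, revoked_at: float) -> float:
    """Seconds after a revocation during which a cached entry would still grant access.

    Assumes the entry was cached before the revocation (cached_at <= revoked_at)
    and that the cache cannot learn of the revocation while Queue is down.
    """
    cache_expires_at = cached_at + cache_ttl
    return max(0.0, cache_expires_at - revoked_at)


# Entry cached at t=0 with a hypothetical 1-hour TTL, session revoked at t=600.
exposure = stale_access_seconds(cached_at=0.0, cache_ttl=3600.0, revoked_at=600.0)
```

With these example numbers `exposure` is `3000.0`, meaning the revoked session would be served from the cache for another 50 minutes. Under Option A there is no cache entry, so this cache-driven window does not arise.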