Raxx · internal docs

internal · gated

JWT Access-Token TTL + Refresh-Token + Revocation Research

Queue Identity Service — Security Architecture Decision

Status: Research complete; values locked for ADR-0067 amendment
Date: 2026-05-09 UTC
Author: security-agent
Operator working values heading in: AT TTL = 15 min, RT TTL = 30 days
Final recommendations: AT TTL = 15 min (confirmed), RT TTL = 7 days (counter-recommend), rotate on every use, revocation = wait-for-expiry + targeted session-id blocklist


1. Access-Token TTL Recommendation

Locked value: 15 minutes

Industry survey

Provider / Standard | Access Token TTL | Notes
Auth0 (default) | 86,400 s (24 h) | Default is intentionally generous; their own best-practice docs say "keep short"; most fintech implementors override to 15–60 min
AWS Cognito (default) | 3,600 s (1 h) | Configurable; Cognito added refresh-token rotation April 2025
Okta (default) | 3,600 s (1 h) | SPA-specific guidance says reduce TTL; fintech clients typically set 15–30 min
Schwab Trader API (OAuth 2.0) | 1,800 s (30 min) | Refresh token TTL = 7 days with mandatory manual re-auth thereafter; public developer docs confirm 30-minute access token
OWASP JWT Cheat Sheet | 15–30 min | "Short expiration times, 15–30 min idle timeout, 8 h absolute timeout" (direct recommendation)
IETF RFC 9700 (OAuth 2.0 BCP) | No specific number | "Access tokens SHOULD be audience-restricted and privilege-minimized"; shorter = lower blast radius on theft
Industry consensus (Auth0, Snyk, Curity, DevToolKit security surveys 2024–2026) | 5–15 min | Cited convergence point for high-value APIs

Dimension analysis

Revocation lag. With offline JWT verification and no blocklist, a revoked access token remains valid until exp. At 15 minutes, worst-case revocation lag is 14:59. At 5 minutes it is 4:59; at 30 minutes it is 29:59. The 15-minute value is the accepted industry midpoint between usability (client-visible refresh churn) and revocation lag. For Raxx's threat model — a customer-facing trading platform where session revocation is an incident-response action, not a routine operation — 15 minutes is appropriate. If an anomaly system triggers automated session kill (see scenario 5), 15 minutes is the upper bound on exposure.

Refresh-token churn. A client that receives a 15-minute access token will call Queue's refresh endpoint once every 15 minutes during an active session. At v1 scale (< 1,000 concurrent sessions), this is negligible load. At 5-minute TTL, churn triples — meaningful only if Queue becomes a standalone service under load, which is post-Phase 4.
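The churn claim can be checked with simple arithmetic. The sketch below assumes every concurrent session refreshes exactly once per AT TTL and that refreshes are spread evenly over time; the function name is illustrative and not part of Queue.

```python
def refresh_requests_per_second(concurrent_sessions: int, at_ttl_minutes: float) -> float:
    """Average refresh calls per second if each session refreshes once per AT TTL."""
    return concurrent_sessions / (at_ttl_minutes * 60)


# 1,000 sessions is the v1 upper bound quoted in the text.
for ttl in (5, 15, 30):
    rps = refresh_requests_per_second(1_000, ttl)
    print(f"AT TTL {ttl:>2} min -> {rps:.2f} refresh req/s")
```

At 1,000 sessions the 15-minute TTL works out to roughly one refresh per second, and the 5-minute TTL to three times that.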

User experience. Silent background refresh by the client library (Antlers calls the refresh endpoint proactively before exp) means users never see an auth interruption unless the refresh token itself has expired or been revoked. Access token TTL is invisible to users when implemented correctly.

Token theft window. If an attacker exfiltrates an access token from an XSS payload, memory dump, or transport interception, they have at most 15 minutes of valid use. Because tokens are scoped to a specific aud (Raptor), they cannot be replayed against other services; sub + sid tie each token to a single customer and session.

Token theft window: comparison. At a 30-minute TTL (Schwab's value), the theft window doubles. At 60 minutes it quadruples and approaches the length of a typical banking session. Raxx handles real financial execution actions, so 15 minutes is the tighter and more appropriate bound.

Securities-adjacent norms. Schwab Trader API uses 30 minutes; this is the longest defensible value for a brokerage-adjacent API. Raxx's 15-minute value is more conservative by design — appropriate given the paper_first_gate claim that enables live trading paths.

Sources

  1. OWASP JSON Web Token Cheat Sheet: "short expiration times (15–30 min idle timeout)" — https://cheatsheetseries.owasp.org/cheatsheets/JSON_Web_Token_for_Java_Cheat_Sheet.html
  2. Schwab Trader API OAuth documentation: access token expires 30 minutes post-issue — https://developer.schwab.com/user-guides/get-started/authenticate-with-oauth
  3. Auth0 Token Best Practices: "Give tokens an expiration"; industry implementors converge on 15 min for high-value APIs — https://auth0.com/docs/secure/tokens/token-best-practices
  4. Curity / DevToolKit 2026 security survey: "5–15 min is the current consensus for high-value API access tokens" — https://curity.io/resources/learn/jwt-best-practices/

Verdict: 15 minutes confirmed. No change from operator's working value.


2. Refresh-Token TTL + Rotation Policy

TTL recommendation: 7 days (counter-recommend from operator's 30 days)

Operator working value: 30 days. Recommendation: 7 days.

Rationale for 7 days over 30 days

The refresh token is the higher-value secret. An attacker who steals a refresh token can keep minting access tokens until the RT expires or is revoked. With a 30-day RT TTL and no rotation, an attacker can silently re-authenticate for up to 30 days before forced re-auth exposes them. With a 7-day TTL, that ceiling drops to 7 days, removing 23 days of potential exposure.

Counter-recommendation justification

The 30-day value is appropriate for server-to-server integrations or native apps with secure key storage (iOS Keychain, Android Keystore). For Antlers (browser-based), the RT lives in an HttpOnly cookie — more secure than localStorage but still subject to CSRF attack surface if same-site cookie policy is misconfigured. 7 days provides a materially tighter window at the cost of one extra passkey login per week for customers who do not use Raxx daily (which at v1 is acceptable — the passkey experience is fast).

If the operator decides to maintain 30 days, the compensating control is mandatory rotation-on-every-use + theft detection (see below), and the ADR amendment should note the 23-day residual risk window explicitly.

Never-rotate (static RT): Rejected. A stolen RT is silently valid for the full TTL with no detection mechanism.

Rotate every N refreshes: Rejected. Provides partial detection but leaves a window of N uses where theft goes undetected.

Rotate on every refresh use: Recommended. On each refresh, Queue:

  1. Validates the presented RT against queue_sessions.token_hash.
  2. Issues a new AT + new RT.
  3. Invalidates (marks revoked_at) the presented RT immediately.
  4. If the old RT is presented again (reuse detection), invalidates the entire session chain, including both the old and the newly issued RT.

Grace period: Queue accepts the previous RT for 30 seconds after rotation to accommodate mobile/slow-network retries. This matches the Okta default grace period and sits inside the retry window of up to 60 seconds that AWS Cognito allows.

Theft detection via reuse detection. If an attacker steals a RT and uses it first, the legitimate client is left holding a RT that Queue has already rotated out. The legitimate client's next refresh attempt (outside the grace period) presents that now-revoked RT. Queue detects this as a reuse event: the session chain is invalidated entirely, and an audit event session.rt_reuse_detected is emitted. The customer is effectively logged out and the attacker loses the ability to refresh at the same moment; any AT the attacker already holds expires within 15 minutes. This mechanism is described in RFC 6819 §5.2.2.3 and implemented by Okta and, as of April 2025, by AWS Cognito.
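The rotation, grace-period, and reuse-detection rules above can be sketched in memory as follows. This is a minimal illustration, not Queue's implementation: SessionChain, start_session, and refresh are hypothetical names, the real state lives in queue_sessions, and the choice to hand out a fresh RT on a retry inside the grace window is an assumption.

```python
import hashlib
import secrets
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

GRACE_SECONDS = 30  # rotation grace period from this section


def token_hash(token: str) -> str:
    return hashlib.sha256(token.encode()).hexdigest()


@dataclass
class SessionChain:
    current_hash: str
    previous_hash: Optional[str] = None
    rotated_at: float = 0.0
    revoked: bool = False
    audit_events: List[str] = field(default_factory=list)


def start_session() -> Tuple[SessionChain, str]:
    rt = secrets.token_urlsafe(32)
    return SessionChain(current_hash=token_hash(rt)), rt


def refresh(chain: SessionChain, presented_rt: str, now: float) -> Optional[str]:
    """Return a new RT, or None when the refresh is refused."""
    if chain.revoked:
        return None
    presented = token_hash(presented_rt)
    if presented == chain.current_hash:
        # Normal path: rotate and remember the superseded RT.
        new_rt = secrets.token_urlsafe(32)
        chain.previous_hash = presented
        chain.current_hash = token_hash(new_rt)
        chain.rotated_at = now
        return new_rt
    if presented == chain.previous_hash:
        if now - chain.rotated_at <= GRACE_SECONDS:
            # Retry inside the grace window (assumed behaviour: issue a
            # replacement RT without extending the window).
            new_rt = secrets.token_urlsafe(32)
            chain.current_hash = token_hash(new_rt)
            return new_rt
        # Superseded RT presented after the grace window: treat as theft.
        chain.revoked = True
        chain.audit_events.append("session.rt_reuse_detected")
    return None
```

Once the chain is revoked, every later refresh is refused regardless of which RT is presented, which is what ends the attacker's ability to mint new access tokens.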

Storage on client
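For the browser client, the lock-in table in Section 6 specifies an HttpOnly, Secure, SameSite=Strict cookie. A minimal sketch of building that Set-Cookie value with the standard library follows; the cookie name queue_rt and the Path value are hypothetical, and Max-Age assumes the recommended 7-day RT TTL.

```python
from http.cookies import SimpleCookie

RT_TTL_SECONDS = 7 * 24 * 60 * 60  # 7-day RT TTL recommended in this section


def build_rt_cookie(refresh_token: str) -> str:
    """Return the value for a Set-Cookie header carrying the refresh token."""
    cookie = SimpleCookie()
    cookie["queue_rt"] = refresh_token      # cookie name is hypothetical
    morsel = cookie["queue_rt"]
    morsel["httponly"] = True               # not readable from JavaScript
    morsel["secure"] = True                 # sent over HTTPS only
    morsel["samesite"] = "Strict"           # never sent on cross-site requests
    morsel["max-age"] = RT_TTL_SECONDS
    morsel["path"] = "/api/v1/auth"         # hypothetical: limit to auth endpoints
    return morsel.OutputString()


print(build_rt_cookie("example-refresh-token"))
```

Scoping the cookie path to the auth endpoints keeps the RT off ordinary API requests, so only refresh and logout calls ever carry it.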

Sources

  1. Okta Refresh Token Guide: rotation defaults for SPAs, 30-second grace period, reuse detection that "immediately invalidates the most recently issued refresh token and all access tokens" — https://developer.okta.com/docs/guides/refresh-tokens/main/
  2. RFC 6819 §5.2.2.3: "Refresh tokens can automatically be replaced in order to detect unauthorized token usage by another party" — https://www.rfc-editor.org/rfc/rfc6819.html
  3. AWS Cognito Refresh Token Rotation (April 2025): "previous token can remain valid for up to 60 seconds to allow for client-side retries"; best practice to reduce RT TTL when rotation is enabled — https://aws.amazon.com/about-aws/whats-new/2025/04/amazon-cognito-refresh-token-rotation/
  4. Schwab Trader API: refresh token TTL = 7 days, mandatory re-auth after expiry — https://developer.schwab.com/user-guides/apis-and-apps/oauth-restart-vs-refresh-token

Locked values:

  - RT TTL: 7 days (counter-recommended from 30 days; see disagreement flag in section 6)
  - Rotation: on every use
  - Grace period: 30 seconds
  - Reuse detection: full session-chain invalidation on duplicate RT presentation


3. Revocation Strategy

Recommended: wait-for-expiry (primary) + targeted session-id Redis blocklist (incident response only)

Three patterns evaluated

Pattern A: Wait for expiry

Access token expires in ≤15 minutes. No real-time revocation mechanism. Raptor verifies the JWT offline and it either passes or fails.

Pros: Zero added dependencies. Zero per-request overhead. Fully consistent with fail-closed posture (ADR-0068) — Queue outages don't affect Raptor's ability to verify in-flight tokens.

Cons: 15-minute revocation lag. A force-revoked session (account compromise, support action) remains technically valid for up to 15 minutes.

Fits Raxx's threat model: Yes, for normal operations: a routine logout, idle timeout, or voluntary session end, where 15 minutes of residual validity is acceptable. At v1 customer scale, forced-revocation events are rare (incident response, not daily ops).

Pattern B: Push revocation event

Queue publishes a session.revoked event to a pub/sub channel (Redis pub/sub or a lightweight in-process event). Raptor subscribes and maintains an in-memory set of session_id values revoked in the last 15 minutes (the AT TTL window).

Pros: Near-real-time revocation (sub-second propagation). No per-request Redis call.

Cons: Adds pub/sub infrastructure (Redis at minimum). In-memory revocation set on Raptor is lost on dyno restart. Co-location mitigates the pub/sub overhead but adds complexity. In Phase 4 (standalone services), this becomes a cross-service message bus dependency.

Fits Raxx's threat model: Overkill for v1. The co-location architecture (Queue and Raptor on the same dyno) means a Queue crash kills both anyway — the 15-minute window is irrelevant if the whole dyno is down.

Pattern C: Redis blocklist (per-request check)

Raptor checks Redis on every authenticated request for the session_id or jti claim. If found in the blocklist, request is rejected immediately.

Pros: Instant revocation.

Cons: Adds Redis as a hard dependency to every API request. If Redis is unreachable and Raptor fails-closed (consistent with ADR-0068), this means Redis downtime = Raptor downtime. If Raptor fails-open when Redis is down, the revocation control is defeated. Neither outcome is acceptable.

The latency overhead is 0.1–0.3ms per request on a local subnet, but this is a permanent structural dependency added for a use case (instant revocation) that is rare.

Fits Raxx's threat model: Not appropriate for v1 given resilience constraints. ADR-0068 explicitly rejects adding per-request dependencies on additional services. Redis blocklist is the right answer if Raxx reaches a scale where breach-triggered instant revocation is a regulatory requirement — deferred to post-v1 policy review.

Recommendation

Primary: wait-for-expiry. The 15-minute AT TTL is the revocation control. Paired with:

  1. Refresh-token rotation + reuse detection (Section 2): if an attacker steals a RT, the next legitimate refresh triggers detection and full session invalidation.
  2. Immediate RT invalidation on explicit logout: Queue marks queue_sessions.revoked_at at logout time. The AT may still be valid for ≤15 minutes but the attacker cannot refresh, so their window is bounded by the remaining AT lifetime at the time of logout.

Incident response (targeted blocklist): If a specific session is compromised and the operator requires instant revocation (not waiting up to 15 minutes), Queue writes the session_id to a Redis sorted set with an entry lifetime of at most 15 minutes (the longest any outstanding AT can remain valid), and Raptor checks this set on each request while the flag is enabled. This is a narrow, optionally deployed path, not part of the default hot path. It is enabled via FLAG_QUEUE_INSTANT_REVOKE and is off by default. When enabled, Redis must be in the same availability zone as Raptor to avoid introducing a cross-region dependency.

This hybrid is the minimal-complexity path that satisfies both the resilience constraint (Redis is optional, not on every request by default) and the incident-response use case.
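The flag-gated check can be sketched as below. This uses an in-memory dictionary as a stand-in for the Redis sorted set, so it shows only the control flow; SessionBlocklist and is_request_allowed are hypothetical names, and a real deployment would also need to decide what happens when Redis is unreachable.

```python
from typing import Dict

AT_TTL_SECONDS = 15 * 60  # entries never need to outlive the longest AT


class SessionBlocklist:
    """In-memory stand-in for the Redis set: session_id -> expiry timestamp."""

    def __init__(self) -> None:
        self._expiry: Dict[str, float] = {}

    def add(self, session_id: str, now: float, ttl: float = AT_TTL_SECONDS) -> None:
        self._expiry[session_id] = now + ttl

    def contains(self, session_id: str, now: float) -> bool:
        expires = self._expiry.get(session_id)
        if expires is None:
            return False
        if expires <= now:
            # Every AT this entry could match has expired; drop the entry.
            del self._expiry[session_id]
            return False
        return True


def is_request_allowed(sid: str, blocklist: SessionBlocklist,
                       instant_revoke_enabled: bool, now: float) -> bool:
    """Runs after offline JWT verification has already succeeded."""
    if not instant_revoke_enabled:
        return True  # default v1 path: no blocklist lookup at all
    return not blocklist.contains(sid, now)
```

With the flag off, the function returns before touching the blocklist, which is what keeps Redis out of the default request path.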

Sources

  1. OWASP JWT Cheat Sheet: "implement a token denylist that will be used to mimic the logout feature"; notes the blocklist "only needs to be maintained until the token has lapsed" — https://cheatsheetseries.owasp.org/cheatsheets/JSON_Web_Token_for_Java_Cheat_Sheet.html
  2. ADR-0068 (Queue: Fail-Closed on Outage): "No credential cache in Raptor that grants access when Queue is down" — the Redis-on-every-request pattern directly contradicts this invariant
  3. Token Revocation Service design: Redis blocklist with 0.1–0.3ms P99 overhead acceptable only when Redis is treated as a soft dependency (fallback = allow, not fail-closed) — https://www.techinterview.org/post/3233469926/lld-token-revocation/

4. Scenario Walkthroughs

Configuration assumed: AT TTL = 15 min, RT TTL = 7 days, rotate-on-every-use, reuse detection enabled, wait-for-expiry primary revocation.

Scenario 1: Customer logs out

  1. Customer clicks logout in Antlers.
  2. Antlers calls POST /api/v1/auth/sessions/logout on Queue with the AT (and optionally the RT cookie).
  3. Queue sets queue_sessions.revoked_at = now() for the session. The RT is immediately invalid — any subsequent refresh attempt returns 401.
  4. The current AT remains cryptographically valid for up to 15 minutes (whatever is left in its TTL). Antlers deletes the AT from memory immediately.
  5. If the AT was captured pre-logout (e.g., by an XSS payload), the attacker has ≤15 minutes of remaining AT validity, after which the token rejects. They cannot refresh because the RT is invalidated.
  6. Effective revocation from the user's perspective: immediate (no new requests succeed from Antlers). Full cryptographic expiry: ≤15 minutes.

Scenario 2: Operator force-revokes a customer's session (incident response)

  1. Operator identifies a compromised session via console audit logs.
  2. Operator calls DELETE /api/v1/admin/sessions/{session_id} via Console → Queue admin API.
  3. Queue sets queue_sessions.revoked_at = now().
  4. If FLAG_QUEUE_INSTANT_REVOKE is enabled: Queue writes session_id to the Redis blocklist with TTL = remaining AT lifetime. Raptor rejects the AT on next request. Revocation lag: < 1 second.
  5. If FLAG_QUEUE_INSTANT_REVOKE is disabled (default v1): RT is immediately invalid. AT remains valid for ≤15 minutes. After that window, the session is completely dead.
  6. Recommended for incident response: enable FLAG_QUEUE_INSTANT_REVOKE before triggering force-revoke. This is a one-time config change per incident, not a permanent hot-path addition.

Scenario 3: Customer changes password / passkey

Raxx has no password path (I-2). On passkey change (add new credential + remove old):

  1. Customer calls POST /api/v1/auth/credentials/webauthn/add (step-up gated; requires an unexpired fresh_until).
  2. Post-registration, operator policy determines whether all other sessions should be revoked. Recommended policy: revoke all other sessions (retain only the session that performed the credential change).
  3. Queue marks all other queue_sessions rows for the customer_id with revoked_at = now().
  4. Existing ATs for revoked sessions remain valid for ≤15 minutes. Since the customer can see their active sessions in the Antlers session manager, they can also initiate the revoke explicitly.
  5. If the passkey change was triggered by a suspected account takeover (the legitimate user is changing credentials because they believe an attacker accessed their account), the operator should enable FLAG_QUEUE_INSTANT_REVOKE to kill the attacker's AT before the 15-minute window elapses.
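The recommended "revoke all other sessions" policy can be illustrated with a small in-memory sketch. SessionRow and revoke_other_sessions are hypothetical names; the fields mirror the queue_sessions columns mentioned in this document, and a real implementation would be a single database update rather than a loop.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class SessionRow:
    session_id: str
    customer_id: str
    revoked_at: Optional[datetime] = None


def revoke_other_sessions(rows: List[SessionRow], customer_id: str,
                          keep_session_id: str, now: datetime) -> List[str]:
    """Revoke every active session of the customer except the one that made the change."""
    revoked = []
    for row in rows:
        if (row.customer_id == customer_id
                and row.session_id != keep_session_id
                and row.revoked_at is None):
            row.revoked_at = now
            revoked.append(row.session_id)
    return revoked


rows = [SessionRow("s1", "c1"), SessionRow("s2", "c1"), SessionRow("s3", "c2")]
print(revoke_other_sessions(rows, "c1", keep_session_id="s1",
                            now=datetime.now(timezone.utc)))
```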

Scenario 4: Customer loses a device — support revokes via console

  1. Customer contacts support. Support agent opens FreeScout ticket.
  2. Agent accesses console, navigates to customer session manager (Queue admin API: GET /api/v1/admin/customers/{customer_id}/sessions).
  3. Agent identifies the device session (by user_agent, ip_prefix, issued_at).
  4. Agent revokes the session: DELETE /api/v1/admin/sessions/{session_id}.
  5. RT invalidated immediately. AT valid for ≤15 minutes.
  6. Customer's device (even if attacker-controlled) cannot refresh. After 15 minutes, all access via that session terminates.
  7. Expected time from support call to effective lockout: < 5 minutes (support action time) + ≤15 minutes (AT expiry) = ≤20 minutes.
  8. For high-severity incidents (stolen device with suspected active use), support enables FLAG_QUEUE_INSTANT_REVOKE before the revoke call.

Scenario 5: Suspected breach — automated session kill

  1. Anomaly system detects unusual behavior (geo-anomaly, rapid position change velocity, unusual IP prefix rotation).
  2. Anomaly system calls Queue's POST /api/v1/internal/sessions/revoke-all/{customer_id} with service token.
  3. Queue marks all active sessions for the customer as revoked_at = now(), emits breach.session_killed audit event.
  4. ops@raxx.app paged within 15 minutes (per design.md breach response section).
  5. 72-hour GDPR Art. 33 clock starts on first breach.* audit write.
  6. If the anomaly system also sets FLAG_QUEUE_INSTANT_REVOKE temporarily (it can call a flag-flip endpoint), all active ATs are dead within 1 second. The flag auto-expires after 30 minutes.
  7. Required latency: < 1 second from anomaly trigger to first session rejection (with instant revoke enabled); ≤15 minutes without it. Anomaly system should enable instant revoke for breach-triggered kills.
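The auto-expiring flag in step 6 is not yet designed (see the outstanding research gaps), so the following is only one possible shape for it: a flag that records when it was enabled and reports itself as off after 30 minutes. ExpiringFlag is a hypothetical name.

```python
class ExpiringFlag:
    """Feature flag that switches itself off after a fixed lifetime."""

    def __init__(self, lifetime_seconds: float = 30 * 60) -> None:
        self._lifetime = lifetime_seconds
        self._enabled_until = 0.0

    def enable(self, now: float) -> None:
        # Re-enabling restarts the 30-minute window.
        self._enabled_until = now + self._lifetime

    def is_enabled(self, now: float) -> bool:
        return now < self._enabled_until


flag = ExpiringFlag()
flag.enable(now=0.0)
print(flag.is_enabled(now=60.0), flag.is_enabled(now=1800.0))
```

Storing an expiry time instead of a boolean means no separate cleanup job is needed to turn the flag back off after an incident.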

5. Pen-Test Threats and Residual Risks

Hard threats unique to JWT + mitigations in Queue's setup

alg:none attack

Threat: Attacker strips the JWT signature by setting "alg": "none" in the header. Libraries with permissive defaults accept unsigned tokens.

Queue's mitigation: The verification layer on Raptor explicitly specifies algorithms=["RS256"] at parse time. The none algorithm is not in the allowlist. A token with alg: none or any case variant (NoNe, NONE) is rejected with 401 before any claim is inspected.

Residual risk: None, if the JWT library enforces the allowlist strictly. Verify: use PyJWT >= 2.4.0 (patched for alg confusion) with algorithms parameter enforced. Add a CI test that sends alg: none tokens and asserts 401.
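A fixture for the suggested CI test can be built with the standard library alone. The sketch below constructs unsigned tokens and applies a strict, case-sensitive header allowlist; make_unsigned_token and header_alg_allowed are hypothetical helpers. The header check is a pre-filter only: a token that passes it still has to go through signature verification, for example PyJWT's jwt.decode(token, key, algorithms=["RS256"]).

```python
import base64
import json

ALLOWED_ALGS = frozenset({"RS256"})


def b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def b64url_decode(segment: str) -> bytes:
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def make_unsigned_token(claims: dict, alg: str = "none") -> str:
    """Test fixture: a JWT with an empty signature segment."""
    header = {"alg": alg, "typ": "JWT"}
    return f"{b64url(json.dumps(header).encode())}.{b64url(json.dumps(claims).encode())}."


def header_alg_allowed(token: str) -> bool:
    """Exact-match allowlist check on the header alg; malformed input is rejected."""
    try:
        header = json.loads(b64url_decode(token.split(".")[0]))
    except ValueError:  # covers bad base64, bad UTF-8, and bad JSON
        return False
    alg = header.get("alg") if isinstance(header, dict) else None
    return isinstance(alg, str) and alg in ALLOWED_ALGS


for variant in ("none", "None", "NONE", "NoNe"):
    print(variant, header_alg_allowed(make_unsigned_token({"sub": "x"}, variant)))
```

Because the comparison is an exact string match against the allowlist, case variants of none need no special handling.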

Algorithm confusion (RS256 → HS256 key confusion)

Threat: Attacker obtains the RSA public key (from JWKS endpoint or env), treats it as an HMAC secret, and signs a forged HS256 token. If the verification layer accepts whatever alg the token claims, the forged token passes.

Queue's mitigation: Raptor's JWT verification hardcodes algorithms=["RS256"]. The public key is loaded as an RSA public key object, not a raw bytes HMAC secret. Correct RSA key type loading prevents using it as an HMAC secret even if the alg value were somehow accepted.

Residual risk: Low. Ensure the key loading path uses from_public_pem (not raw bytes). Add a CI test that sends an HS256-signed token using the public key as the secret and asserts 401.
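To make the confusion attack concrete, the sketch below forges an HS256 token using public-key bytes as the HMAC secret, shows that a verifier which trusts the header accepts it, and shows that a pinned-algorithm check rejects it. Everything here is illustrative: PUBLIC_KEY_PEM is placeholder bytes rather than a real key, and naive_verify is a deliberately vulnerable stand-in, not Raptor's verifier.

```python
import base64
import hashlib
import hmac
import json

# Placeholder bytes standing in for the published RSA public key PEM (not a real key).
PUBLIC_KEY_PEM = b"-----BEGIN PUBLIC KEY-----\nplaceholder\n-----END PUBLIC KEY-----\n"


def b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def b64url_decode(segment: str) -> bytes:
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def forge_hs256(claims: dict, public_key_pem: bytes) -> str:
    """Attacker side: HMAC-SHA256 signature keyed with the public key bytes."""
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = b64url(json.dumps(claims).encode())
    sig = hmac.new(public_key_pem, f"{header}.{payload}".encode(), hashlib.sha256).digest()
    return f"{header}.{payload}.{b64url(sig)}"


def naive_verify(token: str, key_bytes: bytes) -> bool:
    """Vulnerable verifier: lets the token header choose the algorithm."""
    header_b64, payload_b64, sig_b64 = token.split(".")
    header = json.loads(b64url_decode(header_b64))
    if header.get("alg") == "HS256":
        expected = hmac.new(key_bytes, f"{header_b64}.{payload_b64}".encode(),
                            hashlib.sha256).digest()
        return hmac.compare_digest(expected, b64url_decode(sig_b64))
    return False  # RS256 branch omitted from this sketch


def pinned_alg_ok(token: str) -> bool:
    """Hardened pre-check: only RS256 is ever considered, whatever the header says."""
    header = json.loads(b64url_decode(token.split(".")[0]))
    return header.get("alg") == "RS256"


forged = forge_hs256({"sub": "attacker"}, PUBLIC_KEY_PEM)
print(naive_verify(forged, PUBLIC_KEY_PEM), pinned_alg_ok(forged))
```

The forged token is a usable input for the CI test described above: the production verifier should answer 401 for it.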

JWKS poisoning / JWK header injection

Threat: Attacker embeds a crafted jwk or jku parameter in the JWT header pointing to an attacker-controlled key. If the verification layer fetches keys from URLs in the token header, it accepts a token signed by the attacker's key.

Queue's mitigation: Raptor does not use JWKS discovery at verify time. The public key is pinned: loaded from QUEUE_JWT_PUBLIC_KEY (env var, loaded at startup) and QUEUE_JWT_PUBLIC_KEY_PREV (dual-accept during rotation). No URL-based key fetching. Any jwk or jku header parameters in incoming tokens are ignored — the library is configured to use only the pre-loaded key.

Residual risk: None, given pinned keys. Document this in the security runbook explicitly so a future developer does not "improve" the code by adding JWKS discovery.

JWT replay attack

Threat: A stolen AT is replayed within its TTL. Since JWTs are stateless, the same token can be used many times.

Queue's mitigation: AT TTL is 15 minutes. Tokens carry sub + sid + jti. aud is pinned to raxx-raptor-v1. If the Redis blocklist is enabled, a specific jti can be blocklisted on logout. Without the blocklist, replay is possible within the 15-minute window.

Residual risk: Medium (v1, no blocklist). A stolen AT can be replayed for ≤15 minutes. Mitigated by short TTL and by token scoping (an AT stolen from an Antlers session cannot be replayed against Console, which uses CF Access entirely separately). For v1 at low scale, acceptable. If Raxx adds mTLS or DPoP sender-constraining (RFC 9449) in Phase 4, this risk drops to near-zero.

Kid (Key ID) injection

Threat: Attacker manipulates the kid header parameter to cause directory traversal, SQL injection, or selection of a different key than intended.

Queue's mitigation: Queue's JWT signing sets kid to a static identifier (queue-v1 or the key rotation index). Raptor's verifier matches the kid to one of the two pre-loaded public keys (QUEUE_JWT_PUBLIC_KEY / QUEUE_JWT_PUBLIC_KEY_PREV). If no match, the token is rejected. No database lookup or filesystem read on kid. No dynamic key loading.
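Static kid matching amounts to an exact dictionary lookup over the two pinned keys. A minimal sketch follows; select_key is a hypothetical helper, queue-v0 is an invented identifier for the previous key, and the PEM values are placeholders standing in for QUEUE_JWT_PUBLIC_KEY and QUEUE_JWT_PUBLIC_KEY_PREV.

```python
from typing import Mapping, Optional

# Loaded once at startup; nothing in a token can add to or change this mapping.
PINNED_KEYS: Mapping[str, bytes] = {
    "queue-v1": b"<PEM from QUEUE_JWT_PUBLIC_KEY>",        # placeholder value
    "queue-v0": b"<PEM from QUEUE_JWT_PUBLIC_KEY_PREV>",   # hypothetical kid
}


def select_key(header: Mapping[str, object],
               pinned: Mapping[str, bytes] = PINNED_KEYS) -> Optional[bytes]:
    """Return the pinned key for the header's kid, or None to reject the token."""
    kid = header.get("kid")
    if not isinstance(kid, str):
        return None
    # Exact-match lookup: kid is never used as a path, query, or URL,
    # and jwk / jku header parameters are never read.
    return pinned.get(kid)


print(select_key({"kid": "queue-v1", "jku": "https://attacker.example/keys"}) is not None)
print(select_key({"kid": "../../etc/passwd"}))
```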

Residual risk: None, given static key matching.

Weak signing key

Threat: RSA private key generated with insufficient entropy or short key length (< 2048 bits).

Queue's mitigation: Keys are generated via AWS KMS asymmetric key (RSA_2048 minimum; recommended RSA_4096). KMS handles entropy and key generation; private key material never leaves KMS (signing is done via KMS API). Raptor only receives the public key half.

Residual risk: None, given KMS-backed signing. Verify: confirm QUEUE_JWT_SIGNING_KEY is a KMS key ARN, not a raw PEM-encoded private key on the dyno. If the initial implementation uses a raw RSA key pair (simpler for Phase 1 co-location), ensure it is generated at >= 2048 bits and stored only in SSM.

exp / nbf / clock skew manipulation

Threat: Attacker manipulates the clock on the verification server to accept expired tokens.

Queue's mitigation: Clock skew tolerance on Raptor's JWT verifier is set to ≤30 seconds (PyJWT's leeway parameter). Heroku dynos sync to NTP. The 30-second tolerance is consistent with OWASP's recommendation of 30–60 seconds maximum.

Residual risk: Low. NTP desync on Heroku dynos is not a credible attack surface.
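The effect of a 30-second leeway on exp and nbf can be shown with a small standalone check. This is a sketch of the comparison only, written to mirror how a leeway is conventionally applied; is_time_valid is a hypothetical helper, and in Raptor the same tolerance would be supplied through PyJWT's leeway argument to jwt.decode.

```python
LEEWAY_SECONDS = 30  # clock skew tolerance from this section


def is_time_valid(claims: dict, now: float, leeway: float = LEEWAY_SECONDS) -> bool:
    """Accept a token only inside [nbf - leeway, exp + leeway)."""
    exp = claims.get("exp")
    if not isinstance(exp, (int, float)):
        return False  # a token without a numeric exp is rejected outright
    if now >= exp + leeway:
        return False  # expired, even allowing for skew
    nbf = claims.get("nbf")
    if isinstance(nbf, (int, float)) and now < nbf - leeway:
        return False  # not yet valid, even allowing for skew
    return True


claims = {"exp": 1_000, "nbf": 100}
for t in (69, 70, 1_029, 1_030):
    print(t, is_time_valid(claims, now=t))
```

The leeway extends a 15-minute AT's worst-case acceptance to 15 minutes 30 seconds, which is worth remembering when reasoning about the revocation-lag bound.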

Summary of residual risks

Threat | Residual Risk | Mitigation Path
alg:none | None (blocked by allowlist) | CI test
RS256 → HS256 confusion | Low (key type enforced) | CI test
JWKS poisoning | None (pinned keys) | Runbook: never add JWKS discovery
JWT replay (within TTL) | Medium (15-min window) | DPoP in Phase 4; Redis blocklist for incidents
Kid injection | None (static matching) | (none listed)
Weak signing key | None (KMS-backed) | Verify KMS ARN on first deploy
RT theft (no rotation) | High if not rotated | Rotation-on-every-use (Section 2) removes this
RT theft (with rotation) | Low–Medium (7-day hard ceiling) | Reuse detection triggers full invalidation
Clock skew | Low | NTP on Heroku + 30s leeway

6. Lock-In Summary — ADR-0067 Amendment Values

The following values supersede the "15-min working value pending research" placeholder in ADR-0067 and design.md §6.

Parameter | Operator working value | Security-agent recommendation | Status
Access-token TTL | 15 min | 15 min (confirmed) | Locked
Refresh-token TTL | 30 days | 7 days (counter-recommend) | Operator decision needed
Refresh-token rotation | (not specified) | Rotate on every use | Locked
RT rotation grace period | (not specified) | 30 seconds | Locked
RT reuse detection | (not specified) | Full session-chain invalidation | Locked
Primary revocation strategy | Wait-for-expiry | Wait-for-expiry (confirmed) | Locked
Incident revocation path | (not specified) | Targeted Redis blocklist via FLAG_QUEUE_INSTANT_REVOKE | Locked
alg allowlist | RS256 (implied) | ["RS256"] hardcoded, no discovery | Locked
JWKS endpoint on Raptor | (not specified) | Pinned key in env, no JWKS URL fetching | Locked
Clock skew tolerance | (not specified) | 30 seconds (PyJWT leeway) | Locked
RT storage (browser) | (not specified) | HttpOnly, Secure, SameSite=Strict cookie | Locked

Disagreement flag: Refresh-token TTL

The operator's working value of 30 days is a defensible choice for a native mobile client with keychain storage, but is not the right default for a browser-based SPA. The key risk: a stolen RT from a browser context (XSS, CSRF on a non-Strict-cookie path, or session fixation) grants 30 days of silent re-authentication capability. Rotation-on-every-use reduces but does not eliminate this — it only triggers detection at the next legitimate use, which may be days away if the user logs in infrequently.

7 days is the recommendation because:

  1. Schwab Trader API (the most directly comparable securities-adjacent public API) uses 7 days with mandatory re-auth.
  2. It matches the "one week" mental model users already have for security-sensitive apps ("log back in weekly").
  3. If the customer is an active daily user, the 7-day window is effectively irrelevant: they are refreshing their RT constantly.
  4. If the customer is inactive, the 7-day hard ceiling limits the blast radius of a stolen RT compared to 30 days.

If the operator retains 30 days, the compensating controls are: rotation-on-every-use (required, not optional), plus an explicit runbook entry documenting the 23-day residual risk window.

Outstanding research gaps

  1. iOS native keychain storage pattern: Research is complete for browser-based RT storage. iOS RT storage via the WebAuthn platform layer + Keychain has not been validated against a specific library version. This should be verified when iOS auth work begins (OQ-3).
  2. DPoP / sender-constraining (RFC 9449): Not implemented in v1. AT replay risk remains Medium. If Raxx adds a high-frequency API trading path (where AT theft is more impactful), DPoP should be evaluated before Phase 4.
  3. Anomaly system interface to FLAG_QUEUE_INSTANT_REVOKE: The anomaly system's ability to enable the instant-revoke flag is assumed but not designed. This needs a concrete internal API contract before Phase 2.