Status: Design v1 — sub-cards pending filing
Owner: software-architect
Date: 2026-05-09 UTC
Milestone: #6 — raxx.app v1 — first non-operator user (due 2026-05-23 UTC)
Related ADRs: ADR-0065, ADR-0066, ADR-0067, ADR-0068
Sibling docs:
- docs/architecture/rbac-v2/design.md — RBAC V2 (now Queue's responsibility)
- docs/architecture/customer-audit-unified/design.md — audit chain (Queue owns the writer for customer audit dimensions)
- docs/architecture/auth.md — original auth design (superseded at the boundary by this doc)
- docs/architecture/session-engine.md — session engine (now Queue's implementation)
Refs: operator decision 2026-05-09 UTC — project_queue_identity_service.md
Raxx today is a single-operator developer tool. Milestone #6 introduces the first non-operator customer account. That step requires a clear owner for identity, sessions, passkey credentials, RBAC, and customer records — data that is currently scattered across Raptor's SQLite (backend_v2/) and the Console's Postgres DB.
On 2026-05-09 UTC the operator chose Option C: a dedicated identity/customer/RBAC service called Queue. Queue is the single source of truth for customer identity, sessions, passkey credentials, customer RBAC, and customer records.
Queue does not own: trades, positions, orders (Raptor), rotation jobs (Velvet), sentiment scoring (Reasonator), or Console operator UI/console audit.
The following are non-negotiable and override any design choice below.
| # | Invariant |
|---|---|
| I-1 | No stored credentials. Queue never stores passwords, plaintext recovery tokens, TOTP seeds, or any value that could replay a user secret. WebAuthn public keys and COSE keys are not credentials in this sense. Recovery codes are stored as one-way HMAC hashes. |
| I-2 | Passkeys / WebAuthn only. No password path, no SMS OTP, no email OTP as auth factor. Queue enforces this at the schema level — no column exists for a password hash. |
| I-3 | Email is the single contact channel, and only after verification. No phone, no SMS, no push. |
| I-4 | GDPR by default. Customer PII (email, display name, IP prefix, geoblock metadata) has explicit retention periods, DSR erasure paths, portability export, and DPA-ready audit logging. |
| I-5 | Audit trail for every state change that affects money, permissions, or data access. Every grant, revoke, session mint, session revoke, registration, erasure, and passkey change writes an audit row. |
| I-6 | Paper-first gating is enforced by Queue. Queue issues session tokens that carry the paper_first_gate claim. Live-trading paths check this claim; Queue never issues a live-enabled token without the gate being satisfied. |
| I-7 | Credentials into infra, not into code. All Queue secrets (DB URL, signing keys, KMS key ARN, service-to-service tokens) live in SSM (AWS workloads) or Infisical (vendor tokens), never in repo files. |
| I-8 | Quebec geo-block at signup. Queue enforces the geo-block at the POST /api/v1/customers (registration) endpoint, rejecting jurisdiction=QC with a friendly 422. The block is a configurable env flag so it can be lifted when fr-CA launches. |
| I-9 | Fail-closed on Queue outage. If Queue is unreachable, Raptor must reject all authenticated requests (return 503) rather than fail-open. There is no credential cache in Raptor that grants access when Queue is down. |
| I-10 | All timestamps UTC. |
Decision: Queue ships in v1 as a Flask blueprint bundle (queue/) deployed into the same Heroku app as Raptor (backend_v2), backed by Raptor's existing Postgres DB via a queue_ namespace prefix on new tables. Existing Raptor auth tables (customer_sessions, webauthn_credentials, customers, etc.) are migrated-in-place with renamed columns to match Queue's schema contract. Queue exposes /api/v1/ identity endpoints. Raptor's own auth blueprints are deprecated (feature-flagged to return 404) once Queue endpoints are live.
Why not Greenfield:
- A new Heroku app + new DB requires 5-7 dev-days for infrastructure alone, consuming the entire remaining v1 timeline.
- Data migration from Raptor's SQLite/Postgres to a new DB across two apps in a 14-day window is high-risk.
- PRs #1502, #1503, #1505, #1506, #1507, #1508 are all open and mostly done — their work is not lost; it becomes Queue's internal implementation.

Why Strangler-Fig works:
- Queue is architecturally real: it has a defined API surface, service-to-service auth, and owns its schema namespace.
- Data extraction to a standalone Heroku app is Phase 4 (post-v1), with a clean migration plan.
- The contract layer is what matters for iOS, Antlers, and SAML — not the physical hosting.
See ADR-0065 for the full decision record.
queue/ ← sibling to backend_v2/, console/
app.py ← Flask application factory
api/
routes/
auth.py ← WebAuthn register/login, sessions
customers.py ← customer records + GDPR DSR
rbac.py ← grants, permission checks
audit.py ← customer audit event writer
services/
webauthn_service.py
session_service.py
rbac_service.py
audit_writer_service.py
email_service.py
middleware/
service_auth.py ← validates inbound service tokens
rate_limiter.py
db/
migrations/ ← queue_* table migrations
tests/
Queue runs as a Flask app on the same Heroku dyno as Raptor in v1 (different port, same process group, or as a Blueprint mounted at /api/v1/ in Raptor's app factory — see ADR-0066).
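The Blueprint-mount option from ADR-0066 can be sketched as follows. Blueprint and function names here are illustrative assumptions; only the `/api/v1/` mount point and the `queue/` package layout come from this doc.

```python
# Sketch of mounting Queue's blueprints into Raptor's app factory (ADR-0066).
# Names (queue_auth, register_queue, ...) are assumptions for illustration.
from flask import Flask, Blueprint

auth_bp = Blueprint("queue_auth", __name__, url_prefix="/api/v1/auth")
customers_bp = Blueprint("queue_customers", __name__, url_prefix="/api/v1/customers")

def register_queue(app: Flask) -> Flask:
    """Mount Queue's route blueprints under /api/v1/ in the host app."""
    app.register_blueprint(auth_bp)
    app.register_blueprint(customers_bp)
    return app
```

In the co-location phase this keeps Queue a single import away from extraction: Phase 4 replaces `register_queue(app)` with a standalone `app.py` that registers the same blueprints.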
All Queue-owned tables are prefixed queue_ to make namespace ownership unambiguous during the co-location phase. Post-extraction they are renamed to their canonical names.
-- Customers (replaces/renames existing 'customers' table)
CREATE TABLE queue_customers (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
email TEXT UNIQUE NOT NULL,
email_verified_at TIMESTAMPTZ NULL,
display_name TEXT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
deleted_at TIMESTAMPTZ NULL, -- soft-delete for DSR audit
jurisdiction TEXT NULL, -- 'US' | 'CA' | 'CA-QC' (blocked at signup)
geo_block_reason TEXT NULL, -- e.g. 'QC_PRE_FR_LAUNCH'
paper_first_cycles INTEGER NOT NULL DEFAULT 0,
paper_first_gate_met BOOLEAN NOT NULL DEFAULT false,
schema_version INTEGER NOT NULL DEFAULT 1
);
-- WebAuthn credentials (replaces existing webauthn_credentials)
CREATE TABLE queue_webauthn_credentials (
id TEXT PRIMARY KEY, -- credential_id, base64url
customer_id UUID NOT NULL REFERENCES queue_customers(id) ON DELETE CASCADE,
public_key BYTEA NOT NULL, -- COSE key; not a secret
sign_count INTEGER NOT NULL DEFAULT 0,
transports TEXT NULL, -- csv: 'usb,nfc,internal'
aaguid TEXT NULL,
device_label TEXT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
last_used_at TIMESTAMPTZ NULL,
backup_eligible BOOLEAN NOT NULL DEFAULT false,
backup_state BOOLEAN NOT NULL DEFAULT false
);
-- Sessions (replaces/extends existing customer_sessions)
CREATE TABLE queue_sessions (
  id TEXT PRIMARY KEY, -- opaque session id (the JWT sid claim)
customer_id UUID NOT NULL REFERENCES queue_customers(id) ON DELETE CASCADE,
credential_id TEXT REFERENCES queue_webauthn_credentials(id),
token_hash TEXT NOT NULL, -- SHA-256 of bearer; the raw token is never stored
issued_at TIMESTAMPTZ NOT NULL DEFAULT now(),
idle_timeout_secs INTEGER NOT NULL DEFAULT 1800, -- 30 min idle
absolute_expires_at TIMESTAMPTZ NOT NULL, -- 12h hard ceiling
revoked_at TIMESTAMPTZ NULL,
fresh_until TIMESTAMPTZ NOT NULL, -- step-up expiry
last_seen_at TIMESTAMPTZ NULL,
ip_prefix TEXT NULL, -- /24 IPv4 or /48 IPv6 (minimized PII)
user_agent TEXT NULL
);
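Minting a session row under I-1 ("no stored credentials") means the raw bearer token is returned to the client exactly once and only its SHA-256 digest is persisted, matching the `token_hash` column above. A minimal sketch, assuming `secrets.token_urlsafe(32)` as the 256-bit token source (the doc specifies only "256-bit random"):

```python
import hashlib
import secrets
from datetime import datetime, timedelta, timezone

ABSOLUTE_TTL = timedelta(hours=12)   # absolute_expires_at hard ceiling
IDLE_TIMEOUT_SECS = 1800             # 30-minute idle window

def mint_session_token() -> tuple[str, dict]:
    """Return (raw_token, row). Only the row is persisted to queue_sessions;
    the raw token goes to the client once and is never stored (I-1)."""
    raw = secrets.token_urlsafe(32)  # 32 bytes = 256 bits of randomness
    now = datetime.now(timezone.utc)  # all timestamps UTC (I-10)
    row = {
        "token_hash": hashlib.sha256(raw.encode()).hexdigest(),
        "issued_at": now,
        "idle_timeout_secs": IDLE_TIMEOUT_SECS,
        "absolute_expires_at": now + ABSOLUTE_TTL,
    }
    return raw, row
```

Lookup on a request hashes the presented bearer and matches it against `token_hash`, so a DB dump never yields a replayable token.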
-- Email verifications
CREATE TABLE queue_email_verifications (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
customer_id UUID NOT NULL REFERENCES queue_customers(id) ON DELETE CASCADE,
email TEXT NOT NULL,
token_hash TEXT NOT NULL, -- SHA-256 of single-use link; raw token never stored
expires_at TIMESTAMPTZ NOT NULL, -- 15 min
consumed_at TIMESTAMPTZ NULL,
purpose TEXT NOT NULL CHECK (purpose IN ('initial','recovery','rectification')),
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
-- Backup / recovery codes
CREATE TABLE queue_backup_codes (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
customer_id UUID NOT NULL REFERENCES queue_customers(id) ON DELETE CASCADE,
code_hmac TEXT NOT NULL, -- HMAC-SHA-256 of raw code; raw code never stored
batch_id UUID NOT NULL, -- all codes in one generate() call share a batch_id
used_at TIMESTAMPTZ NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE INDEX idx_queue_backup_codes_customer ON queue_backup_codes(customer_id);
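Backup-code handling under I-1 can be sketched like this: one `generate()` call produces a batch sharing a `batch_id`, raw codes are shown to the customer once, and only keyed HMAC-SHA-256 digests persist. The code format (10 hex chars) and batch size are illustrative assumptions; the HMAC construction matches the `code_hmac` column comment.

```python
import hashlib
import hmac
import secrets
import uuid

def generate_backup_codes(server_key: bytes, n: int = 10) -> tuple[list, list]:
    """Return (raw_codes, rows). Raw codes are displayed once; only the
    HMACs are written to queue_backup_codes (I-1)."""
    batch_id = str(uuid.uuid4())  # all codes from one call share a batch_id
    raw_codes, rows = [], []
    for _ in range(n):
        code = secrets.token_hex(5)  # 10-char hex code; format is illustrative
        raw_codes.append(code)
        rows.append({
            "code_hmac": hmac.new(server_key, code.encode(), hashlib.sha256).hexdigest(),
            "batch_id": batch_id,
            "used_at": None,
        })
    return raw_codes, rows

def verify_backup_code(server_key: bytes, candidate: str, rows: list) -> bool:
    """Constant-time match of a presented code against unused rows."""
    mac = hmac.new(server_key, candidate.encode(), hashlib.sha256).hexdigest()
    return any(hmac.compare_digest(mac, r["code_hmac"])
               for r in rows if r["used_at"] is None)
```

A successful verification would also stamp `used_at` on the matched row so each code is single-use.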
-- WebAuthn registration challenges (short-lived, TTL-enforced)
CREATE TABLE queue_webauthn_challenges (
challenge_hash TEXT PRIMARY KEY, -- SHA-256 of raw challenge
customer_id UUID NULL, -- null during registration (pre-user)
purpose TEXT NOT NULL CHECK (purpose IN ('register','login','add_device')),
expires_at TIMESTAMPTZ NOT NULL, -- 60s
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
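The challenge lifecycle above (store only the SHA-256 of the raw challenge, enforce the 60s TTL at the application layer) can be sketched in two small functions. Field names follow the table; the function shapes are assumptions.

```python
import hashlib
from datetime import datetime, timedelta, timezone

CHALLENGE_TTL = timedelta(seconds=60)  # TTL enforced in-app, per the schema

def store_challenge(raw_challenge: bytes, purpose: str) -> dict:
    """Build the queue_webauthn_challenges row: only the hash is stored."""
    now = datetime.now(timezone.utc)
    return {
        "challenge_hash": hashlib.sha256(raw_challenge).hexdigest(),
        "purpose": purpose,  # 'register' | 'login' | 'add_device'
        "expires_at": now + CHALLENGE_TTL,
        "created_at": now,
    }

def challenge_is_valid(row: dict, raw_challenge: bytes, now: datetime) -> bool:
    """A challenge verifies only if the hash matches and the TTL holds."""
    return (row["challenge_hash"] == hashlib.sha256(raw_challenge).hexdigest()
            and now < row["expires_at"])
```

The nightly cleanup job mentioned under retention then only removes rows that are already unusable.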
Queue owns all RBAC tables. The existing Console Postgres tables (rbac_groups, rbac_roles, etc. from migration 0021) are the current source of truth for operator-side RBAC. In Phase 1, Queue reads/writes them directly (co-located DB); in Phase 3 they are extracted to Queue's DB alongside customer RBAC.
Queue introduces customer-facing RBAC (queue_customer_roles) that is separate from operator RBAC:
-- Customer product-tier roles (antlers-user, antlers-founders, antlers-pro)
CREATE TABLE queue_customer_roles (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
customer_id UUID NOT NULL REFERENCES queue_customers(id) ON DELETE CASCADE,
role TEXT NOT NULL, -- 'antlers-user' | 'antlers-founders' | 'antlers-pro'
granted_at TIMESTAMPTZ NOT NULL DEFAULT now(),
granted_by TEXT NOT NULL, -- 'system' | 'operator:<admin_id>'
revoked_at TIMESTAMPTZ NULL,
CONSTRAINT chk_queue_customer_role CHECK (
role IN ('antlers-user','antlers-founders','antlers-pro','antlers-support-readonly')
)
);
Operator RBAC (groups, roles, permissions, grants, ticket-scoped) remains in the existing rbac_* tables until Phase 3 extraction.
Decision: Signed JWT issued by Queue; Raptor verifies offline. (See ADR-0067.)
This is the critical performance decision. Two options were considered:
- Opaque token with per-request introspection: Raptor calls Queue on every authenticated request. Revocation is instant, but every call pays a Queue round-trip.
- Signed JWT verified offline against QUEUE_JWT_PUBLIC_KEY (asymmetric RS256): no Queue call per request. Revocation is handled via a short-lived token window — a revoked session takes at most 15 minutes to expire (acceptable for v1 scale).

Chosen: Signed JWT (RS256, 15-min TTL). Rationale:
- Claims: sub (customer_id), sid (session_id), tier, roles, paper_first_gate, fresh_until, iat, exp.
- Keys: QUEUE_JWT_SIGNING_KEY (private, in SSM) and QUEUE_JWT_PUBLIC_KEY (public, in env on Raptor), with a dual-accept overlap window of 5 minutes during rotation.
- Step-up and freshness checks call GET /api/v1/sessions/current/status — these are low-frequency.

Service-to-service auth: Raptor, Console, Velvet, and Reasonator call Queue using per-service HMAC-signed tokens:
- Each service holds a QUEUE_SERVICE_TOKEN_<SERVICE> environment variable (SSM for AWS-resident services, Infisical for vendor tokens).
- Tokens are sent as Authorization: Bearer <token> on Queue-internal endpoints (/api/internal/v1/*).

Failure modes:

| Failure | Behavior |
|---|---|
| Queue service down (co-located same dyno) | Process crash takes both Raptor and Queue down. Returns 503 to all clients. Session JWTs already in flight continue to be valid for up to 15 minutes (offline verification). |
| Queue DB unreachable | Queue returns 503 to all new auth requests. Existing valid JWTs continue to work until TTL expires. No fail-open. |
| KMS unreachable (audit HMAC) | Audit writes queue a retry (in-memory or Postgres job table). Auth and session operations continue normally. KMS failure does not block login. |
| JWT signing key rotation mid-flight | Dual-accept window of 5 minutes. Raptor accepts tokens signed by either the current or previous key during the window. |
Queue down = all new sessions blocked. This is intentional. The alternative (fail-open) risks unauthenticated access. The outage surface is mitigated by co-location (same dyno, same availability as Raptor today).
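The 5-minute dual-accept window during signing-key rotation can be sketched generically: Raptor tries the current public key first, then the previous one. The `decode` callable stands in for a real RS256 verifier (e.g. PyJWT's `jwt.decode`); its exact shape here is an assumption for illustration.

```python
# Sketch of dual-accept JWT verification during key rotation.
# `decode(token, key)` is a stand-in for a real RS256 verify call and is
# expected to raise InvalidToken on signature mismatch.
class InvalidToken(Exception):
    pass

def verify_with_rotation(token: str, decode, current_key, previous_key=None) -> dict:
    """Return the claims if the token verifies under either accepted key.

    previous_key is only populated during the 5-minute overlap window;
    outside the window it is None and only current_key is accepted.
    """
    for key in (current_key, previous_key):
        if key is None:
            continue
        try:
            return decode(token, key)
        except InvalidToken:
            continue  # try the next accepted key
    raise InvalidToken("token not signed by an accepted key")
```

Once the window closes, dropping `previous_key` back to `None` is all it takes to stop honoring tokens from the retired key.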
Decision: Queue owns the audit writer for all three dimensions of customer audit events.
Rationale: audit events are fundamentally about customer identity and access. The audit writer in PR #1506 lives in Raptor today only because Queue did not exist. Once Queue is live, POST /api/internal/v1/audit/event in Queue replaces the Raptor audit endpoint. Raptor becomes a caller of Queue's audit writer, not the owner.
The customer_audit_events table (schema from PR #1502/migration 016) stays in Raptor's Postgres DB during Phase 1-2 (co-location). It moves to Queue's DB in Phase 3.
Console becomes a thin operator UI layer. Queue's admin surface (/api/v1/admin/*) exposes customer management, RBAC admin, and audit query. Console calls Queue's admin endpoints rather than reading the DB directly. Console retains:
- console_audit_log (operator-side audit stays with Console)

Console loses: direct customer-table reads, direct RBAC table writes. These move to Queue's admin API.
sequenceDiagram
participant U as User (Antlers)
participant Q as Queue
participant R as Raptor
participant M as Postmark
U->>Q: POST /api/v1/auth/webauthn/register/begin {email, jurisdiction}
Q->>Q: Check jurisdiction (reject QC with 422)
Q->>Q: Generate WebAuthn challenge, store queue_webauthn_challenges
Q-->>U: PublicKeyCredentialCreationOptions
U->>U: Browser prompts Face ID / YubiKey
U->>Q: POST /api/v1/auth/webauthn/register/complete {attestation}
Q->>Q: py-webauthn verify attestation
Q->>Q: INSERT queue_customers + queue_webauthn_credentials
Q->>Q: INSERT queue_customer_roles (antlers-user)
Q->>Q: Emit customer_audit_events (customer.registered)
Q->>M: Send verification email (single-use token)
Q-->>U: {customer_id, needs_email_verification: true}
U->>Q: POST /api/v1/auth/email/verify {code}
  Q->>Q: Stamp email_verified_at
Q->>Q: Emit customer_audit_events (email.verified)
Q-->>U: {verified: true}
sequenceDiagram
participant U as User (Antlers)
participant Q as Queue
participant R as Raptor
U->>Q: POST /api/v1/auth/webauthn/login/begin
Q-->>U: PublicKeyCredentialRequestOptions (allowCredentials:[])
U->>U: Browser shows passkey picker
U->>Q: POST /api/v1/auth/webauthn/login/complete {assertion}
Q->>Q: Verify assertion, update sign_count
Q->>Q: Mint queue_sessions row + sign JWT (RS256, 15-min TTL)
Q->>Q: Emit customer_audit_events (session.issued)
Q-->>U: Set-Cookie (HttpOnly) + {jwt, customer_id, roles}
U->>R: GET /api/v1/portfolio (Authorization: Bearer <jwt>)
R->>R: Verify JWT offline (RS256 public key)
R-->>U: Portfolio data
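After Raptor's offline verification, I-6 requires a claim check before any live-trading path runs. A minimal sketch, assuming `claims` is the decoded JWT payload; claim names (`paper_first_gate`, `exp`) come from this design, while the guard function itself is illustrative.

```python
# Sketch of Raptor's post-verification claim check for live trading (I-6).
from datetime import datetime, timezone

class Forbidden(Exception):
    pass

def require_live_trading(claims: dict) -> None:
    """Reject a live-trading request unless the paper-first gate claim is
    set and the token is still within its TTL. Queue never issues a
    live-enabled token without the gate satisfied, so this is defense in
    depth on the Raptor side."""
    if not claims.get("paper_first_gate"):
        raise Forbidden("paper_first_gate not satisfied")
    if datetime.now(timezone.utc).timestamp() >= claims["exp"]:
        raise Forbidden("token expired")
```

Paper-trading endpoints would skip this guard and rely on the signature/TTL checks alone.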
| Phase | Gate | Description |
|---|---|---|
| Dark | FLAG_QUEUE_V1=off | Queue code deployed; all endpoints return 404. Migrations run. |
| Internal | FLAG_QUEUE_V1=staging | Queue endpoints live on staging. Raptor auth blueprints remain active. Dual-mode middleware logs both paths. |
| Beta | FLAG_QUEUE_V1=beta | Queue endpoints live on prod. Raptor auth blueprints still active (fallback). |
| Cutover | FLAG_QUEUE_V1=on, FLAG_RAPTOR_AUTH_LEGACY=off | Raptor auth blueprints return 404. Queue is the sole auth surface. |
| Cleanup (post-v1) | — | Remove legacy Raptor auth blueprints; migrate tables to Queue's own DB. |
Each FLAG_QUEUE_* flag must have a console_flag_promotions row before being promoted to production.
PII collected:
- email (registration, verification) — verified email only
- display_name (optional, customer-supplied)
- ip_prefix (/24 or /48; minimized) — on sessions only
- jurisdiction — country/province code for geo-block
Retention:
- Customer records: active until DSR erasure request + 30-day cooling period
- Sessions: purged at absolute_expires_at; audit shadow retained 2 years
- Audit events: 2 years (GDPR Art. 30 obligation)
- Backup codes: purged with customer record on DSR erasure
- WebAuthn challenges: 60s TTL enforced at application layer + nightly cleanup job
DSR erasure path:
- Soft-delete queue_customers.deleted_at
- Purge email and display_name after 30-day cooling period
- Pseudonymize customer_id in audit rows (replace with dsr_pseudonym_<hash>)
- Revoke all active sessions
- Invalidate all passkeys
- Emit customer.erased audit event (retained with pseudonym for 2 years)
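The pseudonymization step can be sketched as below. The doc specifies only the `dsr_pseudonym_<hash>` format; the salted SHA-256 construction and 16-char truncation are assumptions — any one-way, server-keyed mapping that prevents re-identification from the audit table alone would satisfy the requirement.

```python
import hashlib

def dsr_pseudonym(customer_id: str, salt: bytes) -> str:
    """Value that replaces customer_id in audit rows on DSR erasure.

    Salted SHA-256 (truncated) is an illustrative choice: deterministic per
    customer so the 2-year audit chain stays linkable, but not reversible
    without the server-side salt."""
    digest = hashlib.sha256(salt + customer_id.encode()).hexdigest()[:16]
    return f"dsr_pseudonym_{digest}"
```

Determinism matters here: every audit row for the erased customer maps to the same pseudonym, preserving the event chain required by the 2-year retention rule.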
Audit trail: Every Queue endpoint that mutates customer, session, credential, or RBAC state emits to customer_audit_events. Audit writes use HMAC-SHA-256 + AWS KMS (alias/raxx-audit-hmac, ARN in SSM at /raxx/audit/hmac-key-arn) per ADR-0058 and the KMS budget approved 2026-05-09 UTC.
Breach response:
- Any breach.* action in customer_audit_events triggers the GitHub Actions breach-notification pipeline (per auth.md §8).
- 72-hour GDPR Art. 33 clock starts on first breach audit write.
- ops@raxx.app paged within 15 minutes.
- Per-service tokens revocable without redeploy by rotating SSM params and restarting dynos.
Kill-switch:
- FLAG_QUEUE_V1=off disables all Queue endpoints.
- QUEUE_REVOKE_ALL=1 revokes all customer sessions (writes audit row per session).
- AUTH_DISABLED=1 (existing Raptor flag) returns 503 on all auth attempts.
No stored credentials: enforced at schema level — no columns named password, secret, otp_seed, recovery_token exist in Queue tables. CI grep (scripts/ci/check_no_credential_fields.sh) covers queue/ directory.
These require operator decision before the corresponding sub-cards can be claimed.
OQ-1 — Audit events physical location (post-extraction)
Queue logically owns audit events. Currently customer_audit_events lives in Raptor's Postgres (migration 016, PR #1502). Should extraction to Queue's own DB (Phase 3) also move the audit table, or keep audit in Raptor's DB (closer to the writers) with Queue as the API owner only? Recommendation: move to Queue's DB in Phase 3 for clean ownership. If kept in Raptor's DB, the audit API is still Queue's but Queue queries cross-DB.
OQ-2 — Console scope shrinkage When Queue owns customers, Console's customer-admin endpoints become proxies to Queue's admin API. Is the operator comfortable with Console having no direct DB reads for customer data? This is the correct long-term posture but requires that Queue's admin API expose sufficient query depth for Console's operator workflows.
OQ-3 — iOS (#167) timing Does iOS v1 launch with Queue API or the older direct-Raptor pattern? If iOS launches simultaneously with Queue, all iOS auth work must target Queue's endpoints. If iOS launches post-v1, it can ignore this design for now. Decision needed before filing iOS sub-cards.
OQ-4 — Session token revocation window The JWT approach means a revoked session remains technically valid for up to 15 minutes (JWT TTL). Is this acceptable, or does the operator require instant revocation? Instant revocation requires Raptor to call Queue on every request (adds latency) or uses a short-lived token blocklist (adds Redis dependency). Recommendation: accept 15-minute window for v1; add a blocklist if a security incident demands it.