Raxx · internal docs


Passkey E2E Encryption with Opt-In Shadow Analytics

Status: Draft — pending Kristerpher's path decision (see ADR 0017)
Owner: software-architect
Date: 2026-04-24
Parent issue: #250 — Passkey-keyed end-to-end encryption for customer data
Related ADRs: 0001, 0002, 0003, 0017


1. Context

Issue #250 proposes passkey-keyed E2E encryption for all customer data. The consequential downside: Raxx can't read user trade content, which kills all server-side analytics on actual trade patterns — cohort insights, aggregate strategy recommendations, regime-pattern discovery, leaderboards (#211), and product-improvement telemetry on real behavior.

This doc evaluates a middle path: E2E encryption stays as the base invariant. Users who want to contribute to the analytics pool can opt in to a client-side shadow pipeline that anonymizes and aggregates their signals before submitting them to a separate, operator-readable analytics store. No individual trade data ever leaves the device in plaintext. The operator learns aggregate patterns; individual behavior stays private.

This is a research-and-design document. It does not authorize implementation. ADR 0017 records the architecture decision.


2. Invariants (non-negotiable)

The following invariants from the project-level constitution apply here with full force:

  1. No stored credentials. The PRF extension output (used for key derivation) must never be stored server-side, logged, or transmitted beyond the in-memory crypto operation.
  2. Passkeys / WebAuthn only. The encryption key derivation ceremony is passkey-rooted; there is no alternative key path that does not require a passkey assertion.
  3. GDPR by default. Opt-in consent must be compliant with GDPR Art. 7 (freely given, specific, informed, unambiguous, withdrawable). Withdrawal must produce a hard delete from the analytics store within 30 days per Art. 17.
  4. Audit trail. Consent grants, consent withdrawals, and analytics-delete events are written to audit_log.
  5. Paper-first gating. Unaffected by this design; live-trading paths remain gated.
  6. No PII in analytics store. The shadow store must be structurally incapable of re-identification. This is a design constraint, not a policy — the schema must make it impossible, not merely unlikely.

Any architecture that weakens these is a violation, not a tradeoff.


3. Prior Art

Signal — Sealed Sender + Private Contact Discovery

Signal's design insight: separate the "what" (message content, E2E encrypted) from the "that" (communication metadata). For metadata-resistant features, Signal runs private contact discovery inside SGX secure enclaves with oblivious memory-access patterns, so even contact discovery doesn't leak to the server. Relevant pattern: the operator learns nothing about individual interaction while still running the service. Signal does not run aggregate analytics on user behavior at all; they accept the product cost as brand differentiation. That is one legitimate choice.

Apple — Differential Privacy in iOS/macOS

Apple ("Learning with Privacy at Scale", 2017; WWDC 2017 session) collects keyboard usage, emoji frequency, Health type preferences, and Safari crash data using Local Differential Privacy (LDP). The core technique: before any data leaves the device, the client applies a randomized-response-style mechanism (count-mean-sketch) that adds calibrated noise. The server sees aggregate histograms but cannot recover any individual's input even with access to all submitted records.

Concrete limits Apple respects: k≥100,000 users before publishing any aggregate (cohort floor). Privacy budget ε ≈ 4 per day per feature domain (moderate privacy).

Raxx scale concern: Apple's privacy budget works at hundreds of millions of devices. Raxx's user base in v1 will be in the thousands. At small cohorts, DP noise must be much larger (lower ε or higher noise) to prevent inference attacks, which may make aggregate data useless. This is a quantitative decision that depends on actual cohort sizes.

Google — RAPPOR (Randomized Aggregatable Privacy-Preserving Ordinal Response)

Google's 2014 RAPPOR paper (Erlingsson, Pihur, Korolova) is the canonical client-side LDP implementation. It uses Bloom filter encoding + Permanent Randomized Response + Instantaneous Randomized Response to collect frequency histograms of categorical values (e.g., "what is your default browser?") with a controlled privacy budget.

RAPPOR is well-suited to categorical signals (strategy family, error type, win-rate bucket). It performs poorly on continuous numeric signals (P&L, hold period in exact minutes). For continuous signals, bucketing into ≤10 categories first, then applying RAPPOR, is the standard approach.
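The bucket-then-randomize step can be sketched in a few lines. This is a minimal illustration, using basic randomized response rather than RAPPOR's full Bloom-filter encoding; the bucket labels are the win-rate buckets from §5, and the function names are illustrative:

```python
import random

BUCKETS = ["<30%", "30-50%", "50-70%", ">70%"]  # win-rate buckets from §5

def bucket_win_rate(win_rate):
    """Map a continuous win rate (0.0-1.0) to a categorical bucket first;
    hashing or noising the raw value directly would not anonymize it."""
    if win_rate < 0.30:
        return "<30%"
    if win_rate < 0.50:
        return "30-50%"
    if win_rate < 0.70:
        return "50-70%"
    return ">70%"

def randomized_response(true_bucket, p_truth=0.75, rng=None):
    """Report the true bucket with probability p_truth, otherwise a
    uniformly random bucket. A simplification of RAPPOR's permanent
    randomized response: any single report is plausibly deniable, but
    aggregate frequencies remain estimable server-side."""
    rng = rng or random.Random()
    if rng.random() < p_truth:
        return true_bucket
    return rng.choice(BUCKETS)
```

With p_truth=0.75 and four buckets, a reported bucket is the true one with probability 0.75 + 0.25/4, which the server can invert when estimating the aggregate histogram.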

Google — Federated Learning

Google's federated learning (2017, McMahan et al.) keeps training data on-device; only model gradients are aggregated server-side. This is more relevant for a recommendation engine than for analytics, but the boundary it enforces — gradients, not data — is the right mental model for Raxx's shadow pipeline: aggregates, not records.

For Raxx v1, full federated learning is out of scope (requires a model training infrastructure). But the "client computes the aggregate locally, sends only the result" pattern is directly applicable.

Proton Mail / Proton Drive

Proton's analytics approach: they collect zero content analytics. They do collect operational metrics (delivery success rates, storage usage, uptime) at the infrastructure layer, never at the content layer. Their transparency reports disclose counts of legal requests received, not user behavior data. They accept this as brand-defining.

Proton runs a separate, privacy-isolated "product usage statistics" feature (opt-in) that sends event types (clicked feature X) but never content. Events are stripped of user ID before storage. This is functionally identical to Raxx's proposed shadow pipeline — same structure, different domain.

Standard Notes

Standard Notes uses zero-knowledge architecture for all note content. For analytics they rely entirely on opt-in crash reporting (Sentry, user-initiated). They do no behavioral analytics. They accept the product cost. Their reasoning: "We can't improve what we can't see, so we ship conservatively and ask users explicitly."

Tuta (formerly Tutanota)

Tuta encrypts all email content and metadata (subject, sender, recipient in the body) client-side. For product analytics they ship Matomo (self-hosted) with IP anonymization and have users opt in. They do not correlate analytics events to user content. Their shadow analytics pattern: session-scoped pseudonymous identifier (resets on logout) + event type + no content fields. This is directly applicable to Raxx.

DuckDuckGo

DuckDuckGo collects aggregate query counts (what was searched, not who searched it) using hash bucketing to prevent single-query identification. Their key constraint: no persistent user identifier exists; every anonymous search is disconnected from every other. This is more radical than Raxx needs — Raxx has authenticated users who consent to aggregated analytics.

Cloudflare Privacy Gateway (OHTTP)

OHTTP (Oblivious HTTP, RFC 9458) inserts a relay between client and server so the server never sees the originating IP. Cloudflare's Privacy Gateway implements this pattern, and Apple's iCloud Private Relay uses a related two-hop relay design. Relevant if Raxx wants analytics submissions that can't be correlated by IP — but at Raxx's scale, the implementation cost exceeds the marginal privacy gain. Not recommended for v1; note as a future option.


4. Anonymization Primitives — Applicability to Raxx

4.1 K-Anonymity

A dataset satisfies k-anonymity if every record is indistinguishable from at least k-1 others on the quasi-identifiers. For cohort analytics (strategy family, win-rate bucket, hold period bucket), this means: do not publish any bucket with fewer than k users.

Recommended k: 20 for v1. If a strategy family has fewer than 20 users who opted in, suppress that bucket entirely. This is a simple server-side enforcement rule.

Limitation: K-anonymity does not protect against linkage attacks or homogeneity attacks (if all k records in a bucket have the same value, the suppression is trivially defeated). Pair with data minimization (don't store anything not needed for the specific aggregate).
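The suppression rule itself is a one-line server-side filter. A sketch, with illustrative function name and input shape:

```python
K_FLOOR = 20  # proposed v1 value (§13 open question 3)

def suppress_small_buckets(bucket_counts, k=K_FLOOR):
    """Drop any aggregate bucket with fewer than k contributing users.
    Applied server-side before any aggregate is published."""
    return {bucket: n for bucket, n in bucket_counts.items() if n >= k}

# Example: the under-populated bucket disappears entirely.
# suppress_small_buckets({"credit_spread": 41, "iron_condor": 7})
#   -> {"credit_spread": 41}
```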

4.2 Differential Privacy (DP)

DP adds calibrated noise to query results so that the presence or absence of any single record changes the output only imperceptibly. The formal guarantee: for all adjacent datasets D and D' (differing by one record), the probability of any output O satisfies P[M(D)=O] / P[M(D')=O] ≤ exp(ε).

For Raxx: DP is most useful for continuous analytics (how many users opened a position in high-VIX regimes? what is the average win rate across strategy families?). With a privacy budget of ε=1 per query domain per week, the noise added would be detectable but statistically manageable at >1,000 respondents. Below 500 respondents, results may be noise-dominated and analytically useless.

Recommendation: Implement DP at the query layer in the analytics service (not client-side), using the Laplace mechanism for count queries and the Gaussian mechanism for histogram queries. At v1 scale, DP is a "ship-ready" guarantee primarily useful for future audit defense, not necessarily for meaningful noise reduction.

Privacy budget management: Each analytics query type consumes budget. A weekly reset with per-domain ceilings (strategy analytics ε=2/week, error analytics ε=1/week) is workable. This must be enforced at the analytics API layer.
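Both pieces live at the query layer. A sketch assuming the per-domain weekly ceilings above; class and function names are illustrative, and the weekly reset scheduler is omitted:

```python
import math
import random

class PrivacyBudget:
    """Tracks per-domain epsilon spend against weekly ceilings
    (e.g. strategy analytics 2.0/week, error analytics 1.0/week).
    A scheduler (not shown) would reset spend weekly."""
    def __init__(self, ceilings):
        self.ceilings = dict(ceilings)
        self.spent = {domain: 0.0 for domain in ceilings}

    def charge(self, domain, epsilon):
        """Reserve budget for one query; refuse once the ceiling is hit."""
        if self.spent[domain] + epsilon > self.ceilings[domain]:
            raise RuntimeError(f"privacy budget exhausted for {domain!r}")
        self.spent[domain] += epsilon

def laplace_count(true_count, epsilon, rng=None):
    """Laplace mechanism for a count query: sensitivity 1, scale b = 1/epsilon.
    Noise is sampled via the inverse-CDF method from a uniform draw."""
    rng = rng or random.Random()
    u = rng.random() - 0.5  # uniform on (-0.5, 0.5)
    b = 1.0 / epsilon
    noise = -b * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_count + noise
```

Note the scale b = 1/ε: at small cohorts the analyst must either tolerate large noise or spend more budget, which is exactly the scale concern raised in §3.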

4.3 Hashing / Tokenization

Suitable for categorical identifiers (strategy family name → hash). Useless for numeric signals (hashing a P&L value doesn't anonymize it — the distribution remains identifiable). Do not use for continuous values.

4.4 Aggregation-Only (Client Pre-Aggregation)

The simplest and most reliable technique: the client never sends individual records. It pre-aggregates over a time window (weekly), buckets continuous values, and sends only the bucket counts. The server receives "this user opened 3 trades in the credit-spread family at VIX>20 this week" rather than the three individual trades.

This is the recommended primary technique for Raxx's shadow pipeline. It requires no special crypto, is easy to audit, and produces the most useful signal at small scale.

What it protects against: The server cannot reconstruct individual trades even if the analytics store is fully breached, because individual records never existed there.

What it doesn't protect against: Timing correlation (if only one user traded in a given bucket this week, the bucket count=1 row is a de-anonymization event). Mitigated by k-anonymity floor: suppress buckets with count<k.
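The weekly pre-aggregation can be sketched as follows. Event field names ("strategy_family", "vix_at_entry") and the payload shape are illustrative, not a fixed schema:

```python
from collections import Counter

def preaggregate_week(trade_events, epoch_week):
    """Client-side weekly pre-aggregation: individual trade records stay on
    the device; only bucketed counts leave it. The server never receives
    the per-trade inputs to this function."""
    counts = Counter(
        (ev["strategy_family"],
         "VIX>20" if ev["vix_at_entry"] > 20 else "VIX<=20")
        for ev in trade_events
    )
    return [
        {"scope": "strategy_patterns",
         "event_type": "weekly_family_regime_counts",
         "epoch_week": epoch_week,
         "payload": {"family": family, "regime": regime, "count": n}}
        for (family, regime), n in sorted(counts.items())
    ]
```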

4.5 Homomorphic Encryption

Allows the server to compute on ciphertext without decrypting. Fully Homomorphic Encryption (FHE) remains impractically slow for most real-time use cases (though libraries like Microsoft SEAL and OpenFHE are maturing). Partially Homomorphic Encryption (PHE) is fast for specific operations (addition, multiplication) but not general aggregation.

For v1: explicitly out of scope. Mention in design as a v3 path if Raxx wants server-side computation over E2E-encrypted data without ever decrypting it.


5. Data to Shadow-Copy: Proposed Concrete List

Permitted — aggregate, anonymized signals

| Signal | Granularity | Anonymization | Rationale |
|---|---|---|---|
| Strategy family usage | Weekly counts per family per user | Client pre-aggregates; k-floor suppression | Strategy recommendation engine |
| Win-rate bucket | Bucket: <30%, 30-50%, 50-70%, >70% | Bucket only, no raw value | Cohort comparison |
| Hold-period distribution | Bucket: intraday / 1-3d / 1wk / 1mo+ | Bucket only | Regime analysis |
| Regime at entry | Binary: VIX>20 vs VIX≤20 at trade open | Exact VIX not shared | Regime-pattern discovery |
| Strategy parameter change patterns | Which parameter (delta, expiry, width) was adjusted; not to what value | Field name only | UX improvement |
| Error/failure event types | Error code + flow identifier | No content, no trade amounts | UX improvement, error rate telemetry |
| Paper-to-live transition event | Did user graduate to live mode (yes/no) | Boolean | Conversion funnel |
| Session feature engagement | Which top-level features used per week | Feature ID only, no content | Product usage |

Prohibited — must never appear in shadow store

| Data | Reason |
|---|---|
| Specific trade prices | Re-identifiable via public order flow |
| Specific positions (ticker, quantity) | PII + commercially sensitive |
| P&L values (raw) | Re-identifiable; sensitive |
| Ticker symbols used | Identifiable; commercially sensitive |
| Exact trade timestamps | Correlatable to public order flow |
| IP address or device fingerprint | Direct PII |
| Email or user ID | Direct link to encrypted store |
| Any content from strategy configurations | E2E encrypted; must stay that way |
| Passkey credential IDs | Authentication-related PII |

6. Consent Model

6.1 Opt-In Flow

Default: shadow analytics are OFF. Raxx does not infer consent from account creation, from Founders trial enrollment, or from product use. A user must take an affirmative action.

The opt-in appears in account settings, post-onboarding. The UI must show:

- What is collected (list from §5, plain language)
- What it is used for (product improvement, aggregate recommendations)
- What is NOT collected (trade prices, positions, P&L, tickers)
- How to withdraw and what happens when you do

6.2 Granular Toggles

Three independent consent scopes:

| Scope | Default | What it covers |
|---|---|---|
| analytics.strategy_patterns | OFF | Strategy family usage, win-rate bucket, hold-period, regime |
| analytics.product_usage | OFF | Feature engagement, session patterns |
| analytics.error_telemetry | OFF | Error codes and flow identifiers |

Each scope has an independent toggle. Turning off one scope does not reset others.

6.3 Consent Schema

analytics_consent
  id              TEXT PK
  user_id         TEXT FK -> users.id ON DELETE CASCADE
  scope           TEXT NOT NULL  -- 'strategy_patterns' | 'product_usage' | 'error_telemetry'
  granted_at      TIMESTAMP NOT NULL
  withdrawn_at    TIMESTAMP NULL
  granted_version TEXT NOT NULL  -- privacy policy version at time of grant
  source          TEXT NOT NULL  -- 'settings_ui' | 'onboarding_prompt'

Consent records are immutable and append-only. A withdrawal creates a new row with withdrawn_at set; it does not delete the grant row (needed for GDPR accountability proof under Art. 7(1)).
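A minimal sqlite3 sketch of how the append-only table yields an effective-consent check. It assumes each new row (grant or withdrawal) carries a later granted_at than the row it supersedes, so ordering by granted_at makes the latest row win; the helper name is illustrative:

```python
import sqlite3

def has_consent(db, user_id, scope):
    """Effective consent is the most recent analytics_consent row for the
    (user, scope) pair; it is active only if withdrawn_at is NULL.
    No row at all means the default: consent OFF."""
    row = db.execute(
        """SELECT withdrawn_at FROM analytics_consent
           WHERE user_id = ? AND scope = ?
           ORDER BY granted_at DESC LIMIT 1""",
        (user_id, scope),
    ).fetchone()
    return row is not None and row[0] is None
```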

6.4 Withdrawal and Shadow-Data Deletion

On scope withdrawal:

  1. Client stops submitting signals for that scope immediately.
  2. A DELETE /api/analytics/shadow request (authenticated) queues a delete job.
  3. The analytics store hard-deletes all rows associated with the shadow pseudonym within 30 days.
  4. The shadow pseudonym token is rotated so new contributions can't be linked to deleted history (if the user later re-opts in).
  5. audit_log records: analytics.consent.withdrawn, analytics.shadow.delete_queued, analytics.shadow.delete_completed.
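The hard-delete-plus-audit step might look like the following sqlite3 sketch. The audit_log columns and the job plumbing shown here are illustrative, not the real schema:

```python
import sqlite3

def run_shadow_delete_job(analytics_db, main_db, pseudonym, scope):
    """Hard-delete a withdrawn user's shadow rows from the separate
    analytics store, then record completion in the main DB's audit_log.
    Note the two connections: the stores share no DB file."""
    cur = analytics_db.execute(
        "DELETE FROM analytics_events WHERE pseudonym = ? AND scope = ?",
        (pseudonym, scope),
    )
    analytics_db.commit()
    main_db.execute(
        "INSERT INTO audit_log (event, detail) VALUES (?, ?)",
        ("analytics.shadow.delete_completed", f"{scope}: {cur.rowcount} rows"),
    )
    main_db.commit()
    return cur.rowcount
```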

6.5 GDPR Art. 7 Compliance

GDPR Art. 7 requires:

- Freely given: the service is fully functional without consent. No dark patterns. ✓
- Specific: granular scopes. ✓
- Informed: plain-language explainer required in UI. Feature-developer responsibility.
- Unambiguous: active opt-in checkbox (no pre-ticked boxes). ✓
- Withdrawable at any time without detriment: withdrawal does not degrade service features. ✓

6.6 CCPA Mapping

Shadow analytics data is used for product improvement, not sold to third parties. Under CCPA, this is "internal research" and does not trigger "do not sell." However, if Raxx ever shares aggregate analytics with third parties (investors, partners), a "do not sell" mechanism is needed. This is a product decision outside v1 scope; flag it.


7. Architecture: The Shadow Pipeline

7.1 Components

                          ┌───────────────────────────────────┐
                          │         Antlers (browser)         │
                          │                                   │
   Passkey assertion ────►│  PRF key derivation               │
                          │       │                           │
                          │       ▼                           │
                          │  Encrypted user data store        │
                          │  (Raptor can't read this)         │
                          │       │                           │
                          │       ▼                           │
                          │  Shadow Aggregator (client-side)  │
                          │  - inspect raw events             │
                          │  - apply bucketing + k-floor      │
                          │  - attach shadow pseudonym        │
                          │  - serialize as analytics payload │
                          └─────────────┬─────────────────────┘
                                        │
                          (HTTPS; no user auth header)
                                        │
                          ┌─────────────▼─────────────────────┐
                          │  Analytics API (separate service) │
                          │  - no DB foreign key to users     │
                          │  - no IP logged                   │
                          │  - pseudonym-keyed rows only      │
                          │  - DP noise at query layer        │
                          └─────────────┬─────────────────────┘
                                        │
                          ┌─────────────▼─────────────────────┐
                          │  Analytics Store (separate DB)    │
                          │  - no PII columns                 │
                          │  - no FK to users table           │
                          │  - schema enforces anonymization  │
                          └───────────────────────────────────┘

7.2 Shadow Pseudonym

The shadow pseudonym is derived client-side:

shadow_pseudonym = HKDF(
  ikm   = PRF_output,
  salt  = "raxx-shadow-v1",
  info  = "analytics-pseudonym",
  len   = 32 bytes
)

Properties:

- Derived from the same PRF root as the encryption key, but via a distinct HKDF expansion so the two keys are cryptographically independent.
- Deterministic per passkey per analytics epoch (rotate by changing the info string on withdrawal and re-grant).
- Never transmitted to Raptor's main DB. Only submitted to the Analytics API.
- The server cannot link a shadow pseudonym to a user account because the PRF output is never known to the server.

Limitation: multi-device behavior depends on the passkey type. Platform passkeys synced via iCloud Keychain / Google Password Manager are a single credential, so every device derives the same pseudonym. Roaming authenticators are distinct credentials, so each physical key derives a different pseudonym and appears as a separate contributor in the analytics store. This is a known limitation; document it; do not engineer around it in v1.
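The derivation above, written out with the Python standard library (RFC 5869 HKDF-SHA256, one 32-byte expand block). Folding an epoch counter into the info string is one way, chosen here for illustration, to implement the rotation-on-re-grant described above:

```python
import hashlib
import hmac

def derive_shadow_pseudonym(prf_output, epoch=1):
    """HKDF-SHA256 per RFC 5869: extract with the fixed salt, then a single
    expand block (32 bytes == one SHA-256 output, so only T(1) is needed).
    Changing the epoch changes the info string, rotating the pseudonym."""
    salt = b"raxx-shadow-v1"
    info = b"analytics-pseudonym-epoch-%d" % epoch
    prk = hmac.new(salt, prf_output, hashlib.sha256).digest()        # extract
    return hmac.new(prk, info + b"\x01", hashlib.sha256).digest()    # expand, T(1)
```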

7.3 Analytics Payload Schema

analytics_events
  id              TEXT PK (uuid v4, server-generated at receipt)
  pseudonym       BLOB NOT NULL  (32 bytes; no FK; no index on pseudonym alone)
  scope           TEXT NOT NULL
  event_type      TEXT NOT NULL
  payload_json    TEXT NOT NULL  (bucketed, no PII; schema-validated at intake)
  epoch_week      TEXT NOT NULL  (ISO week, e.g. "2026-W17")
  received_at     TIMESTAMP NOT NULL
  -- No user_id column. No email column. No IP column.

The Analytics API validates payload_json against a strict per-event-type schema before inserting. Any field not explicitly permitted by the schema is rejected at the API layer (not silently dropped — rejected with 400 so the client knows the payload was malformed).
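Strict allowlist validation might look like the following sketch; the event type and field names are illustrative:

```python
ALLOWED_FIELDS = {
    # Per-event-type field allowlists. Anything not listed is rejected.
    "weekly_family_regime_counts": {"family", "regime", "count"},
}

def validate_payload(event_type, payload):
    """Reject, rather than silently drop, anything outside the allowlist.
    The API layer would map ValueError to an HTTP 400 response so the
    client learns its payload was malformed."""
    allowed = ALLOWED_FIELDS.get(event_type)
    if allowed is None:
        raise ValueError(f"unknown event_type: {event_type!r}")
    unknown = set(payload) - allowed
    if unknown:
        raise ValueError(f"unpermitted fields: {sorted(unknown)}")
```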

7.4 Sequence: Opt-In Shadow Submission

sequenceDiagram
    participant U as User (browser)
    participant SA as Shadow Aggregator (client JS)
    participant AR as Antlers (React)
    participant RP as Raptor (main API)
    participant AN as Analytics API (separate)

    U->>AR: Opens Settings → opts in to strategy_patterns scope
    AR->>RP: POST /api/analytics/consent {scope, granted}
    RP->>RP: Insert analytics_consent row; write audit_log
    RP-->>AR: 200 OK

    Note over SA: On next trade event (client-side)
    SA->>SA: Observe raw trade event
    SA->>SA: Apply bucketing (strategy family, win-rate bucket, regime)
    SA->>SA: Suppress if cohort size < k (client-side estimate)
    SA->>SA: Derive shadow_pseudonym from PRF output
    SA->>SA: Build analytics payload (no PII, bucketed only)
    SA->>AN: POST /api/shadow/events {pseudonym, scope, event_type, payload, epoch_week}
    Note over AN: No session cookie. No auth header. Pseudonym is the only identity.
    AN->>AN: Validate payload schema (reject unknown fields)
    AN->>AN: Insert analytics_events row
    AN-->>SA: 201 Created

7.5 Sequence: Withdrawal + Shadow Delete

sequenceDiagram
    participant U as User
    participant AR as Antlers
    participant RP as Raptor
    participant AN as Analytics API

    U->>AR: Settings → Withdraw consent for strategy_patterns
    AR->>RP: POST /api/analytics/consent {scope, withdrawn}
    RP->>RP: Insert analytics_consent row (withdrawn_at); write audit_log
    RP->>RP: Enqueue analytics.shadow.delete job (pseudonym, scope)
    RP-->>AR: 200 OK
    AR->>AR: Shadow Aggregator stops emitting for that scope immediately

    Note over RP: Within 30 days (GDPR Art. 17)
    RP->>AN: DELETE /api/shadow/pseudonym/{pseudonym}/scope/{scope}
    AN->>AN: Hard-delete all matching analytics_events rows
    AN-->>RP: 204 No Content
    RP->>RP: Write audit_log: analytics.shadow.delete_completed
    RP->>RP: Rotate shadow pseudonym epoch for this user

7.6 Threat Model

What an attacker who fully breaches the analytics store can learn:

- Pseudonym-keyed weekly bucket counts per consent scope (strategy family usage, win-rate buckets, hold-period buckets, regime flags, error codes, feature IDs), tagged by ISO week.
- Which pseudonyms contributed in which weeks (the pseudonym is stable within an epoch, so a contributor's weekly series is linkable to itself).

What they cannot learn:

- Any individual trade: ticker, price, quantity, P&L, or exact timestamp. Individual records never existed in the store (§4.4, §5).
- Which user account a pseudonym belongs to: the store has no user_id, email, or IP columns, and the pseudonym derives from a PRF output the server never sees.

Re-identification risk surface:

- A bucket with a single contributor in a given week (count=1) is a timing-correlation risk; mitigated by the k-floor suppression rule (§4.1).
- Auxiliary-data joins (e.g., against public order flow) fail because prices, tickers, and exact timestamps are prohibited fields (§5).

Raxx operator risk (insider threat):

Even an operator with full DB access cannot link analytics records to user accounts without access to the PRF output of the user's passkey — which is never stored. An operator trying to re-identify users would need: (a) the analytics store, (b) the consent table (which records pseudonyms only indirectly — the pseudonym isn't stored there either), and (c) the user's physical authenticator. This is the intended posture.


8. Migrations

If the shadow analytics path is chosen:

  1. New tables: analytics_consent in the main Raptor DB. No schema changes to existing tables.
  2. New service: Analytics API is a separate Raptor blueprint or standalone service with a separate SQLite (or Postgres, at scale) DB. The key invariant: no shared DB file with the main Raptor DB.
  3. Migration 0002_analytics_consent.sql: Creates analytics_consent table. No data migration (new capability; no existing consent records).
  4. Rollback: Drop analytics_consent table; shut down Analytics API. No user data affected (analytics data is separate; consent records are append-only and deletion is safe).

9. Rollout Plan

| Phase | Description | Gate |
|---|---|---|
| Dark | Analytics API deployed; no client-side shadow aggregator active; feature flag SHADOW_ANALYTICS=off | Internal review of schema + threat model |
| Internal beta | Shadow aggregator enabled for 5–10 internal accounts; consent UI visible; verify delete pipeline | 30 days of clean operation; no re-identification events in threat model review |
| Founders beta | Opt-in available to Founders cohort; prominent in settings | Founders consent and withdrawal tested end-to-end |
| GA | All users see opt-in prompt post-login | Analytics store has k≥20 opted-in users before publishing any aggregate |

Kill switch: SHADOW_ANALYTICS=off env flag disables the consent UI, the shadow aggregator client-side load, and the Analytics API intake endpoint simultaneously. Existing analytics data is untouched; it just stops growing.
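The flag check itself is trivial; the point is that the consent UI, the aggregator loader, and the intake endpoint all consult the same helper (sketch; the helper name is illustrative, the env var name is from this doc):

```python
import os

def shadow_analytics_enabled():
    """Single kill switch for the whole shadow pipeline. Defaults to off:
    an unset or unrecognized value disables the feature."""
    return os.environ.get("SHADOW_ANALYTICS", "off").lower() == "on"
```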


10. Security Considerations


11. Tradeoff Analysis

Features this enables (that pure E2E blocks)

| Feature | Notes |
|---|---|
| Cohort-level strategy recommendations | "Users with similar strategy patterns succeeded with credit spreads in low-IV environments" |
| UX error-rate improvement | Which flows have the highest error rates, and which user cohorts hit them |
| Regime-pattern discovery | Do high-VIX entries outperform? Aggregate answer becomes possible |
| Aggregate leaderboards (#211) | Leaderboard data can be drawn from opted-in shadow store rather than encrypted main store |
| Business analytics | Strategy-family popularity, cohort retention by engagement pattern |
| Investor/product reporting | Aggregate usage data (no PII) defensible to investors |

Features this STILL blocks (even with shadow pipeline)

| Feature | Why still blocked |
|---|---|
| Per-user support debugging | Support still cannot see actual trades; only shadow-store aggregates |
| Personalized recommendations based on individual history | Shadow store has weekly buckets, not individual history |
| Regulatory discovery of a specific user's trade content | Raxx still cannot decrypt user data |
| Backtest runs on server using user's own data | Compute still must be client-side or user-authorized per-operation |
| Recovery-time data access | If user loses passkey, their E2E data is still unrecoverable from server |

What shadow pipeline does NOT change

The pure E2E invariant remains. Raxx cannot read individual user trade content. The shadow pipeline is an additive opt-in path; it does not weaken the base encryption posture. A user who never opts in has identical privacy to a pure E2E system.


12. Alternatives Considered (see ADR 0017 for full treatment)

| Alternative | Privacy cost | Product cost | Implementation cost |
|---|---|---|---|
| Pure E2E, no analytics | None | High — blocks all aggregate insight | Low — simpler architecture |
| Mandatory shadow pipeline (all users) | Moderate — all users share data | Low — full analytics available | Medium |
| Client-side telemetry only, no server aggregation | Low | Medium — crash data only, no behavioral insight | Low |
| Server-side columnar encryption (Raxx-held keys) | High — operator can always read with key | Low — full analytics available | Medium — envelope encryption |
| Opt-in shadow pipeline (this design) | Low for non-opted-in users; moderate for opted-in users | Medium — aggregate insight only for opted-in cohort | High — two services, consent pipeline, delete pipeline |

13. Open Questions (Require Kristerpher's Decision)

  1. ADR 0017 path decision. Which of the four alternatives does Kristerpher select? (see ADR 0017 §Decision placeholder)

  2. Analytics store infrastructure. Separate SQLite DB (simple, single-host) or separate Postgres (scalable, operationally heavier)? At <10,000 opted-in users, SQLite is sufficient. At Raxx's current scale, SQLite is the right call; revisit at 50,000 MAU.

  3. k-anonymity floor value. k=20 proposed. Is this high enough? Lower k means more data published but higher re-identification risk. Higher k means less data but stronger privacy guarantee. Decision depends on expected cohort sizes at launch.

  4. Privacy budget ε for DP. ε=1–2 per domain per week is a reasonable starting point. Formal DP requires a privacy-budget accounting system. Should that system be implemented in v1, or is k-anonymity alone acceptable as the v1 anonymization guarantee?

  5. Analytics service boundary. Separate deployed service (own process, own port) vs. a Blueprint within Raptor with separate DB? Separate service is cleaner isolation; Blueprint is simpler to deploy. Recommend separate service; Kristerpher should confirm.

  6. Pseudonym rotation on re-opt-in. When a user withdraws and later re-opts-in, the pseudonym rotates (new epoch). This means the new analytics series can't be linked to the old one by the server, but it also means cohort continuity breaks. Is this the right privacy-vs-analytics tradeoff?

  7. DPIA requirement. Processing behavioral data (even aggregate, even anonymized) at scale for product improvement may require a DPIA under GDPR Art. 35, particularly if Raxx crosses the "systematic large-scale processing" threshold. At Founders scale this likely doesn't apply; revisit at GA with a formal DPIA.