Status: Draft — pending Kristerpher's path decision (see ADR 0017)
Owner: software-architect
Date: 2026-04-24
Parent issue: #250 — Passkey-keyed end-to-end encryption for customer data
Related ADRs: 0001, 0002, 0003, 0017
Issue #250 proposes passkey-keyed E2E encryption for all customer data. The consequential downside: Raxx can't read user trade content, which kills all server-side analytics on actual trade patterns — cohort insights, aggregate strategy recommendations, regime-pattern discovery, leaderboards (#211), and product-improvement telemetry on real behavior.
This doc evaluates a middle path: E2E encryption stays as the base invariant. Users who want to contribute to the analytics pool can opt in to a client-side shadow pipeline that anonymizes and aggregates their signals before submitting them to a separate, operator-readable analytics store. No individual trade data ever leaves the device in plaintext. The operator learns aggregate patterns; individual behavior stays private.
This is a research-and-design document. It does not authorize implementation. ADR 0017 records the architecture decision.
The following invariants from the project-level constitution apply here with full force:
These include `audit_log`. Any architecture that weakens these is a violation, not a tradeoff.
Signal's design insight: separate the "what" (message content, E2E encrypted) from the "that" (communication metadata). For metadata resistance, Signal uses techniques such as sealed sender and enclave-based private contact discovery, so even contact discovery doesn't reveal a user's address book to the server. Relevant pattern: the operator learns nothing about individual interactions while still running the service. Signal does not run aggregate analytics on user behavior at all — they accept the product cost as brand differentiation. That is one legitimate choice.
Apple (2017, "Learning with Privacy at Scale"; WWDC 2017 session) collects keyboard usage, emoji frequency, Health type preferences, and Safari crash data using Local Differential Privacy (LDP). The core technique: before any data leaves the device, the client applies a randomized-response-style mechanism (count-mean-sketch variants) that adds calibrated noise. The server sees aggregate histograms but cannot recover any individual's input even with access to all submitted records.
Concrete limits Apple respects: k≥100,000 users before publishing any aggregate (cohort floor). Privacy budget ε ≈ 4 per day per feature domain (moderate privacy). Apple describes the system in its "Learning with Privacy at Scale" paper; the implementation itself is not open-sourced.
Raxx scale concern: Apple's privacy budget works at hundreds of millions of devices. Raxx's user base in v1 will be in the thousands. At small cohorts, DP noise must be much larger (lower ε, i.e., more noise) to prevent inference attacks, which may make aggregate data useless. This is a quantitative decision that depends on actual cohort sizes.
Google's 2014 RAPPOR paper (Erlingsson, Pihur, Korolova) is the canonical client-side LDP implementation. It uses Bloom filter encoding + Permanent Randomized Response + Instantaneous Randomized Response to collect frequency histograms of categorical values (e.g., "what is your default browser?") with a controlled privacy budget.
RAPPOR is well-suited to categorical signals (strategy family, error type, win-rate bucket). It performs poorly on continuous numeric signals (P&L, hold period in exact minutes). For continuous signals, bucketing into ≤10 categories first, then applying RAPPOR, is the standard approach.
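A minimal sketch of the bucket-then-randomize approach: continuous values are mapped into a small set of categories, then a basic k-ary randomized response (a simpler ε-LDP mechanism than full RAPPOR) is applied before anything leaves the device. Bucket boundaries and names are illustrative, not from the Raxx codebase.

```typescript
// Illustrative hold-period buckets (boundaries are assumptions for the sketch).
const HOLD_PERIOD_BUCKETS: readonly string[] = ["intraday", "1-3d", "1wk", "1mo+"];

// Map a continuous hold period (hours) into one of <=10 categories first.
function bucketHoldPeriod(hours: number): string {
  if (hours < 24) return "intraday";
  if (hours < 24 * 4) return "1-3d";
  if (hours < 24 * 30) return "1wk";
  return "1mo+";
}

// k-ary randomized response: report the true bucket with probability
// e^eps / (e^eps + k - 1), otherwise a uniformly random other bucket.
// The server can debias the aggregate histogram; no individual report
// reveals the true value with certainty.
function randomizedResponse(trueBucket: string, eps: number): string {
  const k = HOLD_PERIOD_BUCKETS.length;
  const pTruth = Math.exp(eps) / (Math.exp(eps) + k - 1);
  if (Math.random() < pTruth) return trueBucket;
  const others = HOLD_PERIOD_BUCKETS.filter((b) => b !== trueBucket);
  return others[Math.floor(Math.random() * others.length)];
}
```

At ε = 2 and k = 4 the truth probability is about 0.71, which illustrates why small cohorts need many respondents before the debiased histogram is usable.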
Google's federated learning (2017, McMahan et al.) keeps training data on-device; only model gradients are aggregated server-side. This is more relevant for a recommendation engine than for analytics, but the boundary it enforces — gradients, not data — is the right mental model for Raxx's shadow pipeline: aggregates, not records.
For Raxx v1, full federated learning is out of scope (requires a model training infrastructure). But the "client computes the aggregate locally, sends only the result" pattern is directly applicable.
Proton's analytics approach: they collect zero content analytics. They do collect operational metrics (delivery success rates, storage usage, uptime) at the infrastructure layer, never at the content layer. Their transparency reports disclose counts of legal requests received, not user behavior data. They accept this as brand-defining.
Proton runs a separate, privacy-isolated "product usage statistics" feature (opt-in) that sends event types (clicked feature X) but never content. Events are stripped of user ID before storage. This is functionally identical to Raxx's proposed shadow pipeline — same structure, different domain.
Standard Notes uses zero-knowledge architecture for all note content. For analytics they rely entirely on opt-in crash reporting (Sentry, user-initiated). They do no behavioral analytics. They accept the product cost. Their reasoning: "We can't improve what we can't see, so we ship conservatively and ask users explicitly."
Tuta encrypts all email content and metadata (subject, sender, recipient in the body) client-side. For product analytics they ship Matomo (self-hosted) with IP anonymization and have users opt in. They do not correlate analytics events to user content. Their shadow analytics pattern: session-scoped pseudonymous identifier (resets on logout) + event type + no content fields. This is directly applicable to Raxx.
DuckDuckGo collects aggregate query counts (what was searched, not who searched it) using hash bucketing to prevent single-query identification. Their key constraint: no persistent user identifier exists; every anonymous search is disconnected from every other. This is more radical than Raxx needs — Raxx has authenticated users who consent to aggregated analytics.
OHTTP (Oblivious HTTP, RFC 9458) inserts a relay between client and server so the server never sees the originating IP. Cloudflare's Privacy Gateway implements the relay role; Apple's iCloud Private Relay applies the same two-hop separation principle with its own protocol stack. Relevant if Raxx wants analytics submissions that can't be correlated by IP — but at Raxx's scale, the implementation cost exceeds the marginal privacy gain. Not recommended for v1; note as a future option.
A dataset satisfies k-anonymity if every record is indistinguishable from at least k-1 others on the quasi-identifiers. For cohort analytics (strategy family, win-rate bucket, hold period bucket), this means: do not publish any bucket with fewer than k users.
Recommended k: 20 for v1. If a strategy family has fewer than 20 users who opted in, suppress that bucket entirely. This is a simple server-side enforcement rule.
Limitation: K-anonymity does not protect against linkage attacks or homogeneity attacks (if all k records in a bucket have the same value, the suppression is trivially defeated). Pair with data minimization (don't store anything not needed for the specific aggregate).
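The k-floor rule above is a one-line filter at publish time. A minimal sketch, with assumed row names (not the Raxx schema):

```typescript
// One aggregate row per bucket (e.g., per strategy family).
interface BucketCount {
  bucket: string;    // quasi-identifier combination, e.g. "credit-spread"
  userCount: number; // distinct opted-in contributors in the bucket
}

const K_FLOOR = 20; // proposed v1 value (see above)

// Suppress any bucket with fewer than k contributors before it is
// published or queryable. Server-side, at the query/publish layer.
function suppressSmallBuckets(rows: BucketCount[], k: number = K_FLOOR): BucketCount[] {
  return rows.filter((r) => r.userCount >= k);
}
```

The design choice worth noting: suppression happens on read/publish, not on write, so raising k later retroactively tightens what is published without touching stored aggregates.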
DP adds calibrated noise to query results so that the presence or absence of any single record changes the output only imperceptibly. The formal guarantee: for all adjacent datasets D and D' (differing by one record), the probability of any output O satisfies P[M(D)=O] / P[M(D')=O] ≤ exp(ε).
For Raxx: DP is most useful for continuous analytics (how many users opened a position in high-VIX regimes? what is the average win rate across strategy families?). With a privacy budget of ε=1 per query domain per week, the noise added would be detectable but statistically manageable at >1,000 respondents. Below 500 respondents, results may be noise-dominated and analytically useless.
Recommendation: Implement DP at the query layer in the analytics service (not client-side), using the Laplace mechanism for count queries and the Gaussian mechanism for histogram queries. At v1 scale, DP is a "ship-ready" guarantee primarily useful for future audit defense, not necessarily for meaningful noise reduction.
Privacy budget management: Each analytics query type consumes budget. A weekly reset with per-domain ceilings (strategy analytics ε=2/week, error analytics ε=1/week) is workable. This must be enforced at the analytics API layer.
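A sketch of the query-layer pieces described above: Laplace noise for count queries (a count has L1 sensitivity 1, so scale b = 1/ε gives ε-DP) plus a naive per-domain weekly budget ledger. The domain names and ceilings mirror the examples above; everything else is illustrative.

```typescript
// Sample from Laplace(0, scale) via inverse-CDF: u uniform in (-0.5, 0.5).
function laplaceSample(scale: number): number {
  const u = Math.random() - 0.5;
  return -scale * Math.sign(u) * Math.log(1 - 2 * Math.abs(u));
}

// eps-DP count: sensitivity of a counting query is 1.
function dpCount(trueCount: number, eps: number): number {
  const sensitivity = 1;
  return trueCount + laplaceSample(sensitivity / eps);
}

// Per-domain weekly budget ceilings (illustrative; reset weekly by a job).
const weeklyBudget: Record<string, number> = { strategy: 2, error: 1 };

// Deduct eps from the domain's remaining budget; refuse the query if
// the budget is exhausted. Enforced at the analytics API layer.
function spendBudget(domain: string, eps: number): boolean {
  if ((weeklyBudget[domain] ?? 0) < eps) return false;
  weeklyBudget[domain] -= eps;
  return true;
}
```

In a real service the ledger would live in the analytics DB, not process memory; the point of the sketch is that budget accounting is a precondition check on every query, not an afterthought.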
Hash bucketing is suitable for categorical identifiers (strategy family name → hash). It is useless for numeric signals (hashing a P&L value doesn't anonymize it — the distribution remains identifiable). Do not use it for continuous values.
The simplest and most reliable technique: the client never sends individual records. It pre-aggregates over a time window (weekly), buckets continuous values, and sends only the bucket counts. The server receives "this user opened 3 trades in the credit-spread family at VIX>20 this week" rather than the three individual trades.
This is the recommended primary technique for Raxx's shadow pipeline. It requires no special crypto, is easy to audit, and produces the most useful signal at small scale.
What it protects against: The server cannot reconstruct individual trades even if the analytics store is fully breached, because individual records never existed there.
What it doesn't protect against: Timing correlation (if only one user traded in a given bucket this week, the bucket count=1 row is a de-anonymization event). Mitigated by k-anonymity floor: suppress buckets with count<k.
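The pre-aggregation step can be sketched as a pure client-side function. The event shape and field names here are assumptions for illustration; the invariant is that only bucket counts, never individual trades, appear in the output.

```typescript
// Raw event as seen on-device; never leaves the device in this form.
interface TradeEvent {
  strategyFamily: string;
  vixAtEntry: number;
}

// Weekly payload: bucket counts only.
interface WeeklyAggregate {
  epochWeek: string;              // ISO week, e.g. "2026-W17"
  counts: Record<string, number>; // "family|regime" -> count
}

function aggregateWeek(events: TradeEvent[], epochWeek: string): WeeklyAggregate {
  const counts: Record<string, number> = {};
  for (const e of events) {
    // Binary regime bucket only; the exact VIX value is discarded.
    const regime = e.vixAtEntry > 20 ? "VIX>20" : "VIX<=20";
    const key = `${e.strategyFamily}|${regime}`;
    counts[key] = (counts[key] ?? 0) + 1;
  }
  // No prices, tickers, timestamps, or P&L in the output.
  return { epochWeek, counts };
}
```

A server-side k-floor check still applies on top of this, since a bucket with count 1 from a lone contributor is itself identifying.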
Allows the server to compute on ciphertext without decrypting. Fully Homomorphic Encryption (FHE) remains impractically slow for most real-time use cases (though libraries like Microsoft SEAL and OpenFHE are maturing). Partially Homomorphic Encryption (PHE) is fast for specific operations (addition or multiplication, but not both) and does not support general aggregation.
For v1: explicitly out of scope. Mention in design as a v3 path if Raxx wants server-side computation over E2E-encrypted data without ever decrypting it.
| Signal | Granularity | Anonymization | Rationale |
|---|---|---|---|
| Strategy family usage | Weekly counts per family per user | Client pre-aggregates; k-floor suppression | Strategy recommendation engine |
| Win-rate bucket | Bucket: <30%, 30-50%, 50-70%, >70% | Bucket only, no raw value | Cohort comparison |
| Hold-period distribution | Bucket: intraday / 1-3d / 1wk / 1mo+ | Bucket only | Regime analysis |
| Regime at entry | Binary: VIX>20 vs VIX≤20 at trade open | Exact VIX not shared | Regime-pattern discovery |
| Strategy parameter change patterns | Which parameter (delta, expiry, width) was adjusted; not to what value | Field name only | UX improvement |
| Error/failure event types | Error code + flow identifier | No content, no trade amounts | UX improvement, error rate telemetry |
| Paper-to-live transition event | Did user graduate to live mode (yes/no) | Boolean | Conversion funnel |
| Session feature engagement | Which top-level features used per week | Feature ID only, no content | Product usage |
| Data | Reason |
|---|---|
| Specific trade prices | Re-identifiable via public order flow |
| Specific positions (ticker, quantity) | PII + commercially sensitive |
| P&L values (raw) | Re-identifiable; sensitive |
| Ticker symbols used | Identifiable; commercially sensitive |
| Exact trade timestamps | Correlatable to public order flow |
| IP address or device fingerprint | Direct PII |
| Email or user ID | Direct link to encrypted store |
| Any content from strategy configurations | E2E encrypted; must stay that way |
| Passkey credential IDs | Authentication-related PII |
Default: shadow analytics are OFF. Raxx does not infer consent from account creation, from Founders trial enrollment, or from product use. A user must take an affirmative action.
The opt-in appears in account settings, post-onboarding. UI must show:

- What is collected (list from §5, plain language)
- What it is used for (product improvement, aggregate recommendations)
- What is NOT collected (trade prices, positions, P&L, tickers)
- How to withdraw and what happens when you do
Three independent consent scopes:
| Scope | Default | What it covers |
|---|---|---|
| `analytics.strategy_patterns` | OFF | Strategy family usage, win-rate bucket, hold-period, regime |
| `analytics.product_usage` | OFF | Feature engagement, session patterns |
| `analytics.error_telemetry` | OFF | Error codes and flow identifiers |
Each scope has an independent toggle. Turning off one scope does not reset others.
```
analytics_consent
  id               TEXT PK
  user_id          TEXT FK -> users.id ON DELETE CASCADE
  scope            TEXT NOT NULL   -- 'strategy_patterns' | 'product_usage' | 'error_telemetry'
  granted_at       TIMESTAMP NOT NULL
  withdrawn_at     TIMESTAMP NULL
  granted_version  TEXT NOT NULL   -- privacy policy version at time of grant
  source           TEXT NOT NULL   -- 'settings_ui' | 'onboarding_prompt'
```
Consent records are immutable and append-only. A withdrawal creates a new row with withdrawn_at set; it does not delete the grant row (needed for GDPR accountability proof under Art. 7(1)).
On scope withdrawal:
1. Client stops submitting signals for that scope immediately.
2. A DELETE /api/analytics/shadow request (authenticated) queues a delete job.
3. The analytics store hard-deletes all rows associated with the shadow pseudonym within 30 days.
4. The shadow pseudonym token is rotated so new contributions can't be linked to deleted history (if user later re-opts-in).
5. audit_log records: analytics.consent.withdrawn, analytics.shadow.delete_queued, analytics.shadow.delete_completed.
GDPR Art. 7 requires:

- Freely given: the service is fully functional without consent. No dark patterns. ✓
- Specific: granular scopes. ✓
- Informed: plain-language explainer required in UI. Feature-developer responsibility.
- Unambiguous: active opt-in checkbox (no pre-ticked boxes). ✓
- Withdrawable at any time without detriment: withdrawal does not degrade service features. ✓
Shadow analytics data is used for product improvement, not sold to third parties. Under CCPA, this is "internal research" and does not trigger "do not sell." However, if Raxx ever shares aggregate analytics with third parties (investors, partners), a "do not sell" mechanism is needed. This is a product decision outside v1 scope; flag it.
```
┌───────────────────────────────────┐
│ Antlers (browser)                 │
│                                   │
│ Passkey assertion                 │
│         │                         │
│         ▼                         │
│ PRF key derivation                │
│         │                         │
│         ▼                         │
│ Encrypted user data store         │
│ (Raptor can't read this)          │
│         │                         │
│         ▼                         │
│ Shadow Aggregator (client-side)   │
│  - inspect raw events             │
│  - apply bucketing + k-floor      │
│  - attach shadow pseudonym        │
│  - serialize analytics payload    │
└─────────────────┬─────────────────┘
                  │
   (HTTPS; no user auth header)
                  │
┌─────────────────▼─────────────────┐
│ Analytics API (separate service)  │
│  - no DB foreign key to users     │
│  - no IP logged                   │
│  - pseudonym-keyed rows only      │
│  - DP noise at query layer        │
└─────────────────┬─────────────────┘
                  │
┌─────────────────▼─────────────────┐
│ Analytics Store (separate DB)     │
│  - no PII columns                 │
│  - no FK to users table           │
│  - schema enforces anonymization  │
└───────────────────────────────────┘
```
The shadow pseudonym is derived client-side:
```
shadow_pseudonym = HKDF(
  ikm  = PRF_output,
  salt = "raxx-shadow-v1",
  info = "analytics-pseudonym",
  len  = 32 bytes
)
```
Properties:
- Derived from the same PRF root as the encryption key, but via a distinct HKDF expansion so the two keys are cryptographically independent.
- Deterministic per passkey per analytics epoch (rotate by changing the info string on withdrawal and re-grant).
- Never transmitted to Raptor's main DB. Only submitted to the Analytics API.
- Server cannot link shadow pseudonym to a user account because the PRF output is never known to the server.
Limitation: If a user uses multiple passkeys (multiple devices), each device derives the same pseudonym from the same underlying passkey (for platform passkeys synced via iCloud Keychain / Google Password Manager). For roaming authenticators, each physical key derives a different pseudonym — these appear as separate contributors in the analytics store. This is a known limitation; document it; do not engineer around it in v1.
```
analytics_events
  id            TEXT PK (uuid v4, server-generated at receipt)
  pseudonym     BLOB NOT NULL (32 bytes; no FK; no index on pseudonym alone)
  scope         TEXT NOT NULL
  event_type    TEXT NOT NULL
  payload_json  TEXT NOT NULL (bucketed, no PII; schema-validated at intake)
  epoch_week    TEXT NOT NULL (ISO week, e.g. "2026-W17")
  received_at   TIMESTAMP NOT NULL
  -- No user_id column. No email column. No IP column.
```
The Analytics API validates payload_json against a strict per-event-type schema before inserting. Any field not explicitly permitted by the schema is rejected at the API layer (not silently dropped — rejected with 400 so the client knows the payload was malformed).
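A minimal sketch of that reject-unknown-fields rule. The event types and allowlists here are assumptions; the point is that validation is an explicit allowlist and any extra field fails the whole payload with a 400.

```typescript
// Per-event-type allowlists (illustrative event types, not the real schema).
const EVENT_SCHEMAS: Record<string, Set<string>> = {
  strategy_family_weekly: new Set(["family", "regime", "count"]),
  error_event: new Set(["errorCode", "flowId"]),
};

type ValidationResult = { ok: true } | { ok: false; reason: string };

function validatePayload(
  eventType: string,
  payload: Record<string, unknown>,
): ValidationResult {
  const allowed = EVENT_SCHEMAS[eventType];
  if (!allowed) return { ok: false, reason: `unknown event_type: ${eventType}` };
  for (const key of Object.keys(payload)) {
    if (!allowed.has(key)) {
      // Reject (HTTP 400), never silently drop, so the client learns
      // its payload was malformed.
      return { ok: false, reason: `field not permitted: ${key}` };
    }
  }
  return { ok: true };
}
```

Rejecting rather than stripping also means a buggy client that starts attaching a PII field fails loudly in testing instead of leaking quietly in production.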
```mermaid
sequenceDiagram
    participant U as User (browser)
    participant SA as Shadow Aggregator (client JS)
    participant AR as Antlers (React)
    participant RP as Raptor (main API)
    participant AN as Analytics API (separate)
    U->>AR: Opens Settings → opts in to strategy_patterns scope
    AR->>RP: POST /api/analytics/consent {scope, granted}
    RP->>RP: Insert analytics_consent row; write audit_log
    RP-->>AR: 200 OK
    Note over SA: On next trade event (client-side)
    SA->>SA: Observe raw trade event
    SA->>SA: Apply bucketing (strategy family, win-rate bucket, regime)
    SA->>SA: Suppress if cohort size < k (client-side estimate)
    SA->>SA: Derive shadow_pseudonym from PRF output
    SA->>SA: Build analytics payload (no PII, bucketed only)
    SA->>AN: POST /api/shadow/events {pseudonym, scope, event_type, payload, epoch_week}
    Note over AN: No session cookie. No auth header. Pseudonym is the only identity.
    AN->>AN: Validate payload schema (reject unknown fields)
    AN->>AN: Insert analytics_events row
    AN-->>SA: 201 Created
```
```mermaid
sequenceDiagram
    participant U as User
    participant AR as Antlers
    participant RP as Raptor
    participant AN as Analytics API
    U->>AR: Settings → Withdraw consent for strategy_patterns
    AR->>RP: POST /api/analytics/consent {scope, withdrawn}
    RP->>RP: Insert analytics_consent row (withdrawn_at); write audit_log
    RP->>RP: Enqueue analytics.shadow.delete job (pseudonym, scope)
    RP-->>AR: 200 OK
    AR->>AR: Shadow Aggregator stops emitting for that scope immediately
    Note over RP: Within 30 days (GDPR Art. 17)
    RP->>AN: DELETE /api/shadow/pseudonym/{pseudonym}/scope/{scope}
    AN->>AN: Hard-delete all matching analytics_events rows
    AN-->>RP: 204 No Content
    RP->>RP: Write audit_log: analytics.shadow.delete_completed
    RP->>RP: Rotate shadow pseudonym epoch for this user
```
What an attacker who fully breaches the analytics store can learn:
What they cannot learn:
Re-identification risk surface:
Raxx operator risk (insider threat):
Even an operator with full DB access cannot link analytics records to user accounts without access to the PRF output of the user's passkey — which is never stored. An operator trying to re-identify users would need: (a) the analytics store, (b) the consent table (which does not store the pseudonym at all), and (c) the user's physical authenticator. This is the intended posture.
If the shadow analytics path is chosen:
- New table `analytics_consent` in the main Raptor DB. No schema changes to existing tables.
- Migration `0002_analytics_consent.sql`: creates the `analytics_consent` table. No data migration (new capability; no existing consent records).
- Rollback: drop the `analytics_consent` table; shut down the Analytics API. No user data affected (analytics data is separate; consent records are append-only and deletion is safe).

| Phase | Description | Gate |
|---|---|---|
| Dark | Analytics API deployed; no client-side shadow aggregator active; feature flag `SHADOW_ANALYTICS=off` | Internal review of schema + threat model |
| Internal beta | Shadow aggregator enabled for 5–10 internal accounts; consent UI visible; verify delete pipeline | 30 days of clean operation; no re-identification events in threat model review |
| Founders beta | Opt-in available to Founders cohort; prominent in settings | Founders consent and withdrawal tested end-to-end |
| GA | All users see opt-in prompt post-login | Analytics store has k≥20 opted-in users before publishing any aggregate |
Kill switch: SHADOW_ANALYTICS=off env flag disables the consent UI, the shadow aggregator client-side load, and the Analytics API intake endpoint simultaneously. Existing analytics data is untouched; it just stops growing.
- Audit: consent grants and withdrawals are written to `audit_log`. The Analytics API has its own append-only operation log.
- Kill switch: `SHADOW_ANALYTICS=off` — see §9.

| Feature | Notes |
|---|---|
| Cohort-level strategy recommendations | "Users with similar strategy patterns succeeded with credit spreads in low-IV environments" |
| UX error-rate improvement | Which flows have the highest error rates and how often they occur (no per-user attribution) |
| Regime-pattern discovery | Do high-VIX entries outperform? Aggregate answer becomes possible |
| Aggregate leaderboards (#211) | Leaderboard data can be drawn from opted-in shadow store rather than encrypted main store |
| Business analytics | Strategy-family popularity, cohort retention by engagement pattern |
| Investor/product reporting | Aggregate usage data (no PII) defensible to investors |
| Feature | Why still blocked |
|---|---|
| Per-user support debugging | Support still cannot see actual trades; only shadow-store aggregates |
| Personalized recommendations based on individual history | Shadow store has weekly buckets, not individual history |
| Regulatory discovery of a specific user's trade content | Raxx still cannot decrypt user data |
| Backtest runs on server using user's own data | Compute still must be client-side or user-authorized per-operation |
| Recovery-time data access | If user loses passkey, their E2E data is still unrecoverable from server |
The pure E2E invariant remains. Raxx cannot read individual user trade content. The shadow pipeline is an additive opt-in path; it does not weaken the base encryption posture. A user who never opts in has identical privacy to a pure E2E system.
| Alternative | Privacy cost | Product cost | Implementation cost |
|---|---|---|---|
| Pure E2E, no analytics | None | High — blocks all aggregate insight | Low — simpler architecture |
| Mandatory shadow pipeline (all users) | Moderate — all users share data | Low — full analytics available | Medium |
| Client-side telemetry only, no server aggregation | Low | Medium — crash data only, no behavioral insight | Low |
| Server-side columnar encryption (Raxx-held keys) | High — operator can always read with key | Low — full analytics available | Medium — envelope encryption |
| Opt-in shadow pipeline (this design) | Low for non-opted-in users; moderate for opted-in users | Medium — aggregate insight only for opted-in cohort | High — two services, consent pipeline, delete pipeline |
ADR 0017 path decision. Which of the five candidate paths (the four alternatives above plus this design) does Kristerpher select? (see ADR 0017 §Decision placeholder)
Analytics store infrastructure. Separate SQLite DB (simple, single-host) or separate Postgres (scalable, operationally heavier)? At <10,000 opted-in users, SQLite is sufficient. At Raxx's current scale, SQLite is the right call; revisit at 50,000 MAU.
k-anonymity floor value. k=20 proposed. Is this high enough? Lower k means more data published but higher re-identification risk. Higher k means less data but stronger privacy guarantee. Decision depends on expected cohort sizes at launch.
Privacy budget ε for DP. ε=1–2 per domain per week is a reasonable starting point. Formal DP requires a privacy-budget accounting system. Is that worth implementing in v1, or is k-anonymity alone acceptable as the v1 anonymization guarantee?
Analytics service boundary. Separate deployed service (own process, own port) vs. a Blueprint within Raptor with separate DB? Separate service is cleaner isolation; Blueprint is simpler to deploy. Recommend separate service; Kristerpher should confirm.
Pseudonym rotation on re-opt-in. When a user withdraws and later re-opts-in, the pseudonym rotates (new epoch). This means the new analytics series can't be linked to the old one by the server, but it also means cohort continuity breaks. Is this the right privacy-vs-analytics tradeoff?
DPIA requirement. Processing behavioral data (even aggregate, even anonymized) at scale for product improvement may require a DPIA under GDPR Art. 35, particularly if Raxx crosses the "systematic large-scale processing" threshold. At Founders scale this likely doesn't apply; revisit at GA with a formal DPIA.