Raxx · internal docs

internal · gated

Data Protection Impact Assessment — Shadow Analytics Pipeline

Status: Draft. Requires attorney review before relying on this document for GDPR defense.
Date: 2026-05-17
Author: Raxx engineering (architectural substrate); legal review pending.
Scope: Opt-in shadow aggregate analytics pipeline for the Raxx platform.
Lawful basis: GDPR Art. 6(1)(a): freely given, specific, withdrawable consent.
Parent issue: #279
Consultation record: PR #268 (2026-04-24), ADR 0017, ADR 0018


1. Processing activity description

What data is collected

The shadow analytics pipeline collects weekly bucketed aggregate signals only. No individual trade records, no raw user events, and no PII are transmitted to the analytics store.

Signal shapes at v1 (Goals 1 + 2 per ADR 0018):

| Signal category | What is bucketed | Granularity |
| --- | --- | --- |
| Strategy patterns | strategy-family x market-regime x win-rate-bucket x hold-period-bucket | Weekly aggregate count per bucket |
| Product improvement | error-type histograms; page-dwell-time buckets (e.g. <10s, 10–60s, >60s); feature-visit frequency; strategy-construction-abandoned-at-step counts | Weekly aggregate count per bucket |

Bucket design ensures no single user's action is identifiable: all buckets with fewer than k=20 contributing users are suppressed entirely and never written to the analytics store (see Section 4).
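
As an illustration of how a raw value is coarsened before it is counted, the dwell-time buckets listed in the table could be assigned roughly as follows. This is a minimal sketch in Python for readability; the function name `dwell_bucket` and the treatment of the 10 s and 60 s boundaries are assumptions, not taken from the production aggregator.

```python
def dwell_bucket(seconds: float) -> str:
    """Map a raw page dwell time to one of the three coarse buckets.

    Boundary handling is an assumption: 10 s and 60 s both fall in the
    middle bucket, matching the "<10s, 10-60s, >60s" labels.
    """
    if seconds < 0:
        raise ValueError("dwell time cannot be negative")
    if seconds < 10:
        return "<10s"
    if seconds <= 60:
        return "10-60s"
    return ">60s"
```

Only the bucket label would be counted; the raw number of seconds is discarded at this step.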

How it is anonymized

  1. Client-side aggregation. The shadow aggregator runs as client-side JavaScript. Raw user events are aggregated locally into weekly buckets before any data leaves the client. The server never receives raw event data.
  2. Pseudonym derivation. Each opted-in user is assigned a pseudonym derived client-side from the WebAuthn PRF output. The server never receives or stores the derivation input. The pseudonym rotates on each re-opt-in (prior data is hard-deleted; re-opt-in starts a fresh pseudonym).
  3. k=20 k-anonymity floor. The analytics service enforces a minimum k=20 at the query layer: any bucket containing fewer than 20 distinct contributing pseudonyms is suppressed. The floor can be lowered only by explicitly setting the I_ACCEPT_WEAKER_ANONYMITY environment-variable override guard (see Section 4.1 and Runbook §5).
  4. No PII columns. The analytics schema contains no email addresses, usernames, account IDs, IP addresses, or any other directly identifying attribute. This is enforced at the migration level and reviewed on schema change.
  5. No FK to users table. The analytics store is a structurally isolated service with no shared database and no foreign key reference to the main users table. Re-identification via join is structurally impossible, not merely policy-prohibited.
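
The client-side aggregation in step 1 can be pictured with a short sketch. The production aggregator is client-side JavaScript; Python is used here only for illustration, and the event tuple layout, the ISO-week key format, and the example bucket values in the usage note are all assumptions.

```python
from collections import Counter
from datetime import date


def iso_week(day: date) -> str:
    """Return an ISO-8601 week label such as '2026-W20'."""
    year, week, _ = day.isocalendar()
    return f"{year}-W{week:02d}"


def aggregate_weekly(events):
    """Collapse raw local events into weekly bucket counts.

    `events` is an iterable of tuples:
    (day, strategy_family, market_regime, win_rate_bucket, hold_period_bucket).
    Only the returned counts would leave the client; the raw events stay local.
    """
    counts = Counter()
    for day, *bucket in events:
        counts[(iso_week(day), *bucket)] += 1
    return dict(counts)
```

For example, two events on 2026-05-11 and 2026-05-17 with the same hypothetical bucket values ("momentum", "high-vol", "50-60%", "1-5d") collapse into a single count of 2 under the key for week 2026-W20.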

Where it is stored

Analytics data is stored in a dedicated EU-region Postgres instance operated as a separate Heroku app (raxx-analytics-prod). It shares no infrastructure with the primary Raptor API (raxx-api-prod) or Queue service. Cross-border transfer to non-EU regions is not permitted without a DPIA re-run (see Section 6).

How long it is retained


2. Necessity and proportionality assessment

Why this data is needed

The shadow analytics pipeline serves two concrete product goals (ADR 0018 §Ranking):

Goal 1 — Product improvement: Raxx needs to understand which UX flows break, which strategies confuse users, which error paths are common, and which features go unused. Without any behavioral signal, product improvement relies entirely on qualitative feedback, which is low-volume and subject to selection bias. Aggregate error histograms and feature-usage counts are the minimum signal needed to make data-informed product decisions.

Goal 2 — Strategy recommendations: Raxx's platform value increases when recommendations are grounded in what actually worked for users in similar market regimes and risk profiles. Aggregate strategy-family x regime x win-rate bucketing provides that signal at population level without exposing any individual's trades.

Why at this scale

The k=20 anonymity floor means that a signal is recorded only when at least 20 users have contributed to a bucket in a given week. Raxx treats this as the minimum population size at which an individual contribution is hidden within the aggregate (the residual risk is covered in Section 3, Risk 1). Strategy buckets with fewer than 20 contributors are suppressed entirely; this creates blind spots at launch (when opt-in counts are low), which is an accepted product cost of the anonymity guarantee.

Alternatives considered

Option A — Pure E2E, no analytics (ADR 0017): Accepted as the strongest privacy posture. Rejected as the primary posture because it forecloses the product improvement and recommendation feedback loops that produce direct user value at acceptable privacy cost when opt-in is freely given. This option remains available as a user-level choice: users who do not opt in receive identical privacy to Option A.

Option C — Mandatory shadow pipeline (ADR 0017): Rejected on GDPR grounds. Mandatory behavioral telemetry without freely-given consent lacks a clean lawful basis under Art. 6. Legitimate interest is not applicable given the nature of the processing.

Option D — Server-side columnar encryption with server-held keys (ADR 0017): Rejected as incompatible with Raxx's E2E encryption base invariant. Server-held keys mean Raxx can always read user content, which forfeits the privacy differentiation that is central to the product.

Proportionality

Processing is limited to aggregate weekly buckets (not continuous streams), is opt-in only, covers Goals 1 and 2 only (Goal 3 — data-as-product — requires a separate consent flow and new DPIA), and is structurally incapable of producing individual-user output. The data processing is proportionate to the product improvement and recommendation goals it serves.


3. Risks to data subjects

Risk 1: Re-identification

Description: An adversary with access to the analytics store could attempt to re-identify individual users by correlating bucket membership with external knowledge (e.g., public trading records, community posts about strategy use).

Likelihood: Low. k=20 suppression means no bucket is recorded with fewer than 20 contributors. Strategy buckets are broad (family x regime x win-rate range x hold-period range), making individual-bucket-to-person mapping difficult even with auxiliary information. No pseudonym-to-user mapping exists server-side.

Residual risk: Non-zero at low user counts where a suppressed bucket becomes populated by exactly 20 users who are all identifiable by other means. This is a known limitation of k-anonymity without differential privacy noise.

Mitigation: k=20 floor (Section 4.1); bucket design uses ranges not exact values; no PII columns; no FK to users table. Revisit with differential privacy (Laplace noise) at v1.5 if re-identification risk is assessed to have increased (ADR 0017 §Revisit When).
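
For reference, the Laplace mechanism mentioned as a possible v1.5 addition could look roughly like the sketch below. Nothing here is decided: the function name `noisy_count`, the privacy parameter `epsilon`, and the sensitivity of 1 (one pseudonym changes a bucket count by at most 1) are assumptions for illustration.

```python
import random


def noisy_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Add Laplace(0, 1/epsilon) noise to a bucket count.

    Assumes sensitivity 1: adding or removing one pseudonym changes the
    count of a single bucket by at most 1.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rate = epsilon  # the Laplace scale is 1/epsilon, so the exponential rate is epsilon
    # The difference of two i.i.d. exponential samples is Laplace-distributed.
    return true_count + rng.expovariate(rate) - rng.expovariate(rate)
```

Smaller values of `epsilon` add more noise and give stronger protection at the cost of less accurate counts, so the choice of `epsilon` would itself need to be assessed in the DPIA re-run.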

Risk 2: Cross-border transfer

Description: Analytics data stored outside the EU/EEA would require a transfer mechanism under GDPR Chapter V (Standard Contractual Clauses, Adequacy Decision, or Binding Corporate Rules).

Likelihood: Low under current configuration. Analytics store is EU-region by default.

Mitigation: Analytics store region is locked to EU in Heroku configuration. Any region change triggers a mandatory DPIA re-run (Section 6). Infrastructure-as-code review required before any region migration.

Risk 3: Analytics store breach

Description: An attacker who obtains read access to the analytics store could extract the aggregate bucket data.

Likelihood: Low-medium (any database can be breached).

Impact: Low. A breach of the analytics store reveals only aggregate bucket counts per strategy-family x regime x win-rate x hold-period week. No individual trades, no PII, no account information. The blast radius of a full analytics store breach is "aggregate behavioral statistics" — not individual user exposure.

Mitigation: Analytics store access is gated behind service-to-service authentication (ANALYTICS_AUTH_TOKEN, rotated per Runbook §2). No public read endpoints. CF Access or equivalent perimeter protection on the analytics service.

Risk 4: Repurposing for commercial data product

Description: Goal 3 (anonymized data as a commercial product — licensing to researchers or institutional buyers) was identified in ADR 0018 as a potential future use. If Goal 3 were pursued using data collected under the v1 Goal 1+2 consent, the processing would exceed the scope of the original consent.

Likelihood: Zero if the architectural guardrails below are respected.

Mitigation: Goal 3 is explicitly out of scope for v1 consent. The consent toggle copy covers only Goals 1 and 2. Any Goal 3 product line requires: (a) a new ADR, (b) a separate consent flow with distinct language, (c) a new DPIA, and (d) legal review. The analytics service does not expose a Goal 3 API endpoint; any such endpoint would require a code change that triggers this review chain.

Risk 5: Consent-toggle cycling

Description: A user who rapidly toggles consent could attempt to use the cycling as a side-channel to probe aggregate bucket membership (e.g., "does my contribution flip the k=20 threshold?").

Likelihood: Very low. The k-floor suppresses the signal entirely below the threshold; cycling does not reveal individual bucket membership.

Mitigation: Rapid cycling (>3 transitions in <30 days) is flagged in the consent_history table and surfaces a UX prompt to the user (ADR 0018 §3). No lockout; the flag is a signal that UX needs work for that user.
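
The rapid-cycling rule can be expressed as a sliding-window check over a user's consent transitions. This is a sketch only: it reads ">3 transitions in <30 days" as "four or more transitions whose first and last timestamps are less than 30 days apart", and the function name `is_rapid_cycling` is hypothetical.

```python
from datetime import datetime, timedelta


def is_rapid_cycling(transitions, max_transitions=3, window=timedelta(days=30)):
    """Return True if more than `max_transitions` consent transitions
    fall inside any span shorter than `window`.

    `transitions` is an iterable of datetime objects, one per consent change.
    """
    times = sorted(transitions)
    needed = max_transitions + 1
    # Slide over every run of `needed` consecutive transitions.
    for i in range(len(times) - needed + 1):
        if times[i + needed - 1] - times[i] < window:
            return True
    return False
```

A positive result would only set the flag and trigger the UX prompt; consistent with the text above, it would not lock the user out.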


4. Measures to mitigate risks

4.1 k=20 anonymity floor

The analytics service enforces k=20 at the query layer. Any bucket with fewer than 20 distinct contributing pseudonyms in the reference period is suppressed — the bucket is not written to the analytics store for that week. The I_ACCEPT_WEAKER_ANONYMITY environment variable is the override guard; it must be explicitly set to reduce the floor and requires Kristerpher sign-off plus a DPIA re-run note before any production override. See Runbook §5.
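
A minimal sketch of the floor and its override guard follows. The variable name I_ACCEPT_WEAKER_ANONYMITY comes from this document; the accepted value "1", the function names, and the in-memory data shape are assumptions, and the sign-off and DPIA re-run note remain process steps that code cannot replace.

```python
import os

DEFAULT_K = 20


def effective_k(requested_k=DEFAULT_K, env=None):
    """Return the k-floor to apply, refusing any reduction below 20
    unless the override guard is explicitly set (value "1" is assumed)."""
    env = os.environ if env is None else env
    if requested_k < DEFAULT_K and env.get("I_ACCEPT_WEAKER_ANONYMITY") != "1":
        raise RuntimeError("k below 20 requires I_ACCEPT_WEAKER_ANONYMITY")
    return requested_k


def suppress_small_buckets(contributors_by_bucket, k=DEFAULT_K):
    """Keep only buckets with at least k distinct contributing pseudonyms.

    `contributors_by_bucket` maps a bucket key to a collection of pseudonyms;
    the result maps each surviving bucket key to its distinct-contributor count.
    """
    result = {}
    for bucket, pseudonyms in contributors_by_bucket.items():
        distinct = len(set(pseudonyms))
        if distinct >= k:
            result[bucket] = distinct
    return result
```

A bucket with exactly 20 distinct pseudonyms is kept and one with 19 is dropped, which matches the "fewer than 20" wording above.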

4.2 Hard delete on opt-out

On opt-out, a hard-delete job is queued immediately. The job must complete within 30 days (SLA for GDPR Art. 17 erasure; the one-month limit for acting on a request is set by Art. 12(3)). The delete pipeline removes all records keyed to the user's pseudonym from the analytics store. The consent_history table retains an opt-out event record (operational metadata, server-readable) but no shadow-analytics data survives post-delete. See Runbook §4 for debug procedure on stalled deletes.
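
The deadline arithmetic and the pseudonym-keyed delete can be sketched as below. The row shape (a dict with a `pseudonym` key) and all function names are hypothetical; the real pipeline operates on the analytics Postgres instance, not on in-memory lists.

```python
from datetime import datetime, timedelta, timezone

DELETE_SLA = timedelta(days=30)


def delete_deadline(opted_out_at: datetime) -> datetime:
    """Latest acceptable completion time for the hard-delete job."""
    return opted_out_at + DELETE_SLA


def is_overdue(opted_out_at: datetime, now: datetime, completed: bool) -> bool:
    """True if the job has not completed and the 30-day window has passed."""
    return (not completed) and now > delete_deadline(opted_out_at)


def hard_delete(rows, pseudonym):
    """Return the rows that remain after removing everything keyed to `pseudonym`."""
    return [row for row in rows if row["pseudonym"] != pseudonym]
```

A monitoring job could call `is_overdue` for each queued delete and page the on-call engineer before the deadline is reached rather than after.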

4.3 Pseudonym rotation on re-opt-in

When a user re-opts-in after a prior opt-out, their previous shadow data has been hard-deleted per §4.2. Re-opt-in derives a new pseudonym (client-side, from the WebAuthn PRF output). There is no continuity of pseudonym between the old and new opt-in periods; a re-opted-in user contributes fresh data from re-opt-in date forward only. This is surfaced to the user in the re-opt-in confirmation screen (ADR 0018 §3).
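
One way to obtain a fresh pseudonym on every opt-in is to mix a new random nonce into the derivation. The sketch below is an assumption, not the documented scheme: this document states only that the pseudonym is derived client-side from the WebAuthn PRF output. The HMAC-SHA256 construction, the label `raxx-shadow-pseudonym-v1`, and the per-opt-in nonce (assumed to be generated and kept on the client) are all hypothetical.

```python
import hashlib
import hmac
import secrets

# Hypothetical domain-separation label; not taken from the Raxx codebase.
LABEL = b"raxx-shadow-pseudonym-v1"


def new_opt_in_nonce() -> bytes:
    """Generate a fresh random nonce for one opt-in period."""
    return secrets.token_bytes(32)


def derive_pseudonym(prf_output: bytes, opt_in_nonce: bytes) -> str:
    """Derive an opaque pseudonym keyed by the PRF output.

    A new nonce yields an unlinkable pseudonym, so data from an earlier
    opt-in period cannot be tied to the new one.
    """
    return hmac.new(prf_output, LABEL + opt_in_nonce, hashlib.sha256).hexdigest()
```

Under this sketch the same PRF output and nonce always give the same pseudonym within one opt-in period, while a re-opt-in with a new nonce gives an unrelated value.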

4.4 Separate service with no FK to users table

The analytics service is a structurally isolated Heroku app with its own database. No foreign key reference to the Queue service's users table or any other user-identity table exists. Re-identification via relational join is structurally impossible. Any schema change that would introduce a user-identity reference requires PM + security-agent review before landing.

4.5 No PII columns in analytics schema

The analytics schema is enforced at the migration level to contain no PII: no email, no username, no account ID, no IP address, no device fingerprint, no geolocation. Pseudonyms are opaque cryptographic identifiers derived client-side. Schema review on any migration is required to confirm no PII column is being added.
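
A name-based lint is one way a migration check might catch an obviously identifying column. The sketch below is illustrative; the fragment list and the function name `find_pii_columns` are assumptions, and a check like this supplements the required human schema review rather than replacing it, since a PII column can carry an innocuous name.

```python
# Hypothetical deny-list, derived from the attribute types named in this section.
PII_NAME_FRAGMENTS = (
    "email",
    "username",
    "account_id",
    "ip_address",
    "fingerprint",
    "geoloc",
)


def find_pii_columns(column_names):
    """Return the column names that contain a deny-listed fragment."""
    flagged = []
    for name in column_names:
        normalized = name.lower().replace("-", "_")
        if any(fragment in normalized for fragment in PII_NAME_FRAGMENTS):
            flagged.append(name)
    return flagged
```

A CI step could fail the migration whenever the returned list is non-empty.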

4.6 Append-only consent history

An append-only consent_history table records every consent state transition per user (opt-in, opt-out, re-opt-in timestamps). This table is operational metadata (server-readable) rather than E2E-encrypted user content, because GDPR Art. 7 compliance requires Raxx to prove consent-state-at-time independently of the user's presence. The table is sufficient to answer "what was this user's consent state at time T?" per ADR 0018 open question resolution.
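
The "consent state at time T" question reduces to finding the last transition at or before T. A minimal sketch, assuming the history for one user is available as a time-sorted list of (timestamp, event) pairs; the event labels and the function name `consent_state_at` are hypothetical.

```python
from bisect import bisect_right
from datetime import datetime

CONSENTING_EVENTS = ("opt-in", "re-opt-in")


def consent_state_at(history, t: datetime) -> bool:
    """Return True if the user was opted in at time `t`.

    `history` is a list of (timestamp, event) pairs sorted by timestamp.
    Before the first recorded event the user counts as not consenting,
    because the pipeline is opt-in only.
    """
    timestamps = [ts for ts, _ in history]
    index = bisect_right(timestamps, t)  # number of events at or before t
    if index == 0:
        return False
    return history[index - 1][1] in CONSENTING_EVENTS
```

Because the table is append-only, the answer for any past T does not change as later transitions are recorded.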


5. Consultation record

| Item | Reference |
| --- | --- |
| Architecture decision (Option B selected) | ADR 0017: E2E with shadow-analytics posture |
| Consent UX decisions (single toggle, re-opt-in, Goals 1+2) | ADR 0018: Shadow-analytics data goals + consent-UX consequences |
| Decision thread | PR #268 review (2026-04-24) |
| Deciders | Kristerpher Henderson (product owner) |
| Engineering | software-architect agent |

Attorney review status: Pending. This document is the architectural substrate for a GDPR Art. 35 DPIA. An attorney with GDPR expertise must review this document before it is relied upon as a defense in a regulatory proceeding. See issue #279 "Out of scope" note.


6. Review cadence

This DPIA must be re-run when any of the following conditions occur (whichever comes first):

  1. Annual review — not later than 2027-05-17.
  2. New data goal added — specifically if Goal 3 (data-as-product / commercial licensing) is pursued. Requires a new DPIA covering the expanded processing scope and a new consent flow.
  3. k-floor reduced — any reduction of the k=20 floor weakens the anonymity guarantee and requires re-assessment of re-identification risk (Section 3, Risk 1).
  4. New storage region added — any analytics data stored outside the EU/EEA requires a transfer mechanism review (Section 3, Risk 2).
  5. New data columns added to analytics schema — each new signal type must be assessed for necessity, proportionality, and re-identification risk.
  6. Analytics store shared with a third party — sharing with any external party (researcher, institutional buyer, SaaS vendor with data-access rights) triggers the Goal 3 review chain.
  7. User count crosses 10,000 opted-in users — at this scale, the k=20 floor's practical protection changes; re-assess suppression rate and consider adding differential privacy noise (Laplace mechanism) at the query layer per ADR 0017 §Revisit When.
  8. Material change to EU GDPR Art. 35 guidance — any new ICO, CNIL, or EDPB guidance specifically addressing behavioral analytics under E2E encryption triggers a review.

Owner: Kristerpher Henderson. The security-agent flags material changes in CI (schema migrations, region changes, new signal types) and files a review-trigger issue when a condition above is detected.
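
The machine-checkable triggers in the list above could be evaluated along the following lines. This is a sketch under stated assumptions: the `PipelineState` fields, the region label "eu", and reading "crosses 10,000" as "reaches 10,000 or more" are hypothetical, and condition 8 (new regulatory guidance) is left out because it needs human judgment.

```python
from dataclasses import dataclass
from datetime import date

ANNUAL_REVIEW_DUE = date(2027, 5, 17)


@dataclass
class PipelineState:
    today: date
    k_floor: int = 20
    storage_regions: tuple = ("eu",)
    opted_in_users: int = 0
    goal_3_pursued: bool = False
    new_schema_columns: bool = False
    shared_with_third_party: bool = False


def dpia_review_triggers(state: PipelineState):
    """Return the labels of all review conditions that currently fire."""
    checks = [
        ("annual review due", state.today >= ANNUAL_REVIEW_DUE),
        ("new data goal added", state.goal_3_pursued),
        ("k-floor reduced", state.k_floor < 20),
        ("non-EU storage region", any(r != "eu" for r in state.storage_regions)),
        ("new analytics columns", state.new_schema_columns),
        ("third-party sharing", state.shared_with_third_party),
        ("10,000 opted-in users reached", state.opted_in_users >= 10_000),
    ]
    return [label for label, fired in checks if fired]
```

An empty list means no automated trigger has fired; it does not by itself mean the DPIA is current.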