Raxx · internal docs

ADR 0021 — Trace Storage: Timescale vs Plain Postgres vs ClickHouse vs Others

Status: Accepted (conditional — see open question in §Consequences)
Date: 2026-04-29
Deciders: software-architect
Refs: workflow-uuid-tracing.md, ADR-0003


Context

The workflow-UUID tracing design requires a storage backend that can:

  1. Append events at ~20–50 per user session without write contention.
  2. Answer "reconstruct this user's day" queries efficiently as history accumulates over years (see the query sketch after this list).
  3. Compress repetitive, append-only rows (many rows share user_id, workflow_id, short action_type strings).
  4. Support a cold-storage tier so years of trade-affecting events (7-year retention per ADR-0003) do not keep hot-tier costs linear.
  5. Integrate with Heroku (Raptor's current platform) without introducing a new ops paradigm.
  6. Survive pre-launch on a budget — no $500/month infra commitment before the first paying customer.
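
Requirement 2 is the canonical read pattern. A sketch of the query it implies, with table and column names assumed from workflow-uuid-tracing.md rather than fixed here:

    -- "Reconstruct this user's day": a single index-friendly range scan.
    SELECT ts_emitted, workflow_id, action_type, context
    FROM trace_events
    WHERE user_id = $1
      AND ts_emitted >= $2                       -- start of day, UTC
      AND ts_emitted <  $2 + INTERVAL '1 day'
    ORDER BY ts_emitted;

Every candidate below is judged largely on how cheaply it serves this scan as history grows.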

Candidates evaluated: Timescale (Postgres + extension), plain Postgres (partitioned), ClickHouse, InfluxDB, AWS Timestream, SQLite (4-week MVP path).


Decision

Use Timescale (TimescaleDB extension on Postgres) as the production trace store, with SQLite as a valid 4-week MVP shortcut.

Why Timescale over plain Postgres with partitioning

Postgres declarative partitioning on ts_emitted works, but it requires manual partition creation per time range, manual index management per partition, and a separate housekeeping job to detach old partitions for cold storage. Timescale's create_hypertable automates all of this, as the sketch below shows.
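
A sketch of the Timescale setup, assuming the trace_events table and ts_emitted column from workflow-uuid-tracing.md (illustrative names, not the final schema):

    -- Plain Postgres would need CREATE TABLE ... PARTITION BY RANGE plus a
    -- housekeeping job minting monthly partitions; here one call replaces that.
    CREATE TABLE trace_events (
        trace_id     uuid        NOT NULL,
        user_id      uuid        NOT NULL,
        workflow_id  uuid        NOT NULL,
        action_type  text        NOT NULL,
        context      jsonb,
        ts_emitted   timestamptz NOT NULL
    );

    SELECT create_hypertable('trace_events', 'ts_emitted',
                             chunk_time_interval => INTERVAL '1 month');

    -- Columnar compression for cooled chunks (requirement 3); segmenting by
    -- user_id keeps one user's history cheap to read after compression.
    ALTER TABLE trace_events SET (
        timescaledb.compress,
        timescaledb.compress_segmentby = 'user_id'
    );
    SELECT add_compression_policy('trace_events', INTERVAL '30 days');

    -- Retention (requirement 4, ADR-0003): drop chunks past seven years.
    -- Exporting chunks to cold storage before the drop is a separate job.
    SELECT add_retention_policy('trace_events', INTERVAL '7 years');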

Why not ClickHouse

ClickHouse is purpose-built for OLAP on append-only data and outperforms Postgres at hundreds of millions of rows. But:

  1. It is a new ops paradigm (requirement 5): a separate server with its own backups, monitoring, and failure modes alongside Postgres.
  2. There is no Heroku add-on, so it would be the first piece of infrastructure run outside the platform.
  3. GDPR-style per-user deletes are awkward: ClickHouse handles row mutation through asynchronous background rewrites rather than transactional DELETEs (see the contrast below).
  4. At pre-launch volume the performance advantage is irrelevant; Raxx is nowhere near the row counts where ClickHouse wins.

Deferred: revisit at v2 if trace volume approaches the 100M-row range.
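
To make the mutation point concrete, a minimal contrast on the same assumed trace_events table; the ClickHouse statement is its standard mutation syntax, not code from this repo:

    -- Postgres/Timescale: GDPR erasure is an ordinary transactional DELETE.
    DELETE FROM trace_events WHERE user_id = $1;

    -- ClickHouse: a delete is an asynchronous "mutation" that rewrites whole
    -- data parts in the background, with no transactional guarantee about
    -- when the rows are physically gone.
    ALTER TABLE trace_events DELETE WHERE user_id = '<uuid>';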

Why not InfluxDB

InfluxDB's data model (measurements, tags, fields) fits infrastructure metrics well but is a poor fit for Raxx's event schema, which is a heterogeneous record with varying context fields. The Flux query language is a separate skill. InfluxDB Cloud has a separate billing model and does not compose with the existing Postgres-centric Raptor codebase. Rejected.

Why not AWS Timestream

AWS Timestream is managed, serverless, and cheap at low volume. But it couples the trace store to AWS permanently — a platform change from Heroku to Fly.io or Railway would require a migration path. Timestream queries use a custom SQL dialect. No Heroku add-on. Rejected for pre-launch; acceptable as a v2 option if Raxx migrates fully to AWS infrastructure.

Why SQLite is valid for the 4-week path

At MVP volume (<10 users, <10,000 events total), SQLite with a composite index on (user_id, ts_emitted) is entirely sufficient. The schema is identical to the Timescale schema minus the create_hypertable call. When Raptor migrates from SQLite to Postgres (a planned step in the backend roadmap), the trace table migrates with it and create_hypertable is called on the existing data. No data is lost; Timescale applies partitioning retroactively.
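
A sketch of the MVP shape and the forward migration, reusing the assumed names from the sketches above:

    -- SQLite MVP: same columns as the Timescale sketch, stored with SQLite's
    -- text affinity (no native uuid/jsonb/timestamptz types).
    CREATE TABLE trace_events (
        trace_id     TEXT NOT NULL,
        user_id      TEXT NOT NULL,
        workflow_id  TEXT NOT NULL,
        action_type  TEXT NOT NULL,
        context      TEXT,             -- JSON stored as text
        ts_emitted   TEXT NOT NULL     -- ISO-8601 UTC strings sort chronologically
    );
    CREATE INDEX idx_trace_user_time ON trace_events (user_id, ts_emitted);

    -- Later, on Postgres with the extension installed: migrate_data => true
    -- tells Timescale to chunk the rows already present in the table.
    SELECT create_hypertable('trace_events', 'ts_emitted', migrate_data => true);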

If Raptor is already on Heroku Postgres (not SQLite), the SQLite path is moot — go straight to Timescale.


Consequences

Positive

  1. Partition creation, per-chunk indexing, and cold-tier housekeeping are automated by the extension instead of hand-rolled jobs.
  2. Native columnar compression addresses the repetitive, append-only rows (requirement 3) without a second system.
  3. The store remains plain Postgres underneath: existing SQL skills, drivers, backups, and the Heroku workflow all carry over.
  4. A cheap SQLite on-ramp exists for the 4-week MVP, with a lossless forward migration.

Negative

  1. The schema is coupled to the TimescaleDB extension: any future Postgres host must offer it, a stricter requirement than vanilla Postgres.
  2. If trace volume ever reaches the 100M+ row range where ClickHouse wins, a second migration is implied (deferred deliberately, per above).
  3. The database path cannot be finalized until the open question below is answered.

Open question

This ADR's database path decision is conditional on the answer to Open Question 1 in workflow-uuid-tracing.md §13: is Raptor on SQLite or Heroku Postgres today? If Postgres, the recommendation is Timescale immediately. If SQLite, the recommendation is SQLite for MVP, Timescale at the first Postgres migration.


Alternatives Considered (summary)

Candidate                     | Verdict                  | Reason
------------------------------|--------------------------|--------------------------------------------------------------------------------
Timescale                     | Recommended              | Postgres-native; auto-partitioning; compression; tiered storage; Heroku-compatible
Plain Postgres + partitioning | Acceptable fallback      | Works, but manual partition management and no built-in compression
ClickHouse                    | Deferred to v2 at scale  | Better at 100M+ rows; new ops paradigm; GDPR mutation complexity
InfluxDB                      | Rejected                 | Inappropriate data model; separate query language; no Heroku add-on
AWS Timestream                | Rejected pre-launch      | AWS vendor lock-in; custom SQL dialect; no Heroku integration
SQLite                        | MVP-only                 | Sufficient for <10K events; migrates forward to Timescale without data loss