ADR 0016 — Founders Trial: Celery beat for daily sweep, not APScheduler

Status: Proposed
Date: 2026-04-22
Deciders: software-architect
Related: ADR 0013 (MBT paper-trading engine, §8), docs/architecture/founders-trial-engine.md §8
Parent card: #206

Context

The Founders trial engine needs a daily job that scans all non-terminal founder_trial rows, fires warning transitions (30d / 14d / 7d / 1d), and advances expired rows through grace into lapsed. The job must be:

Catch-up capable (missed runs should self-correct on next execution).
Idempotent (safe to run multiple times per day without side effects).
Observable (failures must surface as alertable errors, not silent no-ops).
Consistent with the rest of the Raptor async job infrastructure.
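
As a rough illustration of the catch-up and idempotency requirements, the warning selection for a single row could look like the sketch below. The function name `due_warning`, the `already_sent` set, and the choices to fire only the tightest crossed threshold and to treat the expiry day itself as expired are assumptions made for the example, not part of the engine design.

```python
from datetime import date
from typing import Optional, Set

# Warning thresholds in days before expiry (30d / 14d / 7d / 1d).
WARNING_DAYS = (30, 14, 7, 1)


def due_warning(expires_on: date, today: date, already_sent: Set[int]) -> Optional[int]:
    """Return the warning threshold to fire for this row today, or None."""
    days_left = (expires_on - today).days
    if days_left < 1:
        # Assumption: expired rows are handled by the grace/lapsed transitions.
        return None
    # Thresholds the row has already crossed, whether or not a run was missed.
    crossed = [t for t in WARNING_DAYS if days_left <= t]
    if not crossed:
        return None
    # Catch-up: pick the tightest crossed threshold; looser ones are not replayed.
    threshold = min(crossed)
    # Idempotency: a threshold recorded as sent is never fired again.
    return None if threshold in already_sent else threshold
```

Because the result depends only on the dates and the recorded thresholds, a second run on the same day returns None once the first run has recorded the warning, and a run after several missed days picks up the threshold that currently applies.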

Two candidates: APScheduler (in-process, simple) or Celery beat (the scheduler that drives all MBT background jobs).

Decision

Use Celery beat.

The Founders daily sweep (founders.daily_sweep) is registered as a Celery beat periodic task on the existing Celery + Redis stack already chosen for MBT (ADR 0013 §8). It is not a new dependency; it is a new task on a running scheduler.
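
A minimal sketch of what the registration might look like, assuming the schedule is built by a small helper. The helper name `build_beat_schedule` and the entry key `founders-daily-sweep` are hypothetical. The schedule factory is passed in as a parameter so the sketch runs without Celery installed; in the app it would be `celery.schedules.crontab`. The 01:00 run time and the `FOUNDERS_PROMO_SCHEDULER_DISABLED` flag are the ones named under Consequences.

```python
from typing import Any, Callable, Dict, Mapping


def build_beat_schedule(
    env: Mapping[str, str],
    crontab: Callable[..., Any],
) -> Dict[str, Dict[str, Any]]:
    """Build a beat_schedule dict, leaving out the sweep when it is disabled."""
    schedule: Dict[str, Dict[str, Any]] = {}
    # Assumption: the flag is read as the literal string "1".
    if env.get("FOUNDERS_PROMO_SCHEDULER_DISABLED") != "1":
        schedule["founders-daily-sweep"] = {  # hypothetical entry name
            "task": "founders.daily_sweep",
            # Daily at 01:00; assumes the Celery app timezone is UTC.
            "schedule": crontab(hour=1, minute=0),
        }
    return schedule
```

In the application this would be wired up roughly as `app.conf.beat_schedule = build_beat_schedule(os.environ, crontab)`, merged with the existing MBT entries.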

Consequences

Positive

No new infrastructure. Celery + Redis is already required for MBT jobs (mbt.eod_mark_all, mbt.purge_expired, etc.). Adding founders.daily_sweep is a task registration, not a stack addition.
Uniform failure handling. Celery task failures flow into the same alerting path (Flower / Sentry / error queue) as MBT tasks. Ops sees one monitoring surface.
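Worker process separation. If the founders sweep takes longer than expected (large Founders cohort), it runs in a worker process, not in the web-request handling process. An in-process APScheduler job would instead compete with request handling for the web process's CPU and memory.
Retry semantics. Celery's built-in task retries (the autoretry_for and retry_backoff task options, or an explicit self.retry() call) handle transient DB or Redis failures. APScheduler has no equivalent, so retry logic would have to be written by hand.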
Beat schedule is env-driven. The beat schedule dict is loaded from config — FOUNDERS_PROMO_SCHEDULER_DISABLED=1 removes the task from the schedule without a code change.

Negative

Celery + Redis required from day 1 of the Founders feature. If the Founders promo were to turn on before MBT ships, Celery would have to be running anyway. Because MBT is a prerequisite for Founders (Founders receive MBT Pro-tier accounts), this is a non-issue in practice: MBT brings Celery before Founders needs it.
Beat tick granularity is 1 minute. The daily sweep fires once at 01:00 UTC; sub-minute precision is irrelevant here.

Alternatives considered

APScheduler (in-process)

APScheduler runs inside the Flask/gunicorn process. It is simple to configure and has no external dependencies.

Rejected because:
- In a multi-worker gunicorn setup (multiple processes), APScheduler fires from every worker process simultaneously, so the sweep would run once per worker process at each scheduled time (N times for N workers on each machine) unless a distributed lock is added. Adding a Redis-based lock to make APScheduler safe brings most of the complexity of Celery beat with fewer of the benefits.
- Failures are unobservable from the standard Celery/Flower monitoring surface.
- If the web process crashes and restarts, APScheduler's in-memory state is lost. Celery beat persists its schedule state in the beat store.
- APScheduler is not in the existing MBT stack. Adding it solely for Founders creates a second async job runtime to operate.

Standalone cron job / GitHub Actions scheduled workflow

Rejected. Cron on the server is fragile (must be configured per deployment), not portable to container-based hosting, and not visible from the app's internal monitoring. GitHub Actions scheduled workflows can start well after their scheduled time when the platform is under load, and they are not designed for operational DB sweeps.

Temporal workflow

Rejected for v1 per ADR 0013 rationale: adds a dedicated cluster, new mental model, and higher ops footprint. The Founders sweep is a simple daily scan — not a stateful long-running workflow. Temporal is the right answer if we later build multi-day saga orchestration; it is overkill here.

Revisit when

MBT migrates its own jobs to Temporal. If the whole job stack moves, Founders tasks move with it.
The Founders cohort grows large enough that the daily sweep takes more than a few minutes. At that scale, consider chunked processing (sweep N rows per task, chain tasks) or a dedicated queue.
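
The chunked-processing option could be sketched as follows, assuming the sweep first collects the ids of non-terminal rows and then hands each batch to its own task. The helper name `chunked` and the idea of passing row ids (rather than full rows) are assumptions for illustration.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def chunked(row_ids: Iterable[int], size: int) -> Iterator[List[int]]:
    """Yield successive lists of at most `size` row ids."""
    if size < 1:
        # Raised when iteration starts, since this is a generator function.
        raise ValueError("size must be at least 1")
    it = iter(row_ids)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk
```

Each yielded batch would then be dispatched as one task invocation, so a slow or failing batch is retried on its own without repeating the rest of the sweep.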