ADR 0016 — Founders Trial: Celery beat for daily sweep, not APScheduler
Status: Proposed
Date: 2026-04-22
Deciders: software-architect
Related: ADR 0013 (MBT paper-trading engine, §8), docs/architecture/founders-trial-engine.md §8
Parent card: #206
Context
The Founders trial engine needs a daily job that scans all non-terminal founder_trial rows, fires warning transitions (30d / 14d / 7d / 1d), and advances expired rows through grace into lapsed. The job must be:
- Catch-up capable (missed runs should self-correct on next execution).
- Idempotent (safe to run multiple times per day without side effects).
- Observable (failures must surface as alertable errors, not silent no-ops).
- Consistent with the rest of the Raptor async job infrastructure.
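The catch-up and idempotency requirements above can be illustrated with a minimal sketch. It assumes each row records which warnings have already been sent; the function name `due_warnings` and the `already_sent` set are hypothetical and not taken from the engine design. Only the 30/14/7/1-day thresholds come from this ADR.

```python
from datetime import date

# Warning thresholds (days before expiry) named in this ADR.
WARNING_DAYS = (30, 14, 7, 1)


def due_warnings(expires_on: date, today: date, already_sent: set[int]) -> list[int]:
    """Return thresholds that have been crossed but not yet recorded as sent.

    Catch-up: a threshold crossed during a missed run is still reported later.
    Idempotent: once a threshold is in `already_sent`, it is never returned again.
    """
    days_left = (expires_on - today).days
    if days_left < 0:
        # Expired rows are assumed to take the grace/lapsed path instead.
        return []
    return [d for d in WARNING_DAYS if days_left <= d and d not in already_sent]
```

Running the function twice on the same day returns nothing new the second time, provided the caller records each sent threshold. A real sweep might also collapse several overdue warnings into the most urgent one; that choice is outside the scope of this sketch.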
Two candidates: APScheduler (in-process, simple) or Celery beat (the scheduler that drives all MBT background jobs).
Decision
Use Celery beat.
The Founders daily sweep (founders.daily_sweep) is registered as a Celery beat periodic task on the existing Celery + Redis stack already chosen for MBT (ADR 0013 §8). It is not a new dependency; it is a new task on a running scheduler.
Consequences
Positive
- No new infrastructure. Celery + Redis is already required for MBT jobs (mbt.eod_mark_all, mbt.purge_expired, etc.). Adding founders.daily_sweep is a task registration, not a stack addition.
- Uniform failure handling. Celery task failures flow into the same alerting path (Flower / Sentry / error queue) as MBT tasks. Ops sees one monitoring surface.
- Worker process separation. If the founders sweep takes longer than expected (large Founders cohort), it runs in a worker process, not in the web-request handling process. An in-process APScheduler job would instead run inside the web process and compete with request handling for that process's resources.
- Retry semantics. Celery's built-in task retries (the autoretry_for task option or an explicit self.retry() call) handle transient DB or Redis failures. APScheduler requires manual retry logic.
- Beat schedule is env-driven. The beat schedule dict is loaded from config: setting FOUNDERS_PROMO_SCHEDULER_DISABLED=1 removes the task from the schedule without a code change.
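A minimal sketch of the env-driven schedule described in the last bullet, under stated assumptions: the function name `founders_beat_schedule` and the entry name `founders-daily-sweep` are hypothetical, and the 01:00 UTC time is the one given under Negative below. The schedule factory is a parameter so the sketch runs without Celery installed; in the app it would be `celery.schedules.crontab`.

```python
from typing import Any, Callable, Mapping


def founders_beat_schedule(
    environ: Mapping[str, str],
    make_schedule: Callable[..., Any] = dict,
) -> dict[str, dict[str, Any]]:
    """Return the Founders beat entries, or nothing when the kill switch is set."""
    if environ.get("FOUNDERS_PROMO_SCHEDULER_DISABLED") == "1":
        return {}
    return {
        "founders-daily-sweep": {  # hypothetical entry name
            "task": "founders.daily_sweep",
            # 01:00 daily; assumes the Celery app timezone is UTC.
            "schedule": make_schedule(hour=1, minute=0),
        }
    }
```

In the application this could be merged into the existing schedule with something like `app.conf.beat_schedule.update(founders_beat_schedule(os.environ, make_schedule=crontab))`, where `app` is assumed to be the existing Celery application object. Because the schedule is built when beat starts, a changed flag would likely take effect only after a beat restart.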
Negative
- Celery + Redis is required from day 1 of the Founders feature. If the Founders promo turned on before MBT shipped, Celery would have to be running for Founders alone. Because MBT is a prerequisite for Founders (Founders receive MBT Pro-tier accounts), this is a non-issue in practice: MBT brings Celery before Founders needs it.
- Beat tick granularity is 1 minute. The daily sweep fires once at 01:00 UTC; sub-minute precision is irrelevant here.
Alternatives considered
APScheduler (in-process)
APScheduler runs inside the Flask/gunicorn process. It is simple to configure and has no external dependencies.
Rejected because:
- In a multi-worker gunicorn setup (multiple processes), APScheduler fires from every worker process simultaneously, so the sweep would run N times per scheduled firing on each machine (once per worker) unless a distributed lock is added. Adding a Redis-based lock to make APScheduler safe is most of the complexity of Celery beat, with fewer benefits.
- Failures are unobservable from the standard Celery/Flower monitoring surface.
- If the web process crashes and restarts, APScheduler's in-memory state is lost. Celery beat persists its schedule state in the beat store.
- APScheduler is not in the existing MBT stack. Adding it solely for Founders creates a second async job runtime to operate.
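To make the distributed-lock cost concrete, here is a rough sketch of the claim step each gunicorn worker would need before running the sweep. It is illustrative only and not part of the chosen design. The key format, `try_claim_sweep`, and the `InMemoryClient` stand-in are hypothetical; the `set(name, value, nx=True, ex=...)` call mirrors the redis-py client method.

```python
from datetime import date
from typing import Optional


def try_claim_sweep(client, run_date: date, worker_id: str, ttl_seconds: int = 6 * 3600) -> bool:
    """Return True only for the first worker that claims the given day's sweep."""
    key = f"founders:daily_sweep:{run_date.isoformat()}"  # hypothetical key format
    # SET with NX succeeds only if the key does not exist yet; EX expires stale claims.
    return bool(client.set(key, worker_id, nx=True, ex=ttl_seconds))


class InMemoryClient:
    """Stand-in for a Redis client so the sketch runs standalone (ignores expiry)."""

    def __init__(self) -> None:
        self._data: dict[str, str] = {}

    def set(self, name: str, value: str, nx: bool = False, ex: Optional[int] = None) -> Optional[bool]:
        if nx and name in self._data:
            return None  # mirrors redis-py: None when NX prevents the write
        self._data[name] = value
        return True
```

This covers only mutual exclusion. Lock expiry tuning, retries after a failed claim holder, and failure reporting would still have to be built by hand, which is the gap the first bullet describes.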
Standalone cron job / GitHub Actions scheduled workflow
Rejected. Cron on the server is fragile (must be configured per deployment), not portable to container-based hosting, and not visible from the app's internal monitoring. GitHub Actions scheduled workflows can be delayed under load and are not guaranteed to start on time, and they are not designed for operational DB sweeps.
Temporal workflow
Rejected for v1 per ADR 0013 rationale: adds a dedicated cluster, new mental model, and higher ops footprint. The Founders sweep is a simple daily scan — not a stateful long-running workflow. Temporal is the right answer if we later build multi-day saga orchestration; it is overkill here.
Revisit when
- MBT migrates its own jobs to Temporal. If the whole job stack moves, Founders tasks move with it.
- The Founders cohort grows large enough that the daily sweep takes more than a few minutes. At that scale, consider chunked processing (sweep N rows per task, chain tasks) or a dedicated queue.
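The chunked-processing option in the last bullet could start from a helper like the one below. This is a sketch only: `chunk_ids` and the idea of one task per chunk are assumptions, not an agreed design.

```python
from typing import Iterator, Sequence


def chunk_ids(row_ids: Sequence[int], chunk_size: int) -> Iterator[list[int]]:
    """Yield consecutive slices of row_ids, each at most chunk_size long."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(row_ids), chunk_size):
        yield list(row_ids[start:start + chunk_size])
```

Each chunk would then be handed to a per-chunk task (for example a hypothetical founders.sweep_chunk), dispatched with Celery's group or chain primitives so that no single task holds the whole cohort.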