ADR 0094 — Founders Gate: Fail-Open Posture and Overshoot Tolerance
Status: Accepted
Date: 2026-05-15 UTC
Deciders: Kristerpher Henderson (operator), software-architect
Parent design: founders-gate-signup-block-2026-05-15.md
Parent card: #492
Context
The Founders-gate signup block reads a seat count from Queue on each registration attempt (or from a 30-second in-process cache). Two design choices require explicit decisions:
- Fail posture when Queue is unreachable during a gate check. Should Raptor block all signups (fail-closed) or allow signups to proceed (fail-open) when it cannot reach Queue?
- Overshoot tolerance. Between a cache read showing gate_open: true and the actual INSERT into queue_customers, the real count could cross the threshold. Should the design attempt to prevent overshoot with a distributed lock, or accept best-effort threshold enforcement?
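The 30-second in-process cache mentioned above is the source of the stale-read window behind the overshoot question. A minimal sketch of such a cache follows; the class name SeatCountCache, the injected fetch callable, and the injectable clock are hypothetical and not taken from the Raptor codebase.

```python
import time
from typing import Callable, Optional

CACHE_TTL_SECONDS = 30.0  # the ADR states a 30-second in-process cache


class SeatCountCache:
    """Hypothetical helper: holds the last seat count for a fixed TTL."""

    def __init__(
        self,
        fetch: Callable[[], int],
        ttl: float = CACHE_TTL_SECONDS,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch          # assumed to call Queue's count endpoint
        self._ttl = ttl
        self._clock = clock          # injectable so tests can control time
        self._value: Optional[int] = None
        self._fetched_at = 0.0

    def get(self) -> int:
        now = self._clock()
        # Refresh on first use or once the cached value is at least ttl old.
        if self._value is None or now - self._fetched_at >= self._ttl:
            self._value = self._fetch()
            self._fetched_at = now
        return self._value
```

Any signup that lands inside the TTL window is judged against a count that may be up to 30 seconds old, which is why the gate can only be approximate without a lock.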
Decision
Fail-open when Queue is unreachable
When Raptor cannot reach Queue's internal count endpoint (network error, Queue outage, timeout), the gate is treated as open — signups proceed normally.
Best-effort threshold (no distributed lock)
No distributed lock is introduced around the gate check + signup INSERT. An overshoot of fewer than 5 seats is accepted as a product-level tolerance. The gate is documented as "best-effort threshold."
Consequences
Fail-open
Positive: The signup path remains available during Queue outages. Eight days before v1 launch (T-8), availability of the signup critical path outweighs the overshoot risk from a temporary Queue outage.
Negative: During a Queue outage while the cohort is near or at threshold, some signups that should have been blocked will go through. Mitigation: a Sentry alert on the Queue endpoint's 5xx rate will surface this quickly, and the operator can manually revoke overshoot accounts via the admin panel.
Invariant: The fail-open path still emits an audit_log row with action=founders.gate.check_failed (distinct from founders.gate.rejected) so volume of fail-open events is visible.
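The fail-open posture and its audit invariant can be sketched as follows. The two audit action names come from this ADR; everything else is an assumption made for illustration: the function name check_founders_gate, the GateDecision type, the injected fetch_seat_count callable, the choice to treat count >= threshold as closed, and the choice to treat OSError (which covers the standard-library connection and timeout errors) as "Queue unreachable".

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Audit action names as given in the ADR.
ACTION_CHECK_FAILED = "founders.gate.check_failed"
ACTION_REJECTED = "founders.gate.rejected"


@dataclass(frozen=True)
class GateDecision:
    allow_signup: bool
    audit_action: Optional[str]  # None when there is nothing to record


def check_founders_gate(
    fetch_seat_count: Callable[[], int], threshold: int
) -> GateDecision:
    """Hypothetical gate check: fail-open when the count cannot be read."""
    try:
        count = fetch_seat_count()
    except OSError:
        # Assumption: unreachable Queue surfaces as OSError (ConnectionError,
        # TimeoutError, ...). The gate is treated as open, but the event is
        # still recorded so fail-open volume stays visible.
        return GateDecision(allow_signup=True, audit_action=ACTION_CHECK_FAILED)

    if count >= threshold:  # assumption: the cohort is full at count == threshold
        return GateDecision(allow_signup=False, audit_action=ACTION_REJECTED)
    return GateDecision(allow_signup=True, audit_action=None)
```

Returning the audit action alongside the decision keeps the sketch free of any logging dependency; the caller would be responsible for writing the audit_log row.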
Best-effort threshold
Positive: No Redis dependency, no lock contention, simpler code surface. At v1 cohort sizes (sub-500), race conditions are rare: they require concurrent registrations at exactly the threshold boundary, which is an edge case.
Negative: Threshold is approximate, not exact. The cohort could have threshold + N seats where N is bounded by the concurrency window (expected < 5).
Accepted: The operator confirmed that a small overshoot is acceptable. Exact enforcement is not a product requirement.
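The bound on N can be made concrete with a small worst-case calculation. It assumes every in-flight request reads the same pre-insert count, passes the gate, and then inserts. The function names and the threshold of 500 used in the example are illustrative only, not values from the design.

```python
def worst_case_final_count(current: int, threshold: int, in_flight: int) -> int:
    """Seat count if all in-flight requests pass the check before any inserts."""
    if current >= threshold:
        return current  # gate already closed, so no request passes
    return current + in_flight


def worst_case_overshoot(current: int, threshold: int, in_flight: int) -> int:
    """Seats beyond the threshold in that worst case (never negative)."""
    return max(0, worst_case_final_count(current, threshold, in_flight) - threshold)


# One seat left (499 of a hypothetical 500) and 4 concurrent registrations:
# all 4 see 499, all 4 insert, the cohort ends at 503.
print(worst_case_overshoot(current=499, threshold=500, in_flight=4))  # 3
```

Under these assumptions the overshoot is at most the number of concurrent in-flight signups minus the remaining seats, which is the sense in which N is bounded by the concurrency window. Stale cache reads can widen that window to the cache TTL.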
Threshold-as-env (not feature_flags.yaml)
FOUNDERS_COHORT_THRESHOLD is an env var, not a flag YAML field. Rationale:
- The threshold is operational config with a numeric type. Feature flag YAML is boolean-oriented (on/off). Mixing numeric thresholds into the flag system adds a type-coercion surface.
- The value can be changed without a deploy (heroku config:set), same as a flag flip.
- Storing the threshold in YAML would require a PR + B1 migration for every cohort-size adjustment. An env var allows in-day adjustments during the live launch window.
- The flag FLAG_FOUNDERS_GATE remains in YAML (on/off gate for the feature). The threshold travels separately as an env var.
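Reading the threshold from the environment could look like the sketch below. Only the variable name FOUNDERS_COHORT_THRESHOLD comes from this ADR. The function name and the choice to raise on a missing or malformed value (rather than fall back to a default) are assumptions, since the ADR does not specify that behavior.

```python
import os
from typing import Mapping

ENV_VAR = "FOUNDERS_COHORT_THRESHOLD"  # name taken from the ADR


def read_cohort_threshold(env: Mapping[str, str] = os.environ) -> int:
    """Hypothetical helper: parse the cohort threshold as a non-negative int."""
    raw = env.get(ENV_VAR)
    if raw is None:
        raise ValueError(f"{ENV_VAR} is not set")
    try:
        value = int(raw.strip())
    except ValueError:
        raise ValueError(f"{ENV_VAR} must be an integer, got {raw!r}") from None
    if value < 0:
        raise ValueError(f"{ENV_VAR} must be non-negative, got {value}")
    return value
```

Env vars are always strings, so the numeric parsing happens in one explicit place instead of inside the boolean-oriented flag system.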
Alternatives considered
Fail-closed
Block all signups when Queue is unreachable. Rejected: this makes signup availability dependent on Queue availability. A Queue deploy or restart would take down signups. The signup path is higher-stakes than a brief overshoot.
Distributed lock (Redis SETNX or Postgres advisory lock)
Introduce a lock around the check + insert to guarantee exact threshold enforcement. Rejected:
- Adds a Redis or advisory-lock dependency to the registration path.
- Increases latency on every signup.
- The cohort size and launch timeline do not warrant this complexity.
- If exact enforcement is needed post-launch, it can be added as a follow-on card without breaking the current design.
Threshold in feature_flags.yaml as an integer field
Extend the flag YAML schema to support integer values. Rejected: the flag promotion pipeline, console UI, and B1 migration pattern are all boolean-oriented. Extending to integers is a larger schema change than warranted for a single threshold value. Use env var instead.