Raxx · internal docs

ADR 0094 — Founders Gate: Fail-Open Posture and Overshoot Tolerance

Status: Accepted
Date: 2026-05-15 UTC
Deciders: Kristerpher Henderson (operator), software-architect
Parent design: founders-gate-signup-block-2026-05-15.md
Parent card: #492


Context

The Founders-gate signup block reads a seat count from Queue on each registration attempt (or from a 30-second in-process cache). Two design choices require explicit decisions:

  1. Fail posture when Queue is unreachable during a gate check. Should Raptor block all signups (fail-closed) or allow signups to proceed (fail-open) when it cannot reach Queue?

  2. Overshoot tolerance. Between a cache read showing gate_open: true and the actual INSERT into queue_customers, the real count could cross the threshold. Should the design attempt to prevent overshoot with a distributed lock, or accept best-effort threshold enforcement?
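The read path described above (a live count from Queue, or a value from a 30-second in-process cache) could be sketched roughly as follows. This is an illustration under assumptions, not the actual Raptor code: `SeatCountCache`, `fetch_count`, and the injectable `clock` are hypothetical names, and only the 30-second TTL comes from this ADR.

```python
import time
from typing import Callable, Optional


class SeatCountCache:
    """In-process cache for the seat count (hypothetical helper)."""

    def __init__(
        self,
        fetch_count: Callable[[], int],
        ttl_seconds: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch_count = fetch_count  # assumed wrapper around Queue's count endpoint
        self._ttl = ttl_seconds
        self._clock = clock
        self._value: Optional[int] = None
        self._fetched_at = 0.0

    def get(self) -> int:
        """Return the cached count, refreshing it once the TTL has elapsed."""
        now = self._clock()
        if self._value is None or now - self._fetched_at >= self._ttl:
            # May raise if Queue is unreachable; the caller decides the fail posture.
            self._value = self._fetch_count()
            self._fetched_at = now
        return self._value
```

Because a cached value can be up to one TTL old, the count seen by a gate check may lag the real count, which is one source of the overshoot discussed in this ADR.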


Decision

Fail-open when Queue is unreachable

When Raptor cannot reach Queue's internal count endpoint (network error, Queue outage, timeout), the gate is treated as open — signups proceed normally.
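A minimal sketch of this posture, assuming the count is obtained through a callable that raises on network errors, outages, or timeouts. `check_gate`, `GateResult`, and `signup_allowed` are hypothetical names; the two action strings are the ones this ADR uses for audit_log rows. The `count < threshold` comparison is also an assumption, since the ADR does not say whether the boundary is inclusive.

```python
from enum import Enum
from typing import Callable


class GateResult(Enum):
    OPEN = "open"
    REJECTED = "founders.gate.rejected"
    CHECK_FAILED = "founders.gate.check_failed"


def check_gate(fetch_count: Callable[[], int], threshold: int) -> GateResult:
    """Evaluate the gate; any failure to obtain a count is reported as CHECK_FAILED."""
    try:
        count = fetch_count()
    except Exception:
        # Broad catch for the sketch: network error, Queue outage, timeout.
        return GateResult.CHECK_FAILED
    return GateResult.OPEN if count < threshold else GateResult.REJECTED


def signup_allowed(result: GateResult) -> bool:
    """Fail-open: only an explicit rejection blocks the signup."""
    return result is not GateResult.REJECTED
```

Keeping CHECK_FAILED distinct from OPEN lets the caller let the signup through while still recording that the check did not complete.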

Best-effort threshold (no distributed lock)

No distributed lock is introduced around the gate check + signup INSERT. An overshoot of fewer than 5 seats is accepted as a product-level tolerance. The gate is documented as "best-effort threshold."


Consequences

Fail-open

Positive: The signup path remains available during Queue outages. With v1 launch 8 days away (T-8), availability of the signup critical path outweighs the overshoot risk from a temporary Queue outage.

Negative: During a Queue outage while the cohort is near or at threshold, some signups that should have been blocked will go through. Mitigation: a Sentry alert on the Queue endpoint's 5xx rate will surface this quickly, and the operator can manually revoke overshoot accounts via the admin panel.

Invariant: The fail-open path still emits an audit_log row with action=founders.gate.check_failed (distinct from founders.gate.rejected) so volume of fail-open events is visible.

Best-effort threshold

Positive: No Redis dependency, no lock contention, simpler code surface. At v1 cohort sizes (sub-500), races are rare: an overshoot requires several registrations to be in flight at the same time exactly at the threshold boundary.

Negative: Threshold is approximate, not exact. The cohort could have threshold + N seats where N is bounded by the concurrency window (expected < 5).
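The bound on N can be illustrated with a small worst-case calculation. This is a sketch with illustrative numbers, not measured data: it assumes every in-flight signup reads an open gate before any of their INSERTs land, and the function names are hypothetical.

```python
def worst_case_final_count(current: int, threshold: int, in_flight: int) -> int:
    """Seat count after `in_flight` signups that all passed the check concurrently."""
    if current >= threshold:
        # Gate already closed at check time: nobody is admitted.
        return current
    # Every in-flight request saw the gate open, so all of them insert.
    return current + in_flight


def overshoot(current: int, threshold: int, in_flight: int) -> int:
    """Seats beyond the threshold in the worst case (never negative)."""
    return max(0, worst_case_final_count(current, threshold, in_flight) - threshold)
```

For example, with an illustrative threshold of 500 and 499 seats taken, five simultaneous signups would end at 504 seats, so `overshoot(499, 500, 5)` is 4. In this model the overshoot is at most `in_flight - 1`, which stays under 5 as long as no more than five signups pass the check together.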

Accepted: The operator confirmed that a small overshoot is acceptable. Exact enforcement is not a product requirement.


Threshold-as-env (not feature_flags.yaml)

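FOUNDERS_COHORT_THRESHOLD is an env var, not a flag YAML field. Rationale: the flag promotion pipeline, console UI, and B1 migration pattern are all boolean-oriented, so a single integer threshold does not justify extending the flag schema (see "Threshold in feature_flags.yaml as an integer field" under Alternatives considered).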


Alternatives considered

Fail-closed

Block all signups when Queue is unreachable. Rejected: this makes signup availability dependent on Queue availability. A Queue deploy or restart would take down signups. The signup path is higher-stakes than a brief overshoot.

Distributed lock (Redis SETNX or Postgres advisory lock)

Introduce a lock around the check + insert to guarantee exact threshold enforcement. Rejected:

  - Adds a Redis or advisory-lock dependency to the registration path.
  - Increases latency on every signup.
  - The cohort size and launch timeline do not warrant this complexity.
  - If exact enforcement is needed post-launch, it can be added as a follow-on card without breaking the current design.

Threshold in feature_flags.yaml as an integer field

Extend the flag YAML schema to support integer values. Rejected: the flag promotion pipeline, console UI, and B1 migration pattern are all boolean-oriented. Extending to integers is a larger schema change than warranted for a single threshold value. Use env var instead.
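Reading the threshold from the environment might look like the sketch below. Only the variable name FOUNDERS_COHORT_THRESHOLD comes from this ADR; `read_threshold` is a hypothetical helper, and failing loudly on a missing or malformed value is an assumption, since the ADR does not specify a default.

```python
import os
from typing import Mapping

ENV_NAME = "FOUNDERS_COHORT_THRESHOLD"


def read_threshold(env: Mapping[str, str] = os.environ) -> int:
    """Parse the cohort threshold from the environment as a non-negative integer."""
    raw = env.get(ENV_NAME)
    if raw is None:
        # Assumption: a missing threshold is a configuration error, not "no gate".
        raise RuntimeError(f"{ENV_NAME} is not set")
    try:
        value = int(raw)
    except ValueError:
        raise RuntimeError(f"{ENV_NAME} must be an integer, got {raw!r}") from None
    if value < 0:
        raise RuntimeError(f"{ENV_NAME} must be non-negative, got {value}")
    return value
```

Passing the mapping in as a parameter keeps the parser testable without touching the real process environment, for example `read_threshold({"FOUNDERS_COHORT_THRESHOLD": "250"})`.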