ADR-0088 — Stripe Webhook Failure Strategy: 5xx to Stripe, Not 2xx + Local Queue

Status: Accepted
Date: 2026-05-14 UTC
Author: software-architect
Refs: docs/architecture/queue-stripe-webhook-design-2026-05-14.md §10.1, #1682

Context

When Queue's Postgres write fails mid-webhook (steps 8-12 of the processing pipeline), the handler must choose how to signal failure to Stripe:

Option A: Return 5xx. Stripe retries with exponential backoff for up to 72 hours. The event re-enters the pipeline; because processed_stripe_events was not committed, the retry is treated as a new event.

Option B: Return 2xx and enqueue the event locally (Redis, BullMQ, Postgres job table, or SQS) for local retry. Stripe considers delivery successful and does not retry.

This choice determines whether retry responsibility lives with Stripe's infrastructure or with Queue's.

Decision

Option A: Return 5xx on DB write failure. Rely on Stripe's retry mechanism.

The processed_stripe_events dedup table provides idempotency across retries. Together, processed_stripe_events and the LWW (last-write-wins) upsert guard are sufficient to handle Stripe's retry semantics correctly.
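The failure path described above can be sketched roughly as follows. This is a minimal illustration, not Queue's actual handler: an in-memory SQLite database stands in for Postgres, the single-column processed_stripe_events schema is assumed, and handle_webhook and apply_event are hypothetical names. The point it shows is that the dedup row and the billing writes share one transaction, so a failed write leaves no dedup row behind and Stripe's retry is processed as a new event.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    # In-memory SQLite stands in for Queue's Postgres; the schema is assumed.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE processed_stripe_events (event_id TEXT PRIMARY KEY)")
    return conn


def handle_webhook(conn: sqlite3.Connection, event: dict, apply_event) -> int:
    """Return the HTTP status to send back to Stripe for one event.

    apply_event(conn, event) is a hypothetical callback that performs the
    billing-state writes inside the same transaction as the dedup insert.
    """
    try:
        with conn:  # commits on success, rolls back if the block raises
            cur = conn.execute(
                "INSERT OR IGNORE INTO processed_stripe_events (event_id) VALUES (?)",
                (event["id"],),
            )
            if cur.rowcount == 0:
                return 200  # already processed: acknowledge the redelivery
            apply_event(conn, event)
        return 200
    except sqlite3.Error:
        # The dedup row was rolled back with everything else, so the next
        # Stripe retry is handled as if the event had never been seen.
        return 500
```

A failed attempt returns 500 and leaves processed_stripe_events empty; the retry of the same event then succeeds, and any later redelivery is acknowledged with 200 without reapplying it.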

Consequences

Positive:

No additional infra dependency (no Redis, no job table schema, no worker process).
Retry semantics are provided by Stripe, whose battle-tested exponential backoff runs for up to 72 hours. That is effectively unlimited for the operational window that matters.
Simpler codebase: the failure path is "fail loud, let Stripe retry" rather than "enqueue, poll, retry, deduplicate local queue against webhook dedup".
Dead-letter semantics: if a webhook still fails to process after 72 hours, Stripe stops retrying and records the failure, which is visible in the Stripe dashboard. An operator can then manually trigger a resync via the reconciler (/api/billing/reconcile).

Negative:

If Queue is down for more than 72 hours, Stripe abandons the event delivery. Residual drift is caught by the nightly reconciler within 24h of Queue recovery.
During a prolonged outage, customers whose billing state changed during the outage may have stale local state until the reconciler runs. The reconciler does not auto-correct (it logs drift for operator review per stripe-customer-billing.md §5.3).

Mitigations for the negatives:

72-hour Stripe retry window exceeds any realistic Queue outage for a Heroku-hosted service.
The nightly reconciler is an independent catch-all that detects drift regardless of cause and logs it for operator review.
FLAG_QUEUE_BILLING=false kill-switch returns 503, which Stripe treats as a failed delivery and retries automatically. This is intentional and tested in the E-7 staging soak.
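To make the first mitigation concrete, the sketch below works out when each retry would fire if the approximate intervals quoted under Alternatives Considered are taken at face value. Those intervals come from this ADR, not from an official Stripe schedule, and the function names are hypothetical.

```python
from datetime import timedelta

# Approximate retry intervals as quoted in this ADR (an assumption, not an
# official Stripe schedule).
RETRY_INTERVALS = [
    timedelta(seconds=5),
    timedelta(minutes=5),
    timedelta(minutes=30),
    timedelta(hours=2),
    timedelta(hours=5),
    timedelta(hours=10),
    timedelta(hours=24),
]


def retry_offsets(intervals=RETRY_INTERVALS):
    """Cumulative time after the first failed delivery at which each retry fires."""
    offsets, elapsed = [], timedelta(0)
    for gap in intervals:
        elapsed += gap
        offsets.append(elapsed)
    return offsets


def first_retry_after(outage: timedelta):
    """First listed retry that lands at or after the end of an outage, or None."""
    for offset in retry_offsets():
        if offset >= outage:
            return offset
    return None
```

Under these assumed intervals, an event that first fails at the start of a 3-hour outage is next retried about 7 hours 35 minutes after that first failure, so local state can lag Stripe by several hours even after Queue recovers. The listed intervals sum to roughly 41.6 hours, so the list alone does not span the full 72-hour window.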

Alternatives Considered

Option B: 2xx + Local Queue

Rejected for the following reasons:

Adds infra complexity. A local job queue (Redis/BullMQ or a Postgres pending_events table) is another service or schema to maintain, monitor, and back up. Queue Phase 1 deliberately minimizes infra surface.
Doubles the deduplication problem. With a local queue, there are now two dedup surfaces: Stripe's (suppressed by 2xx) and the local queue's (must deduplicate retries internally). The processed_stripe_events table already solves dedup for Stripe retries; a local queue adds a second dedup layer for its own retry loop.
Stripe's retry semantics are well-specified. Stripe guarantees ≥1 delivery attempt per event within 72 hours. The retry schedule is: 5s, 5m, 30m, 2h, 5h, 10h, 24h (approximately). This is reliable enough for billing state updates where the nightly reconciler is a backstop.
Events in the Stripe customer.subscription.* chain are ordered by their created timestamp. Even if retries cause some events to arrive late or out of order, the LWW guard (updated_at < EXCLUDED.updated_at) ensures an older event never overwrites newer state.
No customer-visible impact for typical DB failures. The billing state update path is not in the hot path of customer API requests — it is an async background path. A short DB hiccup that causes Stripe to retry does not affect the customer until the mirror row is stale, and the JIT paywall check's fail-closed behavior already handles that edge correctly.
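The LWW guard mentioned above can be illustrated with a small upsert. This is a sketch under assumptions: SQLite (3.24 or later, for ON CONFLICT ... DO UPDATE) stands in for Postgres, and the subscription_mirror table, its columns, and the helper names are hypothetical. Only the guard condition itself comes from this ADR.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical mirror table; the real schema is not specified in this ADR.
conn.execute(
    "CREATE TABLE subscription_mirror ("
    " subscription_id TEXT PRIMARY KEY,"
    " status TEXT NOT NULL,"
    " updated_at INTEGER NOT NULL)"
)

# The WHERE clause is the LWW guard: the update applies only when the incoming
# row is strictly newer than the stored one.
UPSERT = """
INSERT INTO subscription_mirror (subscription_id, status, updated_at)
VALUES (?, ?, ?)
ON CONFLICT (subscription_id) DO UPDATE
SET status = excluded.status, updated_at = excluded.updated_at
WHERE subscription_mirror.updated_at < excluded.updated_at
"""


def apply(subscription_id: str, status: str, created: int) -> None:
    """Upsert one event's state, keyed on the event's created timestamp."""
    with conn:
        conn.execute(UPSERT, (subscription_id, status, created))


def current_status(subscription_id: str):
    row = conn.execute(
        "SELECT status FROM subscription_mirror WHERE subscription_id = ?",
        (subscription_id,),
    ).fetchone()
    return row[0] if row else None


apply("sub_1", "canceled", 200)  # newer event arrives first
apply("sub_1", "active", 100)    # older event, delayed by a retry, is ignored
```

After both calls the stored status is still "canceled". Because the comparison is strict, an event carrying the same timestamp as the stored row is also ignored.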

Option C: 2xx and accept potential data loss

Rejected outright. Acknowledging a money-state event without processing it is not acceptable. Audit trail invariant I-2 requires every money-state change to write to billing_action_log. Silently dropping an event violates that invariant.