ADR-0088 — Stripe Webhook Failure Strategy: 5xx to Stripe, Not 2xx + Local Queue
Status: Accepted
Date: 2026-05-14 UTC
Author: software-architect
Refs: docs/architecture/queue-stripe-webhook-design-2026-05-14.md §10.1, #1682
Context
When Queue's Postgres write fails mid-webhook (steps 8-12 of the processing pipeline), the handler must choose how to signal failure to Stripe:
Option A: Return 5xx. Stripe retries with exponential backoff for up to 72 hours. The event re-enters the pipeline; because processed_stripe_events was not committed, the retry is treated as a new event.
Option B: Return 2xx and enqueue the event locally (Redis, BullMQ, Postgres job table, or SQS) for local retry. Stripe considers delivery successful and does not retry.
This is a meaningful architectural choice: it determines whether retry responsibility lives with Stripe's infrastructure or with Queue's.
Decision
Option A: Return 5xx on DB write failure. Rely on Stripe's retry mechanism.
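The processed_stripe_events dedup table provides idempotency across retries. Together, processed_stripe_events and the LWW upsert guard are sufficient to handle Stripe's retry semantics correctly: the dedup table absorbs repeated deliveries of an event that was already committed, and the guard keeps a late retry from overwriting newer state.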
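The failure path described above can be sketched as follows. This is a minimal illustration, not Queue's handler: SQLite stands in for Postgres so the example runs on its own, and the subscription_mirror table, its columns, the event dictionary shape, and the simulate_db_failure switch are all assumptions made for the example. Only the processed_stripe_events name and the LWW comparison come from this ADR.

```python
import sqlite3

# Hypothetical minimal schema (assumption); SQLite stands in for Postgres.
SCHEMA = """
CREATE TABLE processed_stripe_events (event_id TEXT PRIMARY KEY);
CREATE TABLE subscription_mirror (
    subscription_id TEXT PRIMARY KEY,
    status          TEXT NOT NULL,
    updated_at      INTEGER NOT NULL
);
"""

# Last-write-wins upsert: only overwrite when the incoming row is newer.
LWW_UPSERT = """
INSERT INTO subscription_mirror (subscription_id, status, updated_at)
VALUES (?, ?, ?)
ON CONFLICT (subscription_id) DO UPDATE
SET status = excluded.status, updated_at = excluded.updated_at
WHERE subscription_mirror.updated_at < excluded.updated_at
"""


def handle_webhook(conn, event, simulate_db_failure=False):
    """Return the HTTP status the endpoint would send back to Stripe."""
    try:
        with conn:  # one transaction: commit on success, roll back on error
            cur = conn.execute(
                "INSERT OR IGNORE INTO processed_stripe_events (event_id) VALUES (?)",
                (event["id"],),
            )
            if cur.rowcount == 0:
                return 200  # already committed earlier: acknowledge the duplicate
            if simulate_db_failure:
                raise sqlite3.OperationalError("simulated write failure")
            conn.execute(
                LWW_UPSERT,
                (event["subscription_id"], event["status"], event["created"]),
            )
        return 200
    except sqlite3.Error:
        # The dedup row was rolled back with everything else, so Stripe's
        # retry of this event is processed as if it were new.
        return 500


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
event = {"id": "evt_1", "subscription_id": "sub_1", "status": "active", "created": 100}
print(handle_webhook(conn, event, simulate_db_failure=True))  # 500
print(handle_webhook(conn, event))  # 200, the retry is processed
print(handle_webhook(conn, event))  # 200, duplicate is acknowledged
```

The point of the sketch is that the dedup insert and the state write share one transaction, so a 5xx response always corresponds to "nothing was committed".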
Consequences
Positive:
- No additional infra dependency (no Redis, no job table schema, no worker process).
- Retry semantics are provided by Stripe, which has battle-tested exponential backoff over a 72-hour window. That is far longer than any DB failure this path is expected to see.
- Simpler codebase: the failure path is "fail loud, let Stripe retry" rather than "enqueue, poll, retry, deduplicate local queue against webhook dedup".
- Dead-letter semantics: if a webhook fails to process for 72 hours, Stripe stops retrying and logs the failure. This is visible in the Stripe dashboard. Operator can manually trigger resync via the reconciler (/api/billing/reconcile).
Negative:
- If Queue is down for more than 72 hours, Stripe abandons the event delivery. Residual drift is caught by the nightly reconciler within 24h of Queue recovery.
- During a prolonged outage, customers whose billing state changed during the outage may have stale local state until the reconciler runs. The reconciler does not auto-correct (it logs drift for operator review per stripe-customer-billing.md §5.3).
Mitigations for the negatives:
- 72-hour Stripe retry window exceeds any realistic Queue outage for a Heroku-hosted service.
- The nightly reconciler is an independent catch-all that detects drift regardless of cause and logs it for operator review.
- FLAG_QUEUE_BILLING=false kill-switch returns 503, which Stripe queues and replays automatically. This is intentional and tested in E-7 staging soak.
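To make the reconciler's role concrete, here is a minimal sketch of a drift check that reports differences without correcting them, matching the log-for-review behavior described under Negative. The actual reconciler behind /api/billing/reconcile is not shown in this ADR; the Drift record, the find_drift function, and the plain dictionaries keyed by subscription ID are hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Drift:
    """One mismatch between Stripe's view and the local mirror (hypothetical shape)."""
    subscription_id: str
    stripe_status: Optional[str]  # None if Stripe has no such subscription
    local_status: Optional[str]   # None if the local mirror has no row


def find_drift(stripe_state: Dict[str, str], local_state: Dict[str, str]) -> List[Drift]:
    """Compare both views and return every mismatch; nothing is modified."""
    drift = []
    for sub_id in sorted(stripe_state.keys() | local_state.keys()):
        stripe_status = stripe_state.get(sub_id)
        local_status = local_state.get(sub_id)
        if stripe_status != local_status:
            drift.append(Drift(sub_id, stripe_status, local_status))
    return drift


# Example: sub_2 changed during an outage, sub_3 was never mirrored.
stripe_state = {"sub_1": "active", "sub_2": "canceled", "sub_3": "active"}
local_state = {"sub_1": "active", "sub_2": "active"}
for entry in find_drift(stripe_state, local_state):
    print(entry)  # an operator reviews these entries; no automatic repair
```

Keeping the check read-only is what makes the reconciler safe to run after any outage: it cannot make the local state worse, only make the drift visible.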
Alternatives Considered
Option B: 2xx + Local Queue
Rejected for the following reasons:
- Adds infra complexity. A local job queue (Redis/BullMQ or a Postgres pending_events table) is another service or schema to maintain, monitor, and back up. Queue Phase 1 deliberately minimizes infra surface.
- Doubles the deduplication problem. With a local queue, there are now two dedup surfaces: Stripe's (suppressed by 2xx) and the local queue's (must deduplicate retries internally). The processed_stripe_events table already solves dedup for Stripe retries; a local queue adds a second dedup layer for its own retry loop.
- Stripe's retry semantics are well-specified. Stripe guarantees ≥1 delivery attempt per event within 72 hours. The retry schedule is: 5s, 5m, 30m, 2h, 5h, 10h, 24h (approximately). This is reliable enough for billing state updates where the nightly reconciler is a backstop.
- The Stripe customer.subscription.* event chain is strictly ordered by created timestamp. Even if some events are delayed due to retries, the LWW guard (updated_at < EXCLUDED.updated_at) ensures out-of-order delivery does not corrupt state.
- No customer-visible impact for typical DB failures. The billing state update path is not in the hot path of customer API requests; it is an async background path. A short DB hiccup that causes Stripe to retry does not affect the customer until the mirror row is stale, and the JIT paywall check's fail-closed behavior already handles that edge correctly.
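The ordering argument in the list above can be checked with a small experiment. The sketch below applies the same three subscription events in order and then with one event delayed, and shows that the guard leaves the mirror in the same final state. SQLite stands in for Postgres, and the subscription_mirror table, its columns, and the sample events are assumptions made for the example; only the updated_at < EXCLUDED.updated_at comparison comes from this ADR.

```python
import sqlite3

# Last-write-wins upsert using the guard quoted in this ADR.
LWW_UPSERT = """
INSERT INTO subscription_mirror (subscription_id, status, updated_at)
VALUES (?, ?, ?)
ON CONFLICT (subscription_id) DO UPDATE
SET status = excluded.status, updated_at = excluded.updated_at
WHERE subscription_mirror.updated_at < excluded.updated_at
"""


def apply_events(events):
    """Apply (subscription_id, status, created) tuples in the given order
    to a fresh in-memory mirror and return the final (status, updated_at)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE subscription_mirror ("
        "subscription_id TEXT PRIMARY KEY, "
        "status TEXT NOT NULL, "
        "updated_at INTEGER NOT NULL)"
    )
    for subscription_id, status, created in events:
        with conn:  # each delivery is its own transaction
            conn.execute(LWW_UPSERT, (subscription_id, status, created))
    return conn.execute(
        "SELECT status, updated_at FROM subscription_mirror"
    ).fetchone()


in_order = [("sub_1", "trialing", 100), ("sub_1", "active", 200), ("sub_1", "canceled", 300)]
# Same events, but the "active" event is delivered last after a retry.
delayed = [("sub_1", "trialing", 100), ("sub_1", "canceled", 300), ("sub_1", "active", 200)]

print(apply_events(in_order))  # ('canceled', 300)
print(apply_events(delayed))   # ('canceled', 300)
```

One property worth noting in this sketch: because the comparison is strict, two events carrying the same timestamp resolve to whichever is applied first, so the guard's correctness depends on the timestamps being distinct enough to order the events.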
Option C: 2xx and accept potential data loss
Rejected outright. Acknowledging a money-state event without processing it is not acceptable. Audit trail invariant I-2 requires every money-state change to write to billing_action_log. Silently dropping an event violates that invariant.