Raxx · internal docs

ADR-0088 — Stripe Webhook Failure Strategy: 5xx to Stripe, Not 2xx + Local Queue

Status: Accepted
Date: 2026-05-14 UTC
Author: software-architect
Refs: docs/architecture/queue-stripe-webhook-design-2026-05-14.md §10.1, #1682


Context

When Queue's Postgres write fails mid-webhook (steps 8-12 of the processing pipeline), the handler must choose how to signal failure to Stripe:

Option A: Return 5xx. Stripe retries with exponential backoff for up to 72 hours. The event re-enters the pipeline; because processed_stripe_events was not committed, the retry is treated as a new event.

Option B: Return 2xx and enqueue the event locally (Redis, BullMQ, Postgres job table, or SQS) for local retry. Stripe considers delivery successful and does not retry.

This is a meaningful architectural choice: it determines whether retry responsibility lives with Stripe's infrastructure or with Queue's own.
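The contract behind Option A can be sketched as a toy model. The names below (TransientDbError, handle_webhook, deliver_with_retries, make_flaky_processor) are hypothetical and are not taken from Queue's codebase; the sender loop only imitates "redeliver until a 2xx comes back" and ignores backoff timing.

```python
from typing import Callable, List


class TransientDbError(Exception):
    """Stand-in for a failed Postgres write (hypothetical name)."""


def handle_webhook(event: dict, process: Callable[[dict], None]) -> int:
    """Option A: a failed write becomes a 5xx, so the sender redelivers."""
    try:
        process(event)
    except TransientDbError:
        return 500  # nothing was committed; the retry is treated as a new event
    return 200


def deliver_with_retries(
    event: dict, process: Callable[[dict], None], max_attempts: int = 5
) -> List[int]:
    """Toy sender: redeliver until a 2xx is returned or attempts run out."""
    statuses: List[int] = []
    for _ in range(max_attempts):
        status = handle_webhook(event, process)
        statuses.append(status)
        if 200 <= status < 300:
            break
    return statuses


def make_flaky_processor(failures: int):
    """Build a processor that fails `failures` times, then succeeds."""
    state = {"remaining": failures, "processed": []}

    def process(event: dict) -> None:
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise TransientDbError("simulated write failure")
        state["processed"].append(event["id"])

    return process, state


process, state = make_flaky_processor(failures=2)
statuses = deliver_with_retries({"id": "evt_1"}, process)
```

In this model the handler keeps no retry state of its own: the sender holds the event until it sees a 2xx. Option B would move that state into Queue.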


Decision

Option A: Return 5xx on DB write failure. Rely on Stripe's retry mechanism.

The processed_stripe_events dedup table provides idempotency across retries. Together with the last-write-wins (LWW) upsert guard, it is sufficient to handle Stripe's retry semantics correctly.
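A minimal sketch of how the dedup table and the LWW guard might combine inside one transaction. Several things here are assumptions for illustration only: sqlite3 stands in for Postgres so the example is self-contained, the subscription_mirror table and its columns are hypothetical, and the event's created value is stored as updated_at. Only processed_stripe_events and the guard condition come from this ADR.

```python
import sqlite3

SCHEMA = """
CREATE TABLE processed_stripe_events (event_id TEXT PRIMARY KEY);
CREATE TABLE subscription_mirror (  -- hypothetical mirror table
    subscription_id TEXT PRIMARY KEY,
    status TEXT NOT NULL,
    updated_at INTEGER NOT NULL
);
"""

# LWW guard: overwrite the row only when the incoming event is newer.
UPSERT = """
INSERT INTO subscription_mirror (subscription_id, status, updated_at)
VALUES (?, ?, ?)
ON CONFLICT (subscription_id) DO UPDATE
SET status = excluded.status, updated_at = excluded.updated_at
WHERE subscription_mirror.updated_at < excluded.updated_at
"""


def apply_event(conn, event_id, subscription_id, status, created, fail_write=False):
    """Return the HTTP status the handler would send back to Stripe."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            cur = conn.execute(
                "INSERT OR IGNORE INTO processed_stripe_events (event_id) VALUES (?)",
                (event_id,),
            )
            if cur.rowcount == 0:
                return 200  # duplicate delivery: acknowledge without reprocessing
            if fail_write:
                # Simulated mid-pipeline failure; the dedup row is rolled back too.
                raise sqlite3.OperationalError("simulated write failure")
            conn.execute(UPSERT, (subscription_id, status, created))
        return 200
    except sqlite3.Error:
        return 500  # nothing committed, so Stripe's retry is treated as new


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

first = apply_event(conn, "evt_2", "sub_1", "active", 200, fail_write=True)
retry = apply_event(conn, "evt_2", "sub_1", "active", 200)
duplicate = apply_event(conn, "evt_2", "sub_1", "active", 200)
late = apply_event(conn, "evt_1", "sub_1", "trialing", 100)  # older event arrives last

row = conn.execute("SELECT status, updated_at FROM subscription_mirror").fetchone()
seen = conn.execute("SELECT COUNT(*) FROM processed_stripe_events").fetchone()[0]
```

The dedup insert and the upsert share one transaction, so a failed write leaves no dedup row behind and the retry runs the full pipeline again. The late, older event is recorded as processed, but the guard stops it from overwriting the newer state.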


Consequences

Positive:

Negative:

Mitigations for the negatives:


Alternatives Considered

Option B: 2xx + Local Queue

Rejected for the following reasons:

  1. Adds infra complexity. A local job queue (Redis/BullMQ or a Postgres pending_events table) is another service or schema to maintain, monitor, and back up. Queue Phase 1 deliberately minimizes infra surface.

  2. Doubles the deduplication problem. With a local queue, there are now two dedup surfaces: Stripe's (suppressed by 2xx) and the local queue's (must deduplicate retries internally). The processed_stripe_events table already solves dedup for Stripe retries; a local queue adds a second dedup layer for its own retry loop.

  3. Stripe's retry semantics are well-specified. Stripe delivers each event at least once and keeps retrying failed deliveries for up to 72 hours. The retry schedule is approximately 5s, 5m, 30m, 2h, 5h, 10h, 24h. This is reliable enough for billing state updates where the nightly reconciler is a backstop.

  4. The Stripe customer.subscription.* event chain can be ordered by created timestamp, even though delivery order is not guaranteed. If some events are delayed by retries, the LWW guard (updated_at < EXCLUDED.updated_at) ensures out-of-order delivery does not corrupt state.

  5. No customer-visible impact for typical DB failures. The billing state update path is not in the hot path of customer API requests — it is an async background path. A short DB hiccup that causes Stripe to retry does not affect the customer until the mirror row is stale, and the JIT paywall check's fail-closed behavior already handles that edge correctly.
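The approximate retry schedule quoted in item 3 can be checked against the 72-hour window with some quick arithmetic. This sketch assumes each delay is measured from the previous attempt; the ADR does not say whether the values are intervals or offsets from the first delivery. The helper first_retry_after is hypothetical.

```python
from datetime import timedelta
from itertools import accumulate
from typing import Optional

# Approximate delays from item 3, assumed to be intervals between attempts.
RETRY_DELAYS = [
    timedelta(seconds=5),
    timedelta(minutes=5),
    timedelta(minutes=30),
    timedelta(hours=2),
    timedelta(hours=5),
    timedelta(hours=10),
    timedelta(hours=24),
]
RETRY_WINDOW = timedelta(hours=72)

# Offset of each retry from the first (failed) delivery attempt.
retry_offsets = list(accumulate(RETRY_DELAYS))
total = retry_offsets[-1]


def first_retry_after(outage: timedelta) -> Optional[timedelta]:
    """Offset of the first scheduled retry at or after the outage ends."""
    for offset in retry_offsets:
        if offset >= outage:
            return offset
    return None  # the outage outlasts the listed schedule


short_outage = first_retry_after(timedelta(minutes=10))
long_outage = first_retry_after(timedelta(hours=48))
```

Under that assumption the listed retries span 41h 35m 5s, which fits inside the 72-hour window. A 10-minute database outage would be picked up by the retry at about 35 minutes. An outage of 48 hours would outlast the listed schedule, which is the case where the nightly reconciler acts as the backstop.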

Option C: 2xx and accept potential data loss

Rejected outright. Acknowledging a money-state event without processing it is not acceptable. Audit trail invariant I-2 requires every money-state change to write to billing_action_log. Silently dropping an event violates that invariant.