Key considerations for designing a resilient payment gateway

Question

Design Gurus · Accepted Answer

A payment gateway is the system that authorizes, routes, and settles financial transactions between a customer's bank and a merchant's bank—encrypting card data, performing fraud screening, communicating with card networks (Visa, Mastercard), and returning authorization decisions in under two seconds. Payment systems are the canonical test of whether a candidate understands correctness at scale. Most distributed systems happily trade consistency for availability; payments cannot. A double-charge is a regulatory event, a trust event, and sometimes a legal event. When an interviewer says "Design Stripe," they are probing whether you grasp that payments are state machines with external actors that cannot be rolled back—and that the hard problems (idempotency, exactly-once execution, ledger correctness, reconciliation) are invisible until you reason about failure modes explicitly.

Key Takeaways

Correctness over availability. Payment systems are CP (consistent and partition-tolerant). A user seeing a brief error is acceptable; charging them twice is not. Every architectural decision must prioritize financial accuracy.  
Idempotency is non-negotiable. Network failures, client retries, and timeout-induced duplicates are inevitable. Every payment request must include a client-generated idempotency key. The server stores key → result mappings atomically with ledger entries and returns the stored result on replay.  
Double-entry bookkeeping is mandatory. Every payment creates two ledger entries: a debit and a credit. The sum of all entries must always be zero. This is how production systems detect bugs before they cost millions.  
Payment transactions are state machines with defined transitions: created → processing → authorized → captured → settled (or failed/refunded at various points). Explicit state tracking prevents the system from relying on a single event for correctness.  
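Tokenization shrinks your PCI DSS scope. Client-side SDKs (Stripe.js, Braintree SDK) collect raw card data in the browser and return a single-use token. Your backend never sees the card number, so most card-data handling requirements fall on the provider, although PCI DSS obligations do not disappear entirely.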

Step 1: Requirements and Scope

Functional requirements:

Accept payments: Process credit/debit card payments from customers.
Payment lifecycle: Support authorization, capture, settlement, and payout.
Refunds: Process full and partial refunds for completed payments.
Multi-currency: Accept payments in multiple currencies.
Webhooks: Notify merchants of payment status changes asynchronously.
Dashboard: Provide merchants with transaction history, analytics, and reconciliation tools.

Non-functional requirements:

Correctness: Zero duplicate charges. Zero lost transactions. Financial accuracy is the primary constraint.
Availability: 99.99% uptime, because payment downtime directly blocks merchant revenue.
Latency: Authorization response within 2 seconds.
Security: PCI DSS compliance. Encryption at rest and in transit. No raw card data in our systems.
Consistency: Strong consistency for ledger operations. Eventual consistency is not acceptable for financial records.
Scalability: Handle 10M transactions per day (~116 QPS average, ~350 QPS peak).
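
The throughput figures in the scalability requirement follow from simple arithmetic. The sketch below reproduces them; the 3x peak factor is an assumption inferred from the quoted numbers, not something the requirements state.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

daily_transactions = 10_000_000
average_qps = daily_transactions / SECONDS_PER_DAY

# Assumption: peak traffic is roughly 3x the daily average.
PEAK_FACTOR = 3
peak_qps = average_qps * PEAK_FACTOR

print(f"average: {average_qps:.0f} QPS, peak: {peak_qps:.0f} QPS")
# prints: average: 116 QPS, peak: 347 QPS
```

The peak lands at about 347 QPS, which the requirement rounds to ~350.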

Interview tip: Ask the interviewer: "Are we designing the gateway that communicates with card networks, or the merchant-facing payment service that integrates with an existing gateway like Stripe?" This scoping question dramatically changes the design's complexity.

Step 2: The Payment Lifecycle — A State Machine

Payments are not single events—they are state machines with defined transitions. Modeling them explicitly prevents the most common payment bugs.

State	Description	Trigger
Created	Payment intent recorded; no money moved	Merchant initiates payment
Processing	Request sent to payment service provider (PSP)	System submits to card network
Authorized	Card network approves; funds held on customer's card	PSP returns approval
Captured	Funds transferred from customer's bank to merchant's acquiring bank	Merchant confirms delivery
Settled	Funds deposited into merchant's account	End-of-day batch settlement
Failed	Authorization denied by card network or bank	PSP returns decline
Refunded	Funds returned to customer after capture	Merchant initiates refund

Why state machines matter: A payment stuck in "processing" after a network timeout is not lost—the system knows its last known state and can query the PSP for the actual outcome. Without explicit state tracking, the system might re-submit the payment (causing a double-charge) or silently drop it (causing a lost transaction).

Interview application: "Every payment is a state machine with defined transitions. The payment service persists the current state in PostgreSQL before making any external call. If the PSP call times out, the payment remains in 'processing' state. A background reconciler queries the PSP for the actual outcome and advances the state accordingly. This eliminates the ambiguity window where the system does not know what happened."
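
A minimal sketch of the lifecycle as an explicit state machine follows. The state names come from the table above; the exact set of allowed transitions, the `Payment` class, and the `InvalidTransition` exception are illustrative assumptions, and partial refunds and persistence are left out.

```python
from enum import Enum


class PaymentState(Enum):
    CREATED = "created"
    PROCESSING = "processing"
    AUTHORIZED = "authorized"
    CAPTURED = "captured"
    SETTLED = "settled"
    FAILED = "failed"
    REFUNDED = "refunded"


# Assumed transition map; any move not listed here is rejected.
ALLOWED = {
    PaymentState.CREATED: {PaymentState.PROCESSING, PaymentState.FAILED},
    PaymentState.PROCESSING: {PaymentState.AUTHORIZED, PaymentState.FAILED},
    PaymentState.AUTHORIZED: {PaymentState.CAPTURED, PaymentState.FAILED},
    PaymentState.CAPTURED: {PaymentState.SETTLED, PaymentState.REFUNDED},
    PaymentState.SETTLED: {PaymentState.REFUNDED},
    PaymentState.FAILED: set(),
    PaymentState.REFUNDED: set(),
}


class InvalidTransition(Exception):
    """Raised when a caller attempts a move the state machine forbids."""


class Payment:
    def __init__(self, payment_id):
        self.payment_id = payment_id
        self.state = PaymentState.CREATED
        self.history = [PaymentState.CREATED]

    def transition(self, new_state):
        if new_state not in ALLOWED[self.state]:
            raise InvalidTransition(f"{self.state.value} -> {new_state.value}")
        self.state = new_state
        self.history.append(new_state)


payment = Payment("pay_123")
payment.transition(PaymentState.PROCESSING)
payment.transition(PaymentState.AUTHORIZED)
print([s.value for s in payment.history])
# prints: ['created', 'processing', 'authorized']
```

In a real service the new state would be written to the database before the PSP call, so a payment left in "processing" after a timeout can be picked up by the reconciler.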

Step 3: Idempotency — Preventing Double Charges

The fundamental challenge in payment systems: networks are unreliable. A client sends a payment request, the server charges the card, but the response is lost due to a timeout. The client retries—and without idempotency, the customer is charged twice.

Implementation:

The client generates a UUID (idempotency key) per payment intent—before any network communication. Every retry of the same intent uses the same UUID. The server checks if the idempotency key exists in the database before processing. If it exists, the server returns the stored result without re-executing. The idempotency key and ledger entry are written in the same ACID transaction—ensuring atomicity.
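
The steps above can be sketched as follows. This is an illustration under stated assumptions: an in-memory SQLite database stands in for PostgreSQL, the table and column names and the `charge` function are hypothetical, and the PSP call is omitted. Instead of a separate check-then-insert, the sketch relies on the primary-key constraint to reject a duplicate key, which also covers two concurrent requests carrying the same key.

```python
import json
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE idempotency_keys (
        idem_key TEXT PRIMARY KEY,
        response TEXT NOT NULL
    );
    CREATE TABLE ledger (
        id           INTEGER PRIMARY KEY,
        idem_key     TEXT NOT NULL,
        account      TEXT NOT NULL,
        amount_cents INTEGER NOT NULL
    );
    """
)


def charge(idem_key, customer, merchant, amount_cents):
    """Record a charge once; a replay returns the stored response."""
    response = json.dumps({"status": "succeeded", "amount_cents": amount_cents})
    try:
        with conn:  # one transaction: commits on success, rolls back on error
            # A duplicate key violates the primary key and aborts the transaction.
            conn.execute(
                "INSERT INTO idempotency_keys VALUES (?, ?)", (idem_key, response)
            )
            conn.executemany(
                "INSERT INTO ledger (idem_key, account, amount_cents) VALUES (?, ?, ?)",
                [(idem_key, customer, -amount_cents), (idem_key, merchant, amount_cents)],
            )
        return json.loads(response)
    except sqlite3.IntegrityError:
        row = conn.execute(
            "SELECT response FROM idempotency_keys WHERE idem_key = ?", (idem_key,)
        ).fetchone()
        return json.loads(row[0])


key = str(uuid.uuid4())  # generated once per payment intent, reused on retry
first = charge(key, "customer:42", "merchant:7", 5000)
second = charge(key, "customer:42", "merchant:7", 5000)  # simulated retry
print(first == second)  # prints: True
```

The retry returns the stored result and writes no additional ledger rows, because the key and the ledger entries commit or roll back together.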

Three-layer enforcement (staff-level depth):

Layer 1, API gateway: Deduplicates by client-generated UUID with 24-hour TTL.
Layer 2, Payment service: Checks idempotency key against the database before PSP submission.
Layer 3, PSP-level: Passes the idempotency key to the external gateway (Stripe, Adyen) to ensure the external charge is never duplicated even if retry logic fires.

The vulnerability window: Between the PSP charging the card and the database recording the result, a crash causes inconsistency. Close this window with a write-ahead log: persist the PSP response to a durable log before updating the database. On restart, replay the log to complete interrupted transactions.
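
One way to picture the write-ahead log is the sketch below. It assumes a JSON-lines file as the durable log and a plain dict standing in for the database; `append_to_log`, `replay`, and the record fields are hypothetical names chosen for illustration.

```python
import json
import os
import tempfile


def append_to_log(log_path, record):
    """Durably append one PSP response before any database update."""
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(record) + "\n")
        log.flush()
        os.fsync(log.fileno())  # force the bytes to disk


def replay(log_path, database):
    """Apply logged responses the database is missing; return their IDs."""
    recovered = []
    if not os.path.exists(log_path):
        return recovered
    with open(log_path, encoding="utf-8") as log:
        for line in log:
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                break  # torn final write from a crash; stop here
            if record["payment_id"] not in database:
                database[record["payment_id"]] = record["status"]
                recovered.append(record["payment_id"])
    return recovered


# Simulate a crash after pay_2 was logged but before its database update.
log_path = os.path.join(tempfile.mkdtemp(), "psp_responses.log")
database = {}
append_to_log(log_path, {"payment_id": "pay_1", "status": "authorized"})
database["pay_1"] = "authorized"
append_to_log(log_path, {"payment_id": "pay_2", "status": "authorized"})

recovered = replay(log_path, database)
print(recovered)  # prints: ['pay_2']
```

Replay is itself idempotent here: running it a second time finds nothing left to apply.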

Interview application: "Idempotency is enforced at three layers. The API gateway deduplicates by UUID. The payment service checks the database. The PSP call includes the idempotency key. The vulnerability window between PSP charge and database write is closed by a write-ahead log that replays on startup. Keys expire after 24–72 hours to bound storage growth."

Step 4: Double-Entry Ledger — Financial Accuracy

Every payment creates two ledger entries: a debit from the customer and a credit to the merchant. The sum of all ledger entries must always be zero. This is not optional—it is how production payment systems detect bugs, prevent discrepancies, and satisfy auditors.

Example: Customer pays $50 for a product.

Entry	Account	Direction	Amount
1	Customer (liability)	Debit	-$50.00
2	Merchant (receivable)	Credit	+$50.00

If a refund is issued:

Entry	Account	Direction	Amount
3	Merchant (receivable)	Debit	-$50.00
4	Customer (liability)	Credit	+$50.00

The sum is always zero. Any imbalance indicates a bug. Automated reconciliation checks run hourly to verify ledger balance, and alerts fire immediately on any discrepancy.
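
A minimal sketch of the balance check, using the four entries from the example. Amounts are stored as integer cents (an assumption made here to avoid floating-point rounding), and the record layout and the `unbalanced_payments` helper are hypothetical.

```python
from collections import defaultdict

# The $50 payment and its refund, in integer cents.
ledger = [
    {"entry": 1, "payment": "pay_1", "account": "customer", "amount_cents": -5000},
    {"entry": 2, "payment": "pay_1", "account": "merchant", "amount_cents": 5000},
    {"entry": 3, "payment": "pay_1", "account": "merchant", "amount_cents": -5000},
    {"entry": 4, "payment": "pay_1", "account": "customer", "amount_cents": 5000},
]


def unbalanced_payments(entries):
    """Return {payment: net_cents} for every payment whose entries do not sum to zero."""
    totals = defaultdict(int)
    for entry in entries:
        totals[entry["payment"]] += entry["amount_cents"]
    return {payment: net for payment, net in totals.items() if net != 0}


print(unbalanced_payments(ledger))  # prints: {}

# A bug that writes only one side of a payment shows up immediately.
broken = ledger + [
    {"entry": 5, "payment": "pay_2", "account": "customer", "amount_cents": -1200}
]
print(unbalanced_payments(broken))  # prints: {'pay_2': -1200}
```

Checking per payment rather than only the global sum helps point the alert at the transaction that caused the imbalance.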

Step 5: Architecture

Token Vault (client-side SDK): Stripe.js or Braintree SDK collects raw card data in the browser and returns a single-use payment method token. Your backend never handles raw card numbers, which greatly reduces (but does not eliminate) your PCI DSS scope for card data handling.

API Gateway: Terminates TLS, validates API keys or OAuth tokens, enforces per-merchant rate limits, and routes traffic to the payment service.

Payment Service: The core business logic. Validates the request, checks idempotency, creates the payment state machine, calls the PSP, updates the ledger, and publishes events.

PSP Integration Layer: Communicates with external payment processors (Stripe, Adyen, PayPal). Implements circuit breakers with PSP-aware routing: if one PSP fails 5 consecutive requests, traffic switches to a backup PSP—improving authorization rates from ~94% to ~97.5%.
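
The consecutive-failure rule can be sketched as below. The threshold of 5 comes from the text; the `CircuitBreaker` and `PspRouter` classes and the PSP names are hypothetical, and a production breaker would also need a cooldown with a half-open probe so the primary PSP can recover, which this sketch omits.

```python
class CircuitBreaker:
    """Opens after a run of consecutive failures; any success resets the count."""

    def __init__(self, failure_threshold=5):
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    @property
    def is_open(self):
        return self.consecutive_failures >= self.failure_threshold

    def record(self, ok):
        self.consecutive_failures = 0 if ok else self.consecutive_failures + 1


class PspRouter:
    """Routes to the first PSP, in priority order, whose breaker is closed."""

    def __init__(self, psp_names, failure_threshold=5):
        self.breakers = {name: CircuitBreaker(failure_threshold) for name in psp_names}

    def choose(self):
        for name, breaker in self.breakers.items():
            if not breaker.is_open:
                return name
        raise RuntimeError("no PSP available")

    def report(self, name, ok):
        self.breakers[name].record(ok)


router = PspRouter(["primary_psp", "backup_psp"])
for _ in range(5):
    router.report("primary_psp", ok=False)  # five failures in a row
print(router.choose())  # prints: backup_psp
```

Failover is safest for requests that definitely failed. A request that timed out may still have charged the card, so its outcome should be resolved with the original PSP before the payment is retried elsewhere.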

Ledger Service: Manages the double-entry bookkeeping. Writes are ACID transactions in PostgreSQL. The ledger is the source of truth for all financial data.

Reconciliation Service: Runs daily (or more frequently) to compare internal ledger records against PSP settlement reports. Identifies and flags discrepancies for manual review.

Webhook Service: Notifies merchants of payment status changes (payment.succeeded, payment.failed, refund.created) via HTTPS POST with HMAC-SHA256 signatures. Retries failed deliveries 3 times with exponential backoff.
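
A sketch of the signing and retry schedule, using Python's standard `hmac` and `hashlib` modules. The function names, the shared secret, the payload fields, and the 1-second base delay are assumptions for illustration; real webhook schemes typically also sign a timestamp to limit replay, which is not shown.

```python
import hashlib
import hmac
import json


def sign(secret: bytes, body: bytes) -> str:
    """Hex-encoded HMAC-SHA256 of the raw request body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify(secret: bytes, body: bytes, signature: str) -> bool:
    """Constant-time comparison, as the merchant's endpoint would do."""
    return hmac.compare_digest(sign(secret, body), signature)


def backoff_delays(retries=3, base_seconds=1.0, factor=2.0):
    """Delay before each retry, growing exponentially."""
    return [base_seconds * factor**attempt for attempt in range(retries)]


secret = b"whsec_example"  # hypothetical per-merchant secret
body = json.dumps({"type": "payment.succeeded", "payment_id": "pay_123"}).encode()
signature = sign(secret, body)

print(verify(secret, body, signature))         # prints: True
print(verify(secret, body + b" ", signature))  # prints: False
print(backoff_delays())                        # prints: [1.0, 2.0, 4.0]
```

The merchant verifies the signature over the exact bytes received, so any change to the body in transit invalidates it.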

Interview application: "At 10M transactions per day (~116 QPS), a single PostgreSQL primary handles the write load comfortably. The complexity is not scale—it is correctness. The payment service, ledger, and idempotency key are all in the same PostgreSQL instance, enabling single-transaction ACID guarantees across all three. If we reach Visa-level load (~100K TPS), I would shard by merchant_id."
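
Sharding by merchant_id could look like the sketch below. The `shard_for` helper, the shard count, and the hash-modulo scheme are assumptions for illustration; the text does not specify how merchants map to shards.

```python
import hashlib


def shard_for(merchant_id: str, shard_count: int) -> int:
    """Map a merchant to a shard with a stable hash, so the mapping survives restarts."""
    digest = hashlib.sha256(merchant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


SHARD_COUNT = 16  # hypothetical
shard = shard_for("merchant_7", SHARD_COUNT)
print(0 <= shard < SHARD_COUNT)  # prints: True
```

Keeping all of a merchant's payments, ledger entries, and idempotency keys on one shard preserves single-transaction ACID guarantees per merchant. Plain modulo hashing moves most merchants when the shard count changes, so a lookup table or consistent hashing may be preferable if resharding is expected.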

Step 6: Failure Handling

