Key considerations for designing a resilient payment gateway

A payment gateway is the system that authorizes, routes, and settles financial transactions between a customer's bank and a merchant's bank—encrypting card data, performing fraud screening, communicating with card networks (Visa, Mastercard), and returning authorization decisions in under two seconds. Payment systems are the canonical test of whether a candidate understands correctness at scale. Most distributed systems happily trade consistency for availability; payments cannot. A double-charge is a regulatory event, a trust event, and sometimes a legal event. When an interviewer says "Design Stripe," they are probing whether you grasp that payments are state machines with external actors that cannot be rolled back—and that the hard problems (idempotency, exactly-once execution, ledger correctness, reconciliation) are invisible until you reason about failure modes explicitly.

Key Takeaways

  • Correctness over availability. Payment systems are CP (consistent and partition-tolerant). A user seeing a brief error is acceptable; charging them twice is not. Every architectural decision must prioritize financial accuracy.
  • Idempotency is non-negotiable. Network failures, client retries, and timeout-induced duplicates are inevitable. Every payment request must include a client-generated idempotency key. The server stores key → result mappings atomically with ledger entries and returns the stored result on replay.
  • Double-entry bookkeeping is mandatory. Every payment creates two ledger entries: a debit and a credit. The sum of all entries must always be zero. This is how production systems detect bugs before they cost millions.
  • Payment transactions are state machines with defined transitions: created → processing → authorized → captured → settled (or failed/refunded at various points). Explicit state tracking prevents the system from relying on a single event for correctness.
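  • Tokenization shrinks your PCI DSS scope. Client-side SDKs (Stripe.js, Braintree SDK) collect raw card data in the browser and return a single-use token. Your backend never sees the card number, so most card-data requirements fall on the provider rather than on your systems (a reduced self-assessment still applies).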

Step 1: Requirements and Scope

Functional requirements:

  • Accept payments: Process credit/debit card payments from customers.
  • Payment lifecycle: Support authorization, capture, settlement, and payout.
  • Refunds: Process full and partial refunds for completed payments.
  • Multi-currency: Accept payments in multiple currencies.
  • Webhooks: Notify merchants of payment status changes asynchronously.
  • Dashboard: Provide merchants with transaction history, analytics, and reconciliation tools.

Non-functional requirements:

  • Correctness: Zero duplicate charges. Zero lost transactions. Financial accuracy is the primary constraint.
  • Availability: 99.99% uptime, because payment downtime directly blocks merchant revenue.
  • Latency: Authorization response within 2 seconds.
  • Security: PCI DSS compliance. Encryption at rest and in transit. No raw card data in our systems.
  • Consistency: Strong consistency for ledger operations. Eventual consistency is not acceptable for financial records.
  • Scalability: Handle 10M transactions per day (~116 QPS average, ~350 QPS peak).
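The scalability figures follow from simple arithmetic, sketched below. The 3x peak multiplier is an assumption chosen here because it roughly reproduces the ~350 QPS figure; the article does not state how the peak was derived.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_qps(transactions_per_day: int) -> float:
    """Average load if traffic were spread evenly across the day."""
    return transactions_per_day / SECONDS_PER_DAY


def peak_qps(transactions_per_day: int, peak_factor: float = 3.0) -> float:
    """Peak load; peak_factor is an assumed multiplier, not a measured one."""
    return average_qps(transactions_per_day) * peak_factor


avg = average_qps(10_000_000)
peak = peak_qps(10_000_000)
print(f"average ~{avg:.0f} QPS, peak ~{peak:.0f} QPS")
# prints: average ~116 QPS, peak ~347 QPS
```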

Interview tip: Ask the interviewer: "Are we designing the gateway that communicates with card networks, or the merchant-facing payment service that integrates with an existing gateway like Stripe?" This scoping question dramatically changes the design's complexity.

Step 2: The Payment Lifecycle — A State Machine

Payments are not single events—they are state machines with defined transitions. Modeling them explicitly prevents the most common payment bugs.

State | Description | Trigger
--- | --- | ---
Created | Payment intent recorded; no money moved | Merchant initiates payment
Processing | Request sent to payment service provider (PSP) | System submits to card network
Authorized | Card network approves; funds held on customer's card | PSP returns approval
Captured | Funds transferred from customer's bank to merchant's acquiring bank | Merchant confirms delivery
Settled | Funds deposited into merchant's account | End-of-day batch settlement
Failed | Authorization denied by card network or bank | PSP returns decline
Refunded | Funds returned to customer after capture | Merchant initiates refund

Why state machines matter: A payment stuck in "processing" after a network timeout is not lost—the system knows its last known state and can query the PSP for the actual outcome. Without explicit state tracking, the system might re-submit the payment (causing a double-charge) or silently drop it (causing a lost transaction).

Interview application: "Every payment is a state machine with defined transitions. The payment service persists the current state in PostgreSQL before making any external call. If the PSP call times out, the payment remains in 'processing' state. A background reconciler queries the PSP for the actual outcome and advances the state accordingly. This eliminates the ambiguity window where the system does not know what happened."
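A minimal sketch of the state machine described in this step. The state names come from the lifecycle table; the exact transition map (for example, allowing authorized → failed) and the names Payment, ALLOWED_TRANSITIONS, and InvalidTransition are illustrative assumptions. Persistence is left out so the transition rule itself stays visible.

```python
from enum import Enum


class PaymentState(Enum):
    CREATED = "created"
    PROCESSING = "processing"
    AUTHORIZED = "authorized"
    CAPTURED = "captured"
    SETTLED = "settled"
    FAILED = "failed"
    REFUNDED = "refunded"


S = PaymentState
# Assumed transition map; a real system would derive it from its PSP's rules.
ALLOWED_TRANSITIONS = {
    S.CREATED: {S.PROCESSING},
    S.PROCESSING: {S.AUTHORIZED, S.FAILED},
    S.AUTHORIZED: {S.CAPTURED, S.FAILED},
    S.CAPTURED: {S.SETTLED, S.REFUNDED},
    S.SETTLED: {S.REFUNDED},
    S.FAILED: set(),    # terminal
    S.REFUNDED: set(),  # terminal
}


class InvalidTransition(Exception):
    """Raised when a caller asks for a transition the map does not allow."""


class Payment:
    def __init__(self, payment_id: str) -> None:
        self.payment_id = payment_id
        self.state = PaymentState.CREATED
        self.history = [PaymentState.CREATED]

    def transition(self, new_state: PaymentState) -> None:
        # Reject anything not explicitly allowed from the current state.
        if new_state not in ALLOWED_TRANSITIONS[self.state]:
            raise InvalidTransition(
                f"{self.payment_id}: {self.state.value} -> {new_state.value}"
            )
        self.state = new_state
        self.history.append(new_state)
```

Because illegal jumps raise instead of silently succeeding, a payment that timed out in "processing" cannot be moved to "settled" by a stray event; it has to pass through a recorded authorization first.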

Step 3: Idempotency — Preventing Double Charges

The fundamental challenge in payment systems: networks are unreliable. A client sends a payment request, the server charges the card, but the response is lost due to a timeout. The client retries—and without idempotency, the customer is charged twice.

Implementation:

  • The client generates a UUID (the idempotency key) per payment intent, before any network communication.
  • Every retry of the same intent reuses the same UUID.
  • The server looks up the idempotency key in the database before processing. If it exists, the server returns the stored result without re-executing. A unique constraint on the key ensures that two concurrent requests carrying the same key cannot both pass this check.
  • The idempotency key and the ledger entries are written in the same ACID transaction, so either both are recorded or neither is.
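The implementation above can be sketched with Python's built-in sqlite3 module standing in for PostgreSQL. The table names, column names, and the process_payment function are hypothetical, and the PSP call is replaced by a hard-coded success so the transaction boundary is the only moving part.

```python
import json
import sqlite3
import uuid

SCHEMA = """
CREATE TABLE idempotency_keys (
    idem_key TEXT PRIMARY KEY,  -- uniqueness rejects a concurrent duplicate
    result   TEXT NOT NULL
);
CREATE TABLE ledger_entries (
    id           INTEGER PRIMARY KEY,
    idem_key     TEXT NOT NULL,
    account      TEXT NOT NULL,
    amount_cents INTEGER NOT NULL
);
"""


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def new_idempotency_key() -> str:
    """Generated once per payment intent on the client, reused on every retry."""
    return str(uuid.uuid4())


def process_payment(conn, idem_key, customer, merchant, amount_cents) -> dict:
    row = conn.execute(
        "SELECT result FROM idempotency_keys WHERE idem_key = ?", (idem_key,)
    ).fetchone()
    if row is not None:
        return json.loads(row[0])  # replay: return the stored result unchanged

    # Assumption: the PSP call would happen here and succeeded.
    result = {"status": "succeeded", "amount_cents": amount_cents}

    # One transaction: commits on success, rolls back if any statement fails.
    with conn:
        conn.execute(
            "INSERT INTO idempotency_keys (idem_key, result) VALUES (?, ?)",
            (idem_key, json.dumps(result)),
        )
        conn.executemany(
            "INSERT INTO ledger_entries (idem_key, account, amount_cents) "
            "VALUES (?, ?, ?)",
            [(idem_key, customer, -amount_cents), (idem_key, merchant, amount_cents)],
        )
    return result
```

Calling process_payment twice with the same key returns the same result and leaves exactly two ledger rows. If a second request raced past the lookup, the primary-key violation would abort its transaction before any ledger row was written.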

Three-layer enforcement (staff-level depth):

  • Layer 1, API gateway: Deduplicates by client-generated UUID with 24-hour TTL.
  • Layer 2, Payment service: Checks idempotency key against the database before PSP submission.
  • Layer 3, PSP-level: Passes the idempotency key to the external gateway (Stripe, Adyen) to ensure the external charge is never duplicated even if retry logic fires.

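The vulnerability window: If the service crashes after the PSP charges the card but before the database records the result, the two systems disagree. Narrow this window with a write-ahead log: persist the PSP response to a durable log before updating the database, and on restart replay the log to complete interrupted transactions. A crash that happens before the response is logged is still recoverable, because the payment was persisted in "processing" state before the call and the reconciler can query the PSP for the outcome.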
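One way the write-ahead step might look, as a sketch rather than a production log: each PSP response is appended as a JSON line and fsynced before the database update, and startup replays any record the database has not applied yet. The class name PspResponseLog, the file layout, and the apply callback are assumptions for illustration.

```python
import json
import os


class PspResponseLog:
    """Append-only log of PSP responses, written before the database update."""

    def __init__(self, path: str) -> None:
        self.path = path

    def append(self, payment_id: str, psp_response: dict) -> None:
        record = json.dumps({"payment_id": payment_id, "response": psp_response})
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(record + "\n")
            f.flush()
            os.fsync(f.fileno())  # force the record to disk before returning

    def replay(self, applied_ids, apply) -> int:
        """Re-apply logged responses whose payment_id is not in applied_ids.

        apply(payment_id, response) performs the database update and should
        itself be idempotent. Returns the number of records replayed.
        """
        if not os.path.exists(self.path):
            return 0
        replayed = 0
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                if record["payment_id"] not in applied_ids:
                    apply(record["payment_id"], record["response"])
                    replayed += 1
        return replayed
```

Log truncation and handling of a torn final line are omitted here; both matter in practice.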

Interview application: "Idempotency is enforced at three layers. The API gateway deduplicates by UUID. The payment service checks the database. The PSP call includes the idempotency key. The vulnerability window between PSP charge and database write is closed by a write-ahead log that replays on startup. Keys expire after 24–72 hours to bound storage growth."

Step 4: Double-Entry Ledger — Financial Accuracy

Every payment creates two ledger entries: a debit from the customer and a credit to the merchant. The sum of all ledger entries must always be zero. This is not optional—it is how production payment systems detect bugs, prevent discrepancies, and satisfy auditors.

Example: Customer pays $50 for a product.

Entry | Account | Direction | Amount
--- | --- | --- | ---
1 | Customer (liability) | Debit | -$50.00
2 | Merchant (receivable) | Credit | +$50.00

If a refund is issued:

Entry | Account | Direction | Amount
--- | --- | --- | ---
3 | Merchant (receivable) | Debit | -$50.00
4 | Customer (liability) | Credit | +$50.00

The sum is always zero. Any imbalance indicates a bug. Automated reconciliation checks run hourly to verify ledger balance, and alerts fire immediately on any discrepancy.
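The $50 payment and its refund can be replayed in a few lines to show the zero-sum invariant. This is a sketch that follows the sign convention of the tables above (debits negative, credits positive); LedgerEntry, record_transfer, and is_balanced are hypothetical names.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class LedgerEntry:
    account: str
    amount: Decimal  # negative = debit, positive = credit


def record_transfer(ledger: list, debit_account: str, credit_account: str,
                    amount: Decimal) -> None:
    """Append a matching debit/credit pair; entries are never written singly."""
    if amount <= 0:
        raise ValueError("amount must be positive")
    ledger.append(LedgerEntry(debit_account, -amount))
    ledger.append(LedgerEntry(credit_account, amount))


def is_balanced(ledger: list) -> bool:
    """The check a reconciliation job would run: all entries sum to zero."""
    return sum((e.amount for e in ledger), Decimal("0")) == 0


def account_balance(ledger: list, account: str) -> Decimal:
    return sum((e.amount for e in ledger if e.account == account), Decimal("0"))


ledger = []
record_transfer(ledger, "customer", "merchant", Decimal("50.00"))  # entries 1-2
record_transfer(ledger, "merchant", "customer", Decimal("50.00"))  # entries 3-4
```

Decimal is used instead of float so that amounts such as 0.10 are stored exactly; storing integer minor units (cents) is a common alternative.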

Step 5: Architecture

Token Vault (client-side SDK): Stripe.js or Braintree SDK collects raw card data in the browser, returns a single-use payment method token. Your backend never handles raw card numbers, keeping you out of PCI DSS scope for card data handling.

API Gateway: Terminates TLS, validates API keys or OAuth tokens, enforces per-merchant rate limits, and routes traffic to the payment service.

Payment Service: The core business logic. Validates the request, checks idempotency, creates the payment state machine, calls the PSP, updates the ledger, and publishes events.

PSP Integration Layer: Communicates with external payment processors (Stripe, Adyen, PayPal). Implements circuit breakers with PSP-aware routing: if one PSP fails 5 consecutive requests, traffic switches to a backup PSP—improving authorization rates from ~94% to ~97.5%.
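A minimal sketch of the consecutive-failure circuit breaker described above, using the threshold of 5 from the text. CircuitBreaker and PspRouter are hypothetical names, and the half-open recovery step that a real breaker needs (periodically probing the failed PSP) is deliberately omitted.

```python
class CircuitBreaker:
    """Opens after a run of consecutive failures; any success resets the run."""

    def __init__(self, failure_threshold: int = 5) -> None:
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    @property
    def is_open(self) -> bool:
        return self.consecutive_failures >= self.failure_threshold

    def record(self, ok: bool) -> None:
        self.consecutive_failures = 0 if ok else self.consecutive_failures + 1


class PspRouter:
    """Routes to the first PSP, in priority order, whose circuit is closed."""

    def __init__(self, psp_names: list, failure_threshold: int = 5) -> None:
        self.breakers = {
            name: CircuitBreaker(failure_threshold) for name in psp_names
        }

    def choose(self) -> str:
        for name, breaker in self.breakers.items():
            if not breaker.is_open:
                return name
        raise RuntimeError("all PSP circuits are open")

    def report(self, name: str, ok: bool) -> None:
        self.breakers[name].record(ok)
```

With PspRouter(["primary", "backup"]), four failed calls to "primary" still route to it; the fifth consecutive failure switches choose() to "backup".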

Ledger Service: Manages the double-entry bookkeeping. Writes are ACID transactions in PostgreSQL. The ledger is the source of truth for all financial data.

Reconciliation Service: Runs daily (or more frequently) to compare internal ledger records against PSP settlement reports. Identifies and flags discrepancies for manual review.
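The comparison at the heart of reconciliation can be sketched as a diff between two mappings of payment ID to amount. The input shape (dictionaries keyed by payment ID, amounts in cents) and the three discrepancy buckets are assumptions for illustration; real settlement reports carry more fields.

```python
def reconcile(ledger_totals: dict, psp_report: dict) -> dict:
    """Compare internal ledger amounts against a PSP settlement report.

    Both arguments map payment_id -> amount in cents. Returns the payment IDs
    that need manual review, grouped by the kind of discrepancy.
    """
    ledger_ids = ledger_totals.keys()
    psp_ids = psp_report.keys()
    return {
        # We recorded a payment the PSP never settled.
        "missing_at_psp": sorted(ledger_ids - psp_ids),
        # The PSP settled a payment we have no record of.
        "missing_in_ledger": sorted(psp_ids - ledger_ids),
        # Both sides know the payment but disagree on the amount.
        "amount_mismatch": sorted(
            pid for pid in ledger_ids & psp_ids
            if ledger_totals[pid] != psp_report[pid]
        ),
    }
```

An empty list in every bucket means the two sources agree for that run.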

Webhook Service: Notifies merchants of payment status changes (payment.succeeded, payment.failed, refund.created) via HTTPS POST with HMAC-SHA256 signatures. Retries failed deliveries 3 times with exponential backoff.
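Signing and verifying a webhook body with HMAC-SHA256 needs only the standard library. The sketch below assumes a shared secret per merchant and a hex-encoded signature; header names, timestamp handling, and the one-second base delay are not specified in the article and are left as assumptions.

```python
import hashlib
import hmac


def sign_payload(secret: bytes, payload: bytes) -> str:
    """Sender side: hex HMAC-SHA256 of the raw request body."""
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()


def verify_signature(secret: bytes, payload: bytes, signature: str) -> bool:
    """Receiver side: recompute and compare in constant time."""
    expected = sign_payload(secret, payload)
    return hmac.compare_digest(expected, signature)


def retry_delays(attempts: int = 3, base_seconds: float = 1.0) -> list:
    """Exponential backoff schedule: base, 2x base, 4x base, ..."""
    return [base_seconds * 2 ** i for i in range(attempts)]
```

The merchant verifies against the raw bytes it received, not a re-serialized copy, since any change in whitespace or key order changes the digest.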

Interview application: "At 10M transactions per day (~116 QPS), a single PostgreSQL primary handles the write load comfortably. The complexity is not scale—it is correctness. The payment service, ledger, and idempotency key are all in the same PostgreSQL instance, enabling single-transaction ACID guarantees across all three. If we reach Visa-level load (~100K TPS), I would shard by merchant_id."
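If sharding by merchant_id does become necessary, the routing step might look like the sketch below. It uses a stable hash (SHA-256) so that every process maps a merchant to the same shard; the function name and the plain modulo scheme are assumptions, and modulo routing moves most merchants when the shard count changes, which is why consistent hashing or a lookup table is often preferred.

```python
import hashlib


def shard_for_merchant(merchant_id: str, shard_count: int) -> int:
    """Map a merchant to a shard index in [0, shard_count).

    Python's built-in hash() is randomized per process for strings, so a
    cryptographic digest is used to keep the mapping stable across restarts.
    """
    if shard_count <= 0:
        raise ValueError("shard_count must be positive")
    digest = hashlib.sha256(merchant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count
```

Keeping all of a merchant's payments, ledger entries, and idempotency keys on one shard preserves the single-transaction guarantee described above for that merchant.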

Step 6: Failure Handling

PSP Timeout

The payment service sends a charge request to the PSP and receives no response. The card may or may not have been charged.

Solution: Do not retry blindly. Query the PSP for the payment status using the idempotency key. If the PSP reports the charge succeeded, update the ledger. If the PSP reports no record, retry the charge with the same idempotency key, so that an original request that arrives late cannot produce a second charge. If the PSP is unreachable, leave the payment in "processing" state and let the reconciliation service resolve it.
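The three branches of this solution can be written as one small decision function. The PspStatus values and the callbacks (query_status, update_ledger, retry_charge) are hypothetical stand-ins for a real PSP client and ledger service.

```python
from enum import Enum


class PspStatus(Enum):
    SUCCEEDED = "succeeded"      # the PSP has a record and the charge went through
    NOT_FOUND = "not_found"      # the PSP has no record of this idempotency key
    UNREACHABLE = "unreachable"  # the status query itself failed


def resolve_timeout(idempotency_key, query_status, update_ledger, retry_charge) -> str:
    """Decide what to do after a charge request timed out.

    Returns a label for the action taken so callers can log or assert on it.
    """
    status = query_status(idempotency_key)
    if status is PspStatus.SUCCEEDED:
        update_ledger(idempotency_key)
        return "ledger_updated"
    if status is PspStatus.NOT_FOUND:
        # Reuse the same key so the PSP can deduplicate a late original.
        retry_charge(idempotency_key)
        return "charge_retried"
    # Unknown outcome: do nothing now; reconciliation will resolve it.
    return "left_in_processing"
```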

Smart Retry Strategy

Not all failures should be retried. Soft declines (insufficient funds) retry 3 times with exponential backoff over 24 hours—the customer may add funds. Hard declines (stolen card, invalid number) never retry. Network timeouts query PSP status before retrying.
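A sketch of how the retry rules above might be encoded. The three failure kinds come from the text; the way the 24-hour window is split (each wait double the previous one, summing to the window) is one possible reading of "exponential backoff over 24 hours" and is an assumption.

```python
from enum import Enum


class FailureKind(Enum):
    SOFT_DECLINE = "soft_decline"        # e.g. insufficient funds
    HARD_DECLINE = "hard_decline"        # e.g. stolen card, invalid number
    NETWORK_TIMEOUT = "network_timeout"  # outcome unknown


def backoff_delays_hours(retries: int = 3, window_hours: float = 24.0) -> list:
    """Waits before each retry, doubling each time and summing to the window."""
    weights = [2 ** i for i in range(retries)]  # 1, 2, 4 for three retries
    total = sum(weights)
    return [window_hours * w / total for w in weights]


def retry_plan(kind: FailureKind) -> dict:
    if kind is FailureKind.SOFT_DECLINE:
        return {"action": "retry", "delays_hours": backoff_delays_hours()}
    if kind is FailureKind.HARD_DECLINE:
        return {"action": "stop", "delays_hours": []}
    # Timeout: the charge may have succeeded, so check before charging again.
    return {"action": "query_psp_status", "delays_hours": []}
```

With the defaults, the three waits are roughly 3.4, 6.9, and 13.7 hours.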

Multi-PSP Failover

Integrate with 2–3 PSPs simultaneously. Monitor authorization success rates per PSP. If a PSP's failure rate exceeds a threshold, route traffic to an alternative PSP automatically. This improves aggregate authorization rates and provides resilience against single-provider outages.
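Threshold-based routing, as opposed to counting consecutive failures, can be sketched with a sliding window of recent outcomes per PSP. The window size of 100, the 20% threshold, and the names PspHealth and pick_psp are illustrative assumptions, not values from the article.

```python
from collections import deque


class PspHealth:
    """Tracks the most recent authorization outcomes for one PSP."""

    def __init__(self, window: int = 100) -> None:
        self.outcomes = deque(maxlen=window)  # oldest results fall off the end

    def record(self, ok: bool) -> None:
        self.outcomes.append(ok)

    def failure_rate(self) -> float:
        if not self.outcomes:
            return 0.0  # no data yet: treat as healthy
        return self.outcomes.count(False) / len(self.outcomes)


def pick_psp(health_by_psp: dict, max_failure_rate: float = 0.2) -> str:
    """Choose the PSP with the lowest recent failure rate under the threshold."""
    healthy = {
        name: h.failure_rate()
        for name, h in health_by_psp.items()
        if h.failure_rate() <= max_failure_rate
    }
    if not healthy:
        raise RuntimeError("no PSP is below the failure-rate threshold")
    return min(healthy, key=healthy.get)
```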

Step 7: Security and Compliance

PCI DSS: The most critical compliance requirement. Tokenization (Stripe.js) keeps raw card data off your servers. Required measures include annual penetration testing, MFA for admin access, quarterly vulnerability scans, and encrypted data at rest and in transit.

Encryption: TLS 1.3 for all external communication. AES-256 encryption at rest for all stored data. Encryption keys managed by AWS KMS or HashiCorp Vault—never stored alongside data.

Audit logging: Every payment state transition, every ledger entry, every administrative action is logged immutably. Logs are append-only and stored for 7+ years for regulatory compliance. When JP Morgan interviewers evaluate payment system designs, they specifically look for audit trail depth.


Frequently Asked Questions

Why is idempotency the most important concept in payment system design?

Because networks are unreliable and retries are inevitable. Without idempotency, a timeout followed by a retry charges the customer twice. Every payment request must include a client-generated idempotency key. The server stores key → result mappings and returns stored results on replay. This ensures exactly-once effects from the user's perspective.

What is a double-entry ledger and why is it mandatory?

Every payment creates two ledger entries: a debit and a credit summing to zero. This ensures financial accuracy—any imbalance indicates a bug. Automated reconciliation checks verify balance hourly. Double-entry bookkeeping is not a design choice; it is a requirement for any system handling real money.

Should a payment system prioritize consistency or availability?

Consistency. Payment systems are CP systems. A brief error message is acceptable; a double-charge or lost transaction is not. Strong consistency for ledger operations is non-negotiable. This means using ACID transactions in PostgreSQL, not eventually consistent NoSQL databases for financial records.

How does tokenization work for PCI DSS compliance?

Client-side SDKs (Stripe.js, Braintree SDK) collect raw card data in the browser before it reaches your servers. They return a single-use payment method token. Your backend processes the token—never the card number—keeping your systems out of PCI DSS scope for card data handling.

What is a payment state machine?

A model that tracks every payment through defined states: created → processing → authorized → captured → settled (or failed/refunded). Explicit state tracking prevents ambiguity during failures—a payment stuck in "processing" after a timeout is not lost; the system knows to query the PSP for the actual outcome.

How do you handle PSP timeouts in a payment system?

Never retry blindly. Query the PSP using the idempotency key to check the payment's actual status. If charged, update the ledger. If no record exists, retry the charge. If the PSP is unreachable, leave the payment in "processing" state for the reconciliation service to resolve.

What is reconciliation and why does it matter?

Reconciliation compares internal ledger records against PSP settlement reports to identify discrepancies. It runs daily or more frequently. Any mismatch triggers investigation. Reconciliation catches bugs that individual transaction validation misses—especially edge cases around timeouts and partial failures.

How do you scale a payment system beyond 100K TPS?

Shard the database by merchant_id. Use read replicas for dashboard and analytics queries. Process settlement and reconciliation asynchronously via Kafka. Implement connection pooling and batch commits. At 10M transactions/day (~116 QPS), a single PostgreSQL primary is sufficient—sharding is only needed at Visa-level volumes.

What retry strategy should a payment system use?

PSP-aware retries: soft declines (insufficient funds) retry 3x over 24 hours with backoff. Hard declines (stolen card) never retry. Network timeouts query PSP status before retrying the charge. Circuit breakers switch to backup PSPs after consecutive failures.

What are the most common payment system design mistakes in interviews?

  • Ignoring idempotency (causes double-charges).
  • Using eventually consistent databases for ledger operations (causes financial inaccuracy).
  • Treating compliance as an afterthought (PCI DSS is a requirement, not an optimization).
  • Jumping into APIs before clarifying scope (pay-in vs pay-out changes the design dramatically).
  • Missing the reconciliation service (how bugs get detected in production).

TL;DR

A payment gateway is a state machine where correctness trumps availability. Every payment transitions through defined states (created → processing → authorized → captured → settled) with explicit tracking that eliminates ambiguity during failures. Idempotency prevents double-charges through client-generated UUIDs checked at three layers (API gateway, payment service, PSP). Double-entry bookkeeping ensures financial accuracy—every debit has a matching credit, and the sum is always zero. Tokenization (Stripe.js) keeps raw card data off your servers for PCI DSS compliance. PSP timeout handling queries status before retrying. Multi-PSP failover with circuit breakers improves authorization rates from ~94% to ~97.5%. Reconciliation runs daily to catch discrepancies between internal records and PSP reports. At 10M transactions/day, a single PostgreSQL primary handles the load—payment complexity is correctness, not scale, until you reach Visa-level volumes (~100K TPS).

TAGS
System Design Interview
System Design Fundamentals
CONTRIBUTOR
Design Gurus Team
Copyright © 2026 Design Gurus, LLC. All rights reserved.