Designing a Payment Processing System at Scale: Idempotency, Double-Spend Prevention & Settlement
Payment systems are among the most unforgiving domains in distributed systems: a single duplicated charge or missed transaction can cause legal liability, chargebacks, and loss of user trust. This guide covers the complete architecture of a production payment system, from idempotency keys to settlement reconciliation.
TL;DR — Core Principles
"Every payment operation must be idempotent (safe to retry), consistent (exactly-once execution), and auditable (double-entry ledger). The architecture must tolerate network failures, PSP outages, and clock skew without ever double-charging or losing a transaction."
Table of Contents
- Requirements & Scale Estimation
- High-Level Architecture
- Idempotency — The Foundation of Safe Payments
- Double-Spend & Race Condition Prevention
- Checkout Saga — Distributed Transaction Pattern
- PSP Integration & Webhook Handling
- Ledger & Double-Entry Accounting
- Settlement & Reconciliation
- PCI DSS Compliance Architecture
- Scaling to Millions of Transactions
- Design Checklist & Conclusion
1. Requirements & Scale Estimation
Before designing, anchor on realistic numbers. The requirements and estimates below are for a mid-sized e-commerce platform.
Functional Requirements
- Process payment charges, refunds, partial refunds, and cancellations
- Support multiple payment methods: cards, bank transfers, wallets, BNPL
- Handle webhook callbacks from PSPs (payment service providers)
- Maintain a complete audit trail for every financial event
- Perform daily settlement and reconciliation with PSPs
- Detect and prevent fraud in real time
Scale Estimates
| Metric | Value | Notes |
|---|---|---|
| Peak TPS | 5,000 transactions/sec | Black Friday peaks |
| Daily transactions | ~10M | 115 avg TPS |
| Idempotency key storage | ~5GB/day (24h TTL) | Redis |
| Ledger entries/year | ~7 billion rows | 2 entries per txn (debit/credit) |
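The figures in the table follow from simple arithmetic. The check below is a rough sketch; the 500 bytes per idempotency record is an assumption inferred from 5GB over ~10M keys, not a number stated above.

```python
SECONDS_PER_DAY = 24 * 60 * 60

daily_txns = 10_000_000        # from the table
peak_tps = 5_000               # from the table
bytes_per_key = 500            # assumption: key plus serialized response

avg_tps = daily_txns / SECONDS_PER_DAY
peak_to_avg = peak_tps / avg_tps
ledger_rows_per_year = daily_txns * 2 * 365          # one debit + one credit per txn
idempotency_bytes_per_day = daily_txns * bytes_per_key

print(f"average TPS:         {avg_tps:.1f}")
print(f"peak / average:      {peak_to_avg:.0f}x")
print(f"ledger rows / year:  {ledger_rows_per_year:,}")
print(f"idempotency storage: {idempotency_bytes_per_day / 1e9:.1f} GB/day")
```

The peak-to-average ratio is the number to design for: capacity sized for the daily average would be overwhelmed many times over during a Black Friday peak.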
2. High-Level Architecture
The payment system is composed of loosely coupled microservices, each owning a specific domain, communicating through a Kafka event bus for resilience and auditability.
Core Services
- Payment Service: Orchestrates charge, refund, and cancel flows. Checks idempotency, calls PSP, writes to DB, publishes events.
- PSP Adapter: Wraps Stripe/Adyen/Braintree API calls with retry logic, timeout handling, and webhook signature verification.
- Ledger Service: Maintains immutable double-entry accounting records for every financial movement.
- Settlement Service: Runs nightly batch to reconcile PSP settlement reports against ledger entries.
- Fraud Service: Scores transactions in real time using ML features (velocity, device fingerprint, behavioral patterns).
- Notification Service: Sends payment confirmation, failure, and refund emails/SMS.
3. Idempotency — The Foundation of Safe Payments
Idempotency is non-negotiable in payment systems. Network failures are common: mobile apps lose connectivity mid-request, clients time out and retry, and load balancers or proxies may resend a request to another backend. Without idempotency, every retry is a potential duplicate charge.
How Idempotency Keys Work
The client generates a UUID v4 idempotency key per payment intent and sends the same key with every request for that intent, including retries:
POST /v1/payments
Idempotency-Key: 550e8400-e29b-41d4-a716-446655440000
Content-Type: application/json

{
  "amount": 4999,
  "currency": "USD",
  "customer_id": "cust_abc123",
  "payment_method_id": "pm_visa_xxx"
}
The server stores the key → response mapping in Redis with a 24-hour TTL. On every incoming request, it checks Redis first:
// Pseudocode: idempotency check in Payment Service
String key = "idempotency:" + customerId + ":" + idempotencyKey;
String cachedResponse = redis.get(key);
if (cachedResponse != null) {
    return deserialize(cachedResponse); // return same result, no re-processing
}
// Acquire distributed lock to prevent concurrent duplicate processing.
// The lock uses its own key so it cannot collide with the cached response entry.
try (DistributedLock lock = lockService.acquire("lock:" + key, 30_000)) {
    // Double-check after acquiring lock
    cachedResponse = redis.get(key);
    if (cachedResponse != null) {
        return deserialize(cachedResponse);
    }
    PaymentResult result = processPayment(request);
    redis.setex(key, 86400, serialize(result)); // 24h TTL
    return result;
}
Idempotency Key Design Rules
- ✅ Keys must be generated client-side (not server-side) to survive client crashes
- ✅ Scope to customer + operation: charge:{customer_id}:{uuid}
- ✅ Store response body, not just status; clients may need the full transaction ID
- ✅ Reject requests where the same key is used with different request bodies (conflict → 422)
- ✅ Apply idempotency to refunds and cancellations too, not just charges
- ✅ PSP webhook processing must also be idempotent — PSPs deliver webhooks at least once
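The scoping, stored-response, and conflict rules above can be sketched as follows. This is a minimal single-process illustration, not a production design: `IdempotencyStore`, `StoredResponse`, and `fake_charge` are hypothetical names, a dict stands in for Redis (no TTL), and the distributed lock is omitted.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass
class StoredResponse:
    fingerprint: str   # hash of the original request body
    status: int
    body: dict


class IdempotencyStore:
    """In-memory stand-in for Redis (hypothetical; no TTL, no locking)."""

    def __init__(self):
        self._entries = {}

    @staticmethod
    def fingerprint(request_body: dict) -> str:
        # Canonical JSON so that key order does not change the hash.
        canonical = json.dumps(request_body, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode()).hexdigest()

    def handle(self, customer_id, idempotency_key, request_body, process):
        key = f"charge:{customer_id}:{idempotency_key}"
        fp = self.fingerprint(request_body)
        stored = self._entries.get(key)
        if stored is not None:
            if stored.fingerprint != fp:
                # Same key, different body: reject instead of replaying.
                return 422, {"error": "idempotency key reused with a different request body"}
            return stored.status, stored.body
        status, body = process(request_body)
        self._entries[key] = StoredResponse(fp, status, body)
        return status, body


calls = []


def fake_charge(body):
    # Hypothetical processor; records each real execution.
    calls.append(body)
    return 201, {"transaction_id": f"txn_{len(calls)}", "amount": body["amount"]}


store = IdempotencyStore()
req = {"amount": 4999, "currency": "USD"}
first = store.handle("cust_abc123", "key-1", req, fake_charge)
retry = store.handle("cust_abc123", "key-1", req, fake_charge)
conflict = store.handle("cust_abc123", "key-1", {"amount": 1, "currency": "USD"}, fake_charge)
```

The retry returns the stored response, including the transaction ID, and `fake_charge` runs only once. Storing a fingerprint of the request alongside the response is what makes the 422 check possible.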
4. Double-Spend & Race Condition Prevention
Double-spend occurs when two concurrent requests for the same payment both succeed. This can happen when a client retries too aggressively while the original request is still processing.
Distributed Lock + Optimistic Locking
Two-layer defense: Redis distributed lock for cross-instance coordination, plus database-level optimistic locking (version column) as the final safety net:
-- Database check with optimistic locking
UPDATE payment_intents
SET status = 'processing', version = version + 1
WHERE id = :id
AND status = 'pending' -- only process once
AND version = :expected_version; -- optimistic lock
-- If 0 rows updated → concurrent request already processing → return 409
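To see the conditional UPDATE in action, here is a small sketch using Python's built-in `sqlite3` in place of the production database. The `claim_for_processing` helper is a hypothetical name; the table and columns follow the SQL above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payment_intents ("
    "id TEXT PRIMARY KEY, status TEXT NOT NULL, version INTEGER NOT NULL)"
)
conn.execute("INSERT INTO payment_intents VALUES ('pi_1', 'pending', 1)")
conn.commit()


def claim_for_processing(conn, intent_id, expected_version):
    """Return True only if this caller moved the intent from pending to processing."""
    cur = conn.execute(
        """
        UPDATE payment_intents
           SET status = 'processing', version = version + 1
         WHERE id = ? AND status = 'pending' AND version = ?
        """,
        (intent_id, expected_version),
    )
    conn.commit()
    return cur.rowcount == 1   # 0 rows means another request got there first


# Two requests that both read version 1 before either one wrote.
first_attempt = claim_for_processing(conn, "pi_1", expected_version=1)
second_attempt = claim_for_processing(conn, "pi_1", expected_version=1)
print(first_attempt, second_attempt)
```

Only the first caller wins; the second sees zero affected rows and should return 409 rather than call the PSP.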
Payment State Machine
Strict state transitions prevent invalid operations:
- pending → processing → succeeded | failed | cancelled
- Only pending payments can be charged; reject all other states
- Once succeeded, a new refund intent is created; never modify the original charge
- State transitions are written atomically with the PSP result in the same DB transaction
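One simple way to enforce these transitions is a lookup table of allowed next states. The sketch below assumes only the five states listed above; `ALLOWED_TRANSITIONS`, `InvalidTransition`, and `transition` are illustrative names.

```python
ALLOWED_TRANSITIONS = {
    "pending": {"processing"},
    "processing": {"succeeded", "failed", "cancelled"},
    "succeeded": set(),    # terminal: refunds are separate intents
    "failed": set(),       # terminal
    "cancelled": set(),    # terminal
}


class InvalidTransition(Exception):
    pass


def transition(current: str, target: str) -> str:
    """Return the new state, or raise if the move is not in the table."""
    if target not in ALLOWED_TRANSITIONS.get(current, set()):
        raise InvalidTransition(f"{current} -> {target} is not allowed")
    return target


state = transition("pending", "processing")
state = transition(state, "succeeded")
```

Keeping the table in one place makes it easy to review and to mirror as a database constraint or as the WHERE clause of the status update.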
5. Checkout Saga — Distributed Transaction Pattern
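A checkout involves multiple services: inventory, payment, and fulfillment. Two-phase commit (2PC) is impractical across microservices, because each service owns its own datastore and an external PSP cannot take part in a shared transaction coordinator. Instead, use the Saga pattern with compensating transactions: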
Choreography Saga — Checkout Flow
- Order Service → create order (status: pending) → publish order.created
- Inventory Service → reserve stock → publish inventory.reserved or inventory.reservation_failed
- Payment Service → charge card → publish payment.succeeded or payment.failed
- Fulfillment Service → create shipment → publish shipment.created
- Notification Service → send confirmation email
Compensations: If payment fails → Inventory Service listens to payment.failed and releases the reservation. If fulfillment fails → Payment Service issues refund.
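The flow and its payment-failure compensation can be simulated in-process. This is a toy sketch: `EventBus` is a synchronous stand-in for Kafka, the handlers are plain functions rather than separate services, and order status updates are folded into them for brevity.

```python
from collections import defaultdict


class EventBus:
    """Synchronous in-process stand-in for Kafka topics (illustration only)."""

    def __init__(self):
        self.handlers = defaultdict(list)
        self.log = []

    def subscribe(self, event_type, handler):
        self.handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        self.log.append(event_type)
        for handler in self.handlers[event_type]:
            handler(payload)


def run_checkout(card_ok: bool):
    bus = EventBus()
    stock = {"reserved": 0}
    orders = {}

    def reserve_stock(order):               # Inventory Service
        stock["reserved"] += 1
        bus.publish("inventory.reserved", order)

    def charge_card(order):                 # Payment Service
        bus.publish("payment.succeeded" if card_ok else "payment.failed", order)

    def release_stock(order):               # compensation for payment.failed
        stock["reserved"] -= 1
        orders[order["id"]] = "cancelled"

    def create_shipment(order):             # Fulfillment Service
        orders[order["id"]] = "confirmed"
        bus.publish("shipment.created", order)

    bus.subscribe("order.created", reserve_stock)
    bus.subscribe("inventory.reserved", charge_card)
    bus.subscribe("payment.failed", release_stock)
    bus.subscribe("payment.succeeded", create_shipment)

    order = {"id": "ord_1"}
    orders[order["id"]] = "pending"
    bus.publish("order.created", order)     # Order Service kicks off the saga
    return bus.log, stock["reserved"], orders["ord_1"]


happy_path = run_checkout(card_ok=True)
failed_payment = run_checkout(card_ok=False)
```

In the failure run the reservation count returns to zero: the saga never rolls back, it moves forward through a compensating action.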
Outbox Pattern — Guaranteed Event Delivery
Never publish Kafka events directly from the payment handler — if the app crashes after the DB write but before the Kafka publish, the event is lost. Use the Outbox pattern:
-- Inside DB transaction (atomic)
BEGIN TRANSACTION;
UPDATE payments SET status = 'succeeded' WHERE id = :id;
INSERT INTO outbox_events (aggregate_id, event_type, payload)
VALUES (:id, 'payment.succeeded', :json_payload);
COMMIT;
-- A separate Debezium CDC connector tails the database log for outbox_events inserts
-- → publishes them to Kafka. Delivery is at-least-once, so consumers must deduplicate.
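The sketch below shows the same idea end to end with `sqlite3`. Note the swap: instead of a Debezium CDC connector it uses a simple polling relay, which is easier to demonstrate in one process. The `seq` and `published` columns, `mark_succeeded`, and `relay_once` are assumptions made for this illustration.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE payments (id TEXT PRIMARY KEY, status TEXT NOT NULL);
    CREATE TABLE outbox_events (
        seq          INTEGER PRIMARY KEY AUTOINCREMENT,
        aggregate_id TEXT NOT NULL,
        event_type   TEXT NOT NULL,
        payload      TEXT NOT NULL,
        published    INTEGER NOT NULL DEFAULT 0
    );
    INSERT INTO payments VALUES ('pay_1', 'processing');
    """
)


def mark_succeeded(conn, payment_id):
    # "with conn" commits both writes together, or rolls both back on error.
    with conn:
        conn.execute("UPDATE payments SET status = 'succeeded' WHERE id = ?", (payment_id,))
        conn.execute(
            "INSERT INTO outbox_events (aggregate_id, event_type, payload) VALUES (?, ?, ?)",
            (payment_id, "payment.succeeded", json.dumps({"payment_id": payment_id})),
        )


def relay_once(conn, publish):
    """Publish pending rows in order; flag each row only after publish returns."""
    rows = conn.execute(
        "SELECT seq, event_type, payload FROM outbox_events WHERE published = 0 ORDER BY seq"
    ).fetchall()
    for seq, event_type, payload in rows:
        publish(event_type, json.loads(payload))   # if this raises, the row stays pending
        with conn:
            conn.execute("UPDATE outbox_events SET published = 1 WHERE seq = ?", (seq,))
    return len(rows)


sent = []
mark_succeeded(conn, "pay_1")
first_pass = relay_once(conn, lambda t, p: sent.append((t, p)))
second_pass = relay_once(conn, lambda t, p: sent.append((t, p)))
```

A crash between `publish` and the flag update causes a re-send on the next pass, so delivery is at-least-once and downstream consumers still need to deduplicate.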
6. PSP Integration & Webhook Handling
Payment Service Providers (Stripe, Adyen, Braintree) are external dependencies with their own failure modes. Your integration must handle PSP timeouts, network errors, and asynchronous callbacks robustly.
PSP Call Resilience Pattern
- Timeout: Set aggressive timeouts (2–5s for charges, 10s for refunds). Never wait indefinitely.
- Retry with idempotency: On timeout/5xx, retry with the same idempotency key — PSP will deduplicate. Max 3 retries with exponential backoff.
- Unknown state: If you receive a timeout and can't confirm success or failure, set payment to unknown state. A reconciliation job resolves it within minutes by polling the PSP.
- Circuit breaker: If PSP error rate exceeds 50% over 30 seconds, open circuit and show "payment temporarily unavailable."
- Multi-PSP failover: Route to backup PSP (e.g., Stripe → Adyen) when primary circuit is open.
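The retry and unknown-state rules can be sketched like this. It is a simplified illustration: `PspTimeout`, `charge_with_retry`, and the 0.5s base delay are assumptions, and the circuit breaker and failover are left out.

```python
import time


class PspTimeout(Exception):
    """Hypothetical error a PSP client raises on timeout or 5xx."""


def charge_with_retry(call_psp, idempotency_key, max_retries=3, base_delay=0.5,
                      sleep=time.sleep):
    """One initial attempt plus up to max_retries retries, always with the same key."""
    for attempt in range(max_retries + 1):
        try:
            return call_psp(idempotency_key)
        except PspTimeout:
            if attempt == max_retries:
                break
            sleep(base_delay * (2 ** attempt))     # exponential backoff: 0.5s, 1s, 2s
    # Outcome could not be confirmed; leave it for the reconciliation job.
    return {"status": "unknown", "idempotency_key": idempotency_key}


delays = []
attempts = {"count": 0}


def flaky_psp(key):
    # Fails twice, then succeeds.
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise PspTimeout()
    return {"status": "succeeded", "idempotency_key": key}


def dead_psp(key):
    raise PspTimeout()


result = charge_with_retry(flaky_psp, "key-1", sleep=delays.append)
unknown = charge_with_retry(dead_psp, "key-2", sleep=lambda _: None)
```

Passing `sleep` as a parameter keeps the function testable without real waiting. Returning `unknown` instead of `failed` matters: a timed-out charge may still have gone through at the PSP.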
Webhook Processing
PSPs deliver webhooks asynchronously for events like charge.succeeded, charge.failed, refund.created. Webhooks may arrive out of order and may be delivered multiple times:
- Verify webhook signature (HMAC-SHA256) before processing — reject unverified events
- Store processed webhook IDs in Redis (TTL 7 days) — skip if already processed
- Enqueue the event durably, respond 200 right away, and process asynchronously from the queue; never block the webhook receiver
- Handle out-of-order delivery: use created_at timestamps; ignore older events for the same payment
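A compact sketch of signature verification, deduplication, and stale-event handling follows. Real PSPs each define their own signing scheme (often including a timestamp in the signed string), so the plain hex HMAC over the raw body, the event field names, and `WEBHOOK_SECRET` are all assumptions made for illustration.

```python
import hashlib
import hmac
import json

WEBHOOK_SECRET = b"whsec_example"       # hypothetical shared secret


def sign(payload: bytes, secret: bytes = WEBHOOK_SECRET) -> str:
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()


class WebhookProcessor:
    def __init__(self):
        self.seen_event_ids = set()     # stand-in for Redis keys with a 7-day TTL
        self.latest = {}                # payment_id -> (created_at, event type)

    def handle(self, payload: bytes, signature: str) -> str:
        # Constant-time comparison avoids leaking how much of the signature matched.
        if not hmac.compare_digest(sign(payload), signature):
            return "rejected"
        event = json.loads(payload)
        if event["id"] in self.seen_event_ids:
            return "duplicate"
        self.seen_event_ids.add(event["id"])
        current = self.latest.get(event["payment_id"])
        if current is not None and event["created_at"] <= current[0]:
            return "stale"              # older than what was already applied
        self.latest[event["payment_id"]] = (event["created_at"], event["type"])
        return "applied"


def make(event_id, created_at, event_type):
    body = json.dumps({"id": event_id, "payment_id": "pay_1",
                       "created_at": created_at, "type": event_type}).encode()
    return body, sign(body)


processor = WebhookProcessor()
succeeded = make("evt_2", 200, "charge.succeeded")
older = make("evt_1", 100, "charge.pending")
results = [
    processor.handle(*succeeded),                     # first delivery
    processor.handle(*succeeded),                     # redelivery
    processor.handle(*older),                         # arrives late
    processor.handle(succeeded[0], "bad-signature"),  # forged
]
```

The signature is checked against the raw bytes before any JSON parsing, so unverified input never reaches business logic.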
7. Ledger & Double-Entry Accounting
Every financial system requires an immutable audit trail. The double-entry ledger records every money movement as two equal and opposite entries, ensuring the books always balance.
-- Ledger schema (append-only, never updated or deleted); PostgreSQL syntax
CREATE TABLE ledger_entries (
    id         BIGSERIAL PRIMARY KEY,
    account_id UUID NOT NULL,          -- customer, merchant, or system account
    txn_id     UUID NOT NULL,          -- links debit and credit entries
    entry_type TEXT NOT NULL CHECK (entry_type IN ('debit', 'credit')),
    amount     BIGINT NOT NULL CHECK (amount > 0), -- cents, never floats
    currency   CHAR(3) NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    metadata   JSONB
);
-- Example: customer pays merchant $49.99
-- id, created_at, and metadata are left to their defaults
INSERT INTO ledger_entries (account_id, txn_id, entry_type, amount, currency) VALUES
    (:customer_acct, :txn_id, 'debit',  4999, 'USD'), -- customer -$49.99
    (:merchant_acct, :txn_id, 'credit', 4999, 'USD'); -- merchant +$49.99
Key Ledger Design Rules
- ✅ Store amounts in the smallest currency unit (cents, pence, paise) — never use floats for money
- ✅ The ledger is append-only — never UPDATE or DELETE ledger rows
- ✅ Refunds are new credit/debit pairs — not modifications to original entries
- ✅ Every ledger entry references a txn_id that links its debit/credit pair
- ✅ Partition by created_at month for query performance on large datasets
- ✅ Running balance = SUM(credits) - SUM(debits) for an account; the ledger is the source of truth, so derive balances from it rather than maintaining an independently updated figure (prevents drift)
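These rules can be exercised with a small in-memory model. It is a single-currency sketch with hypothetical names (`Entry`, `Ledger`, `post`), following the article's convention that an account balance is credits minus debits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    account_id: str
    txn_id: str
    entry_type: str      # "debit" or "credit"
    amount: int          # smallest currency unit, never a float
    currency: str


class Ledger:
    def __init__(self):
        self._entries = []               # append-only: no update or delete methods

    def post(self, txn_id, debit_account, credit_account, amount, currency):
        """Record one money movement as a matched debit/credit pair."""
        if not isinstance(amount, int) or amount <= 0:
            raise ValueError("amount must be a positive integer in minor units")
        self._entries.append(Entry(debit_account, txn_id, "debit", amount, currency))
        self._entries.append(Entry(credit_account, txn_id, "credit", amount, currency))

    def balance(self, account_id, currency):
        """Recompute SUM(credits) - SUM(debits) from the entries."""
        total = 0
        for e in self._entries:
            if e.account_id == account_id and e.currency == currency:
                total += e.amount if e.entry_type == "credit" else -e.amount
        return total

    def is_balanced(self):
        debits = sum(e.amount for e in self._entries if e.entry_type == "debit")
        credits = sum(e.amount for e in self._entries if e.entry_type == "credit")
        return debits == credits


ledger = Ledger()
ledger.post("txn_1", "customer_acct", "merchant_acct", 4999, "USD")   # payment
ledger.post("txn_2", "merchant_acct", "customer_acct", 1000, "USD")   # partial refund

merchant_balance = ledger.balance("merchant_acct", "USD")
customer_balance = ledger.balance("customer_acct", "USD")
```

The refund is a new pair in the opposite direction, so the original entries stay untouched and the total of debits always equals the total of credits.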
8. Settlement & Reconciliation
PSPs batch-settle funds to merchant bank accounts on a schedule (T+1 or T+2). Settlement reconciliation ensures your ledger matches the PSP's settlement report — discrepancies indicate missing transactions, processing errors, or fraud.
Reconciliation Pipeline
- Download PSP settlement report (CSV/JSON via SFTP or API) at end of settlement period
- Parse and normalize PSP transaction IDs, amounts, fees, and statuses
- Match each PSP entry against your internal payment records by PSP transaction ID
- Identify discrepancies:
  - PSP charged but no internal record → potential ghost charge
  - Internal record but no PSP entry → payment may not have processed
  - Amount mismatch → fee calculation error or currency conversion issue
- Auto-resolve known patterns (e.g., authorization holds that expired)
- Flag unresolved discrepancies for manual finance team review
- Generate reconciliation report with match rate, total settled, and open items
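The matching and discrepancy steps of the pipeline reduce to a keyed comparison. The sketch below assumes both sides have already been normalized to a mapping of PSP transaction ID to amount in minor units; `reconcile` and the report keys are illustrative names.

```python
def reconcile(psp_rows, internal_rows):
    """Compare two {psp_txn_id: amount_minor} mappings and classify every ID."""
    report = {"matched": [], "missing_internal": [], "missing_psp": [], "amount_mismatch": []}
    for txn_id, psp_amount in psp_rows.items():
        if txn_id not in internal_rows:
            report["missing_internal"].append(txn_id)    # potential ghost charge
        elif internal_rows[txn_id] != psp_amount:
            report["amount_mismatch"].append(txn_id)     # fee or FX issue
        else:
            report["matched"].append(txn_id)
    # Internal records the PSP never reported.
    report["missing_psp"] = [t for t in internal_rows if t not in psp_rows]
    total = len(psp_rows.keys() | internal_rows.keys())
    report["match_rate"] = len(report["matched"]) / total if total else 1.0
    return report


psp = {"ch_1": 4999, "ch_2": 2500, "ch_3": 1000}
internal = {"ch_1": 4999, "ch_2": 2400, "ch_4": 700}
report = reconcile(psp, internal)
```

Everything outside `matched` becomes an open item: some can be auto-resolved by known patterns, the rest go to the finance team.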
Handling PSP Fees
PSPs charge processing fees (typically 1.5–3% plus a fixed fee), which cover interchange, card scheme fees, and the PSP's own markup. Your ledger must account for fees separately: the gross amount minus the PSP fee equals the net settlement. Store fee amounts per transaction to enable accurate revenue reporting.
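As a worked example of gross, fee, and net, assume a rate of 2.9% plus a 30-cent fixed fee (an illustrative rate, not one quoted by any specific PSP) and round-half-up on the percentage part. The actual rounding rule is defined by each PSP and should be taken from its settlement report.

```python
def psp_fee(gross_minor: int, rate_bps: int, fixed_minor: int) -> int:
    """Fee in minor units; rate is in basis points (290 bps = 2.9%)."""
    # Integer arithmetic with round-half-up, so no float touches the money.
    return (gross_minor * rate_bps + 5_000) // 10_000 + fixed_minor


gross = 4999                                      # $49.99 charge
fee = psp_fee(gross, rate_bps=290, fixed_minor=30)
net = gross - fee
print(f"gross={gross} fee={fee} net={net}")
```

Here the percentage part is 144.971 cents, which rounds to 145; adding the fixed 30 cents gives a fee of 175 and a net settlement of 4824 cents.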
9. PCI DSS Compliance Architecture
PCI DSS (Payment Card Industry Data Security Standard) mandates how cardholder data is stored, transmitted, and processed. The primary strategy is to avoid storing cardholder data at all.
Tokenization Strategy
- Use PSP-hosted payment forms (Stripe Elements, Adyen Drop-in) — raw card numbers never touch your servers
- The PSP tokenizes the card and returns a payment method token (e.g., pm_visa_xxx)
- Store only tokens; never store PAN, CVV, or full card numbers in your database
- This reduces your PCI scope from SAQ D (most complex) to SAQ A (simplest)
- Encrypt tokens at rest using AES-256; rotate encryption keys annually
- Network-level: TLS 1.2+ on all payment endpoints; mTLS between internal services
10. Scaling to Millions of Transactions
Database Scaling
- Shard payments by customer_id: Hash-based sharding distributes load evenly; keeps all transactions for a customer on one shard (locality for reconciliation)
- Read replicas: Route reporting and reconciliation queries to read replicas — never hit the write primary for analytics
- Ledger partitioning: Partition ledger_entries by month; old partitions become immutable and can be archived to cold storage
- CQRS for balance queries: Maintain a materialized balance table updated by ledger events; avoids expensive SUM queries on every balance check
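Hash-based shard routing by customer_id can be as small as the sketch below. `shard_for` is a hypothetical helper; it uses SHA-256 rather than Python's built-in `hash()`, because the built-in string hash is randomized per process and would route the same customer differently after a restart.

```python
import hashlib


def shard_for(customer_id: str, num_shards: int) -> int:
    """Map a customer to a shard index that is stable across processes."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


shard = shard_for("cust_abc123", 16)
print(f"cust_abc123 -> shard {shard}")
```

Plain modulo routing remaps most customers when the shard count changes, so in practice the shard count is usually fixed up front or placed behind a lookup layer.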
Fraud Detection at Scale
Real-time fraud scoring must add minimal latency (<100ms) to the payment flow. Architecture:
- Feature store (Redis) pre-computes velocity features: transactions per hour per card, per IP, per device
- ML model (XGBoost/LightGBM) scores the transaction using pre-computed features — inference in <20ms
- Rule engine applies configurable thresholds: block score > 0.9, 3DS challenge score 0.7–0.9, allow < 0.7
- Offline: retrain model weekly on labeled fraud/not-fraud data from manual review queue
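The velocity feature and the threshold rules can be sketched as follows. `VelocityCounter` is an in-memory stand-in for the Redis feature store and `decide` is a hypothetical rule function. The list above does not say which side the boundary values 0.7 and 0.9 fall on, so treating both as "challenge" is an assumption.

```python
from collections import defaultdict, deque


class VelocityCounter:
    """Sliding-window event count per key (e.g. per card, IP, or device)."""

    def __init__(self, window_seconds=3600):
        self.window = window_seconds
        self.events = defaultdict(deque)

    def record(self, key, now):
        """Record an event at time `now` and return the count inside the window."""
        q = self.events[key]
        q.append(now)
        while q and q[0] <= now - self.window:
            q.popleft()                  # drop events that fell out of the window
        return len(q)


def decide(score: float) -> str:
    if score > 0.9:
        return "block"
    if score >= 0.7:
        return "challenge_3ds"
    return "allow"


velocity = VelocityCounter(window_seconds=3600)
counts = [velocity.record("card_1", t) for t in (0, 10, 3700)]
decisions = [decide(s) for s in (0.95, 0.9, 0.7, 0.2)]
```

The third event lands after the first two have aged out of the one-hour window, so the count drops back to one. The model itself would consume counts like these as input features.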
11. Design Checklist & Conclusion
Payment systems reward defensive engineering. Every shortcut in reliability manifests as a chargeback, a regulatory audit, or an angry customer. Before going live, validate:
Payment System Production Checklist
- ☐ Every write endpoint implements idempotency key checking with Redis
- ☐ Payment state machine enforces valid transitions (no charge of an already-succeeded payment)
- ☐ Outbox pattern used for all Kafka event publishing (no lost events)
- ☐ PSP calls have timeout, retry, and circuit breaker logic
- ☐ Webhook handlers verify signatures and process idempotently
- ☐ Ledger is append-only and uses integer cents for all amounts
- ☐ Daily reconciliation pipeline runs and alerts on unmatched transactions
- ☐ No raw card data (PAN, CVV) stored anywhere in your system
- ☐ Fraud scoring runs inline in <100ms, so it can gate risky payments without noticeably slowing the payment flow
- ☐ Load tested to 2× peak TPS with idempotency under concurrent retry storm
A production payment system is one of the most complex distributed systems you'll build, not because of algorithmic complexity, but because the failure modes are financial and legal. Start with a single PSP integration and add resilience layers incrementally. Rely on managed PSPs (Stripe, Adyen) rather than reinventing payment rails. Focus your engineering effort on idempotency, the ledger, reconciliation, and fraud: these are the layers that differentiate reliable payment infrastructure.