Designing a High-Throughput Payment Gateway
I recently came across this topic and wanted to understand how payment gateways are built — especially ones that handle thousands of transactions per second. Clients call the API to charge a customer. Networks are unreliable, so clients retry. How do you avoid double-charging? How do you scale? What does the architecture look like?
This post is my attempt to learn and document how such systems are designed. I am not building one; I am trying to understand the patterns and tradeoffs that engineers use when they do.
The Core Problem: Exactly-Once in a World of At-Least-Once
Distributed systems do not give you exactly-once delivery. You get at-most-once or at-least-once. Payment processing requires exactly-once semantics — a charge must happen once and only once, regardless of how many times the client sends the request.
The solution is idempotency. Every charge request carries a client-generated idempotency key. If the system has already processed a request with that key, it returns the original result. If not, it processes it and stores the result. The client can retry as many times as it wants. The outcome is always the same.
This sounds simple. It is not.
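In its simplest form, the idea can be sketched like this (a plain dict stands in for the idempotency store; in a real system this is Redis plus a database table, and all names here are illustrative):

```python
import uuid

# In-memory stand-in for the idempotency store; a real system uses
# Redis plus a durable database table.
_store = {}

def charge(idempotency_key, amount_cents):
    """Process a charge at most once per idempotency key."""
    if idempotency_key in _store:
        return _store[idempotency_key]          # replay the original result
    result = {"charge_id": str(uuid.uuid4()),
              "amount": amount_cents, "status": "succeeded"}
    _store[idempotency_key] = result            # record the outcome
    return result

key = str(uuid.uuid4())                         # client-generated key
first = charge(key, 1999)
retry = charge(key, 1999)                       # a network retry reuses the key
assert first == retry                           # charged once, same outcome
```

The rest of this post is about why making this hold under crashes, timeouts, and concurrent retries is much harder than the sketch suggests.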
High-Level Architecture
The system has three layers:
Ingress — API Gateway handles TLS termination, request validation, and rate limiting per client. An Application Load Balancer distributes traffic across the charge service fleet.
Processing — The charge service is the core. It validates the request, enforces idempotency, calls the downstream payment service provider, and returns the result. It runs on ECS Fargate for predictable scaling without managing instances.
Post-processing — After a charge succeeds, async work (ledger entries, webhook notifications, reconciliation) flows through SQS FIFO queues to maintain ordering guarantees per merchant.
The Idempotency Layer
This is the most critical component in the system. Every design decision flows from one requirement: a charge with the same idempotency key must produce the same result, no matter how many times it is submitted.
The Request Lifecycle
The flow has three tiers of lookup:
1. Redis cache — Hot path. Most retries hit within seconds. A cache hit returns the stored response immediately without touching the database. This handles the majority of retry traffic.
2. Database row lock — If the cache misses, we check Aurora. The idempotency key is the primary key of the `idempotency_requests` table. We use `SELECT ... FOR UPDATE` to acquire a row-level lock.
3. Insert and process — If no row exists, we insert one with `status = 'in_progress'` and proceed to call the payment service provider. This insert acts as a distributed lock. Any concurrent retry will see the `in_progress` row and receive a `409 Conflict`, telling the client to retry later.
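The three tiers can be sketched as follows (dicts stand in for Redis and the Aurora table; real code wraps tier 2 in a transaction with `SELECT ... FOR UPDATE`, which the sketch cannot show):

```python
# Stand-ins: `cache` for Redis, `rows` for the idempotency_requests table.
cache = {}
rows = {}

def begin_charge(key):
    """Return ('replay', response), ('conflict', None), or ('proceed', None)."""
    if key in cache:                             # tier 1: hot-path cache hit
        return ("replay", cache[key])
    row = rows.get(key)                          # tier 2: row lookup (row-locked in reality)
    if row is not None:
        if row["status"] == "in_progress":
            return ("conflict", None)            # surfaced to the client as a 409
        return ("replay", row["response"])       # completed earlier: replay the result
    rows[key] = {"status": "in_progress", "response": None}   # tier 3: insert = lock
    return ("proceed", None)

def complete_charge(key, response):
    rows[key] = {"status": "succeeded", "response": response}
    cache[key] = response                        # warm the cache for fast retries
```

A first attempt gets `proceed`, a concurrent retry gets `conflict`, and any retry after completion gets `replay` with the original response.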
The Idempotency Table
```sql
CREATE TABLE idempotency_requests (
    idempotency_key TEXT PRIMARY KEY,
    client_id       TEXT NOT NULL,
    request_hash    TEXT NOT NULL,
    status          TEXT NOT NULL DEFAULT 'in_progress',
    response_code   INTEGER,
    response_body   JSONB,
    created_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
    completed_at    TIMESTAMPTZ,
    expires_at      TIMESTAMPTZ NOT NULL DEFAULT now() + INTERVAL '24 hours'
);

CREATE INDEX idx_idempotency_expires ON idempotency_requests (expires_at)
    WHERE status != 'in_progress';
```

Three details that matter:
request_hash — The system hashes the request body and stores it alongside the idempotency key. If a client reuses an idempotency key with a different request body, the request is rejected with a 422. This prevents accidental key reuse from causing silent data corruption.
expires_at — Idempotency records expire after 24 hours. A background job (Lambda on a schedule) prunes expired rows. Without expiry, the table grows without bound.
status = in_progress — This is the distributed lock. If the charge service crashes after inserting but before completing, a reaper job detects stale in_progress rows (older than 5 minutes) and marks them as failed, allowing the client to retry cleanly.
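The `request_hash` check might look like this (SHA-256 over canonical JSON is one reasonable choice; the exact canonicalization scheme and the function names are assumptions, not a prescribed implementation):

```python
import hashlib
import json

def request_hash(body: dict) -> str:
    # Canonicalize (sorted keys, no extra whitespace) so equivalent
    # bodies always produce the same hash.
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def check_reuse(stored_hash: str, body: dict) -> int:
    """HTTP status to return: 200 to proceed, 422 on key reuse with a new body."""
    return 200 if request_hash(body) == stored_hash else 422

stored = request_hash({"amount": 1999, "currency": "usd"})
assert check_reuse(stored, {"currency": "usd", "amount": 1999}) == 200  # same body
assert check_reuse(stored, {"amount": 2999, "currency": "usd"}) == 422  # key reuse
```

Note that key order does not matter: the canonicalization step is what makes two logically identical bodies hash equally.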
Calling the Payment Service Provider
The downstream call to Stripe, Adyen, or whichever PSP the merchant has configured is the most failure-prone part of the system. Networks fail. PSPs have outages. Timeouts are ambiguous — did the charge go through or not?
The critical case is the timeout. The system does not know if the PSP processed the charge. It cannot retry blindly — that risks double-charging. It cannot assume failure — that risks lost revenue.
The approach: mark the record as psp_uncertain and enqueue it for reconciliation. A separate worker queries the PSP's API (using the idempotency key, which is forwarded to the PSP) to determine the actual outcome. Most PSPs support idempotency on their end, which makes this retriable and safe.
The API returns a 202 Accepted to the client with a status: pending payload. The client can poll or wait for a webhook.
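The timeout path can be sketched like this (the queue, the `statuses` map, and the function names are illustrative stand-ins for the SQS reconciliation queue and the idempotency record's status column):

```python
reconciliation_queue = []     # stand-in for the SQS reconciliation queue
statuses = {}                 # stand-in for the idempotency record's status

def attempt_charge(psp_call, key):
    """Call the PSP; on timeout, defer to reconciliation instead of guessing."""
    try:
        result = psp_call(key)
        statuses[key] = result
        return {"status": result, "http": 200}
    except TimeoutError:
        statuses[key] = "psp_uncertain"             # outcome genuinely unknown
        reconciliation_queue.append(key)            # worker queries the PSP later
        return {"status": "pending", "http": 202}   # client polls or awaits webhook

assert attempt_charge(lambda k: "succeeded", "k1") == {"status": "succeeded", "http": 200}

def timed_out(key):
    raise TimeoutError                              # simulate an ambiguous timeout

assert attempt_charge(timed_out, "k2") == {"status": "pending", "http": 202}
assert statuses["k2"] == "psp_uncertain" and reconciliation_queue == ["k2"]
```

The key property: a timeout never resolves to success or failure inline; it always routes through reconciliation.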
Scaling for Throughput
Thousands of transactions per second is not a small number, but it is not an unusual one for payment systems. The bottleneck is almost always the database.
Database Strategy
Aurora PostgreSQL in Multi-AZ provides a writer instance for mutations and reader replicas for read-heavy idempotency lookups. The Redis cache in front absorbs the majority of retry reads, so the database primarily handles writes and cache misses.
Key decisions:
Connection pooling — ECS tasks connect through RDS Proxy. Without it, thousands of short-lived connections from Fargate tasks would overwhelm the database connection limit. RDS Proxy multiplexes connections and handles failover transparently.
Partitioning — The idempotency_requests table is range-partitioned by created_at (monthly). Pruning old records becomes a fast partition drop instead of a slow DELETE over millions of rows.
Write optimization — Ledger entries are batched using SQS FIFO queues. The charge service writes the idempotency record synchronously (the client is waiting), but ledger entries, analytics events, and webhook dispatches happen asynchronously. This keeps the hot path fast.
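The sync/async split can be sketched as follows (the callables stand in for the database write and the SQS FIFO enqueue; the topic names are made up for illustration):

```python
def handle_charge(request, write_record, enqueue):
    """Hot path: one synchronous DB write; everything else is queued."""
    record = {"key": request["idempotency_key"], "status": "succeeded"}
    write_record(record)                        # synchronous: the client is waiting
    for topic in ("ledger", "analytics", "webhook"):
        enqueue((topic, record["key"]))         # asynchronous: SQS FIFO in reality
    return record

written, queued = [], []
out = handle_charge({"idempotency_key": "k1"}, written.append, queued.append)
assert out["status"] == "succeeded"
assert written == [out] and len(queued) == 3
```

Only the idempotency record write sits on the client's critical path; the fan-out happens after the response is already on its way.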
Auto-Scaling
The charge service scales on CPU utilization. The ledger workers scale on SQS queue depth — if the queue is growing, more consumers are needed. API Gateway enforces per-client rate limits so that one misbehaving client cannot starve others.
Observability
A payment system without observability is a liability. When a merchant reports a missing charge, you need to trace the exact path of that request in seconds, not hours.
Typical observability stack:
- Structured logging — Every log line includes `idempotency_key`, `client_id`, `trace_id`, and `psp_reference`. CloudWatch Logs with JSON formatting.
- Distributed tracing — AWS X-Ray traces a request from API Gateway through the charge service to the PSP call and back. The trace ID propagates through SQS to the async workers.
- Metrics — CloudWatch custom metrics for charge success rate, PSP latency percentiles (p50, p95, p99), idempotency cache hit rate, and `psp_uncertain` count. Alarms fire on anomalies.
- Audit trail — Every state transition of an idempotency record is logged to an append-only audit table. This is non-negotiable for PCI compliance and dispute resolution.
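A structured log line along these lines (field names match the list above; the event name and values are made up):

```python
import json
import time

def log_event(event, **fields):
    """Emit one structured JSON log line carrying the correlation fields."""
    record = {"ts": time.time(), "event": event, **fields}
    print(json.dumps(record, sort_keys=True))
    return record

entry = log_event("charge.succeeded",
                  idempotency_key="idem_123", client_id="acct_42",
                  trace_id="1-abc-def", psp_reference="ch_789")
assert entry["idempotency_key"] == "idem_123"
```

Because every line carries the same four correlation fields, a single `idempotency_key` filter reconstructs the full path of a disputed charge.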
Common Pitfalls (From What I've Read)
Starting with standard SQS instead of FIFO. Standard queues do not guarantee ordering. For ledger entries and reconciliation, ordering matters. FIFO queues with message group IDs (keyed on merchant ID) are the right choice from the start.
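The per-merchant ordering idea, sketched with an in-memory stand-in for a FIFO queue (real code passes `MessageGroupId=merchant_id` to SQS `send_message` on a `.fifo` queue; `send`/`receive` here are illustrative):

```python
from collections import defaultdict, deque

# Ordering holds within a message group; groups drain independently.
groups = defaultdict(deque)

def send(merchant_id, message):
    groups[merchant_id].append(message)

def receive(merchant_id):
    return groups[merchant_id].popleft()

send("m1", "ledger:1"); send("m1", "ledger:2"); send("m2", "ledger:A")
assert receive("m1") == "ledger:1"   # m1's entries stay in order
assert receive("m2") == "ledger:A"   # m2 is independent of m1
assert receive("m1") == "ledger:2"
```

One slow merchant cannot reorder or block another merchant's ledger entries, which is exactly the guarantee standard queues lack.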
Treating reconciliation as an afterthought. Reconciliation is how you recover from PSP timeouts and ambiguous failures. It should be designed early, not bolted on later.
Underestimating the importance of contract testing. PSP APIs change. Rate limit behaviors change. Error response formats change. Contract tests against PSP sandbox environments catch these before production.
Summary
A payment gateway is not primarily a throughput problem. It is a correctness problem. The system must be correct under every failure mode: client retries, PSP timeouts, database failovers, service crashes mid-transaction.
Idempotency is the foundation. Everything else — caching, scaling, async processing — is built on top of the guarantee that a charge request with the same key always produces the same result. Get that right, and the rest is engineering. Get it wrong, and no amount of throughput will save you.