The Idempotency Question: Your Secret Weapon for Acing System Design

Level: Senior/Staff | Asked at: Stripe, Shopify, Amazon, Netflix, FinTech

Q: "You're designing a payment API. A client might send the same `create_payment` request twice due to a network timeout. How do you prevent the customer from being charged twice?"

Why this matters: This isn't a vocabulary quiz. This is a probe into your understanding of real-world distributed systems. The internet is unreliable. Can you build systems that protect user data and the company's reputation from that chaos?

Interview frequency: Guaranteed at any company that moves money or manages critical state.

❌ The Death Trap

Most candidates give the sterile, textbook definition. They show they passed a computer science class, not that they've ever dealt with a real production outage caused by duplicate events.

Most people say: "An operation is idempotent if it can be applied multiple times without changing the result beyond the initial application. So for a POST request, we'd need to make it idempotent..."

🔄 The Reframe

What they're really asking: "Have you ever felt the panic of a double-charging bug? Do you viscerally understand that 'at-least-once' delivery is the default state of the universe, and that you must proactively build safety mechanisms to achieve 'exactly-once' processing?"

This reveals: Your seniority. Junior engineers build for the happy path. Senior engineers build for the storm. This question separates them.

🧠 The Mental Model

Think of it as the "Elevator Button" principle.

1. The First Push: You push the elevator button. The system's state changes from 'idle' to 'called'. A light comes on. The request is processed.

2. The Impatient Retries: You, or someone else, pushes the same button again. The system recognizes it has already processed this *exact* request.

3. The Safe No-Op: The system does *not* call a second elevator. It simply acknowledges the request and ensures the light stays on. The result is the same, no matter how many times you push the button. The first push changes the world; the rest are ignored.

📖 The War Story

Situation: "At a previous e-commerce company, we launched a new mobile app for our yearly flash sale. The app was built by a third-party and had very aggressive retry logic on its API calls."

Challenge: "On the day of the sale, our checkout service's database had a minor latency spike—we're talking 300ms instead of 100ms. The mobile app's short timeout saw this as a failure and immediately retried the `POST /api/orders` request. But our server had already received and was processing the first request."

Stakes: "We double- and sometimes triple-charged over 800 customers in 15 minutes before we shut it down. The financial cost of refunds was around $50,000, but the damage to our brand's trust was immeasurable. Our customer support team was flooded, and Twitter was on fire. It was an all-hands-on-deck crisis."

✅ The Answer

My Thinking Process:

"The core problem wasn't the client retrying; that's expected in a mobile environment. The problem was our API was fragile because it treated every single `POST` request as a brand new, unique instruction to create an order and charge a card. I realized we needed to shift from processing *requests* to processing *intentions*. The user's intention was 'create one order,' and we needed to honor that, no matter how many duplicate requests we received."

What I Did:

"I led the post-mortem and implemented a fix using an `Idempotency-Key`. The solution had two parts. First, the client API library was updated to generate a unique UUIDv4 for every logical user action and to reuse that same value on every retry of that action, sending it in the `Idempotency-Key` HTTP header. Second, I modified our backend order service. When a request came in, it would:
1. Extract the `Idempotency-Key` from the header.
2. Check a Redis cache for this key.
3. If the key was **missing**, we'd start a database transaction, create the order, charge the card, and then—critically—save the HTTP response (e.g., `201 Created` with the order ID) in Redis with that key and a 24-hour TTL.
4. If the key was **present**, we'd immediately stop processing and return the saved response from Redis. This prevented any duplicate database writes or payment gateway calls."
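
The four steps above can be sketched in a few lines. This is a minimal illustration, not the production code from the story: a plain dict stands in for the Redis cache, the `OrderService` class, the `charges` list, and the `amount_cents` field are hypothetical names, and the TTL is omitted. It is also single-threaded, so it does not address two identical requests arriving at the same instant.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class OrderService:
    """Toy order service; a dict stands in for the Redis cache (assumption)."""
    responses: dict = field(default_factory=dict)  # Idempotency-Key -> saved response
    charges: list = field(default_factory=list)    # one entry per (hypothetical) gateway call

    def create_order(self, headers: dict, body: dict) -> dict:
        # Step 1: extract the key from the header.
        key = headers.get("Idempotency-Key")
        if key is None:
            return {"status": 400, "error": "Idempotency-Key header is required"}

        # Steps 2 and 4: key present, so replay the saved response.
        saved = self.responses.get(key)
        if saved is not None:
            return saved

        # Step 3: key missing, so do the work once and save the response.
        order_id = f"order_{len(self.charges) + 1}"
        self.charges.append((order_id, body["amount_cents"]))
        response = {"status": 201, "order_id": order_id}
        self.responses[key] = response  # a real cache entry would also carry a TTL
        return response


service = OrderService()
key = str(uuid.uuid4())  # generated once per logical user action
headers = {"Idempotency-Key": key}
first = service.create_order(headers, {"amount_cents": 4999})
retry = service.create_order(headers, {"amount_cents": 4999})  # retry reuses the key
print(first == retry, len(service.charges))  # True 1
```

The retry gets back the same `201` response as the first call, and the list of charges contains exactly one entry.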

The Outcome:

"Duplicate order incidents dropped to zero. This became a core principle of our API design. We even exposed this requirement in our public developer documentation, which increased our partners' confidence in our platform's reliability. The pattern was so successful it was adopted across all our state-changing write APIs."

What I Learned:

"I learned that resilience isn't about preventing failures; it's about making your system behave correctly when they inevitably happen. Idempotency isn't just a fancy word; it's a fundamental contract between a client and a server to ensure safety in a chaotic network environment. You have to design for chaos."

🎯 The Memorable Hook

"A retry without idempotency is a prayer. A retry *with* idempotency is an engineering guarantee."

This shows you see beyond the mechanism to the philosophy. You understand you are in the business of creating certainty and trust, which is the foundation of all great engineering.

💭 Inevitable Follow-ups

Q: "Why Redis? What are the trade-offs of storing the key in your main database vs. a cache?"

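Be ready: Talk about performance and isolation. Redis is extremely fast for this kind of check. Using the main DB adds load to your critical path and can be slower. The trade-off is durability and consistency: if Redis fails or evicts the key, a retry looks like a brand new request, and the cache write is not atomic with the order write. Storing the key in the main database under a unique constraint, in the same transaction as the order, closes that gap at the cost of the extra load. You can mitigate the Redis risk with a multi-layered check or by accepting a tiny risk window.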
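
To make the main-database option concrete, here is a rough sketch using Python's built-in `sqlite3` module as a stand-in for the real database. The `payments` table, its columns, and the `create_payment` function are assumptions made for illustration. The point is that a uniqueness constraint on the key lets the database itself reject the duplicate, atomically with the write.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payments ("
    " idempotency_key TEXT PRIMARY KEY,"
    " amount_cents INTEGER NOT NULL)"
)


def create_payment(key: str, amount_cents: int) -> str:
    """Insert the payment once; a duplicate key violates the PRIMARY KEY."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "INSERT INTO payments (idempotency_key, amount_cents) VALUES (?, ?)",
                (key, amount_cents),
            )
        return "created"
    except sqlite3.IntegrityError:
        return "duplicate"


first_result = create_payment("key-abc", 4999)
second_result = create_payment("key-abc", 4999)
row_count = conn.execute("SELECT COUNT(*) FROM payments").fetchone()[0]
print(first_result, second_result, row_count)  # created duplicate 1
```

Because the key lives in the same table as the payment, there is no window in which the payment exists but the key does not.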

Q: "How does this concept apply to HTTP methods like GET, PUT, and DELETE?"

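Be ready: Explain that the HTTP spec defines GET, PUT, and DELETE as idempotent: repeating the request leaves the server in the same state as sending it once. It does not promise the same response every time. GET is also safe (read-only), so `GET /users/123` changes nothing, although what it returns can differ if the user is updated in between. `PUT /users/123` with the same body sets the user to that state, no matter how many times you send it. `DELETE /users/123` leaves the user deleted; a repeat may return `404`, but the state is unchanged. `POST` is not guaranteed to be idempotent, since it's typically used for creation, which is why it needs special handling.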
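
A small in-memory model can make the distinction tangible. The function names and the `users` dict below are hypothetical stand-ins for real HTTP handlers; the sketch only shows how repeating each operation affects server state.

```python
import itertools

users = {}                      # user_id -> record; stands in for server state
_next_id = itertools.count(1)   # server-assigned IDs for POST


def put_user(user_id: int, body: dict) -> None:
    """PUT: set the resource to exactly this state."""
    users[user_id] = dict(body)


def delete_user(user_id: int) -> int:
    """DELETE: remove the resource; return an HTTP-like status code."""
    existed = users.pop(user_id, None) is not None
    return 204 if existed else 404


def post_user(body: dict) -> int:
    """POST: create a new resource under a fresh ID on every call."""
    user_id = next(_next_id)
    users[user_id] = dict(body)
    return user_id


put_user(123, {"name": "Ada"})
put_user(123, {"name": "Ada"})
count_after_put = len(users)     # 1: the second PUT changed nothing

post_user({"name": "Bob"})
post_user({"name": "Bob"})
count_after_post = len(users)    # 3: each POST created another record

first_delete = delete_user(123)  # 204
second_delete = delete_user(123) # 404, but the state is the same as after the first
```

Note that the repeated DELETE returns a different status code while leaving the same state, which is exactly the distinction between idempotent effects and identical responses.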

🔄 Adapt This Framework

If you're junior: Focus on the core logic. "I'd make sure the client sends a unique ID for the transaction, like a `transaction_id`. On the backend, before creating a payment record, I'd first query the database to see if a payment with that `transaction_id` already exists. If it does, I'd return success without creating a new one."

If you're senior: Talk about the edge cases. "A simple cache-and-check works, but you have to consider race conditions. What if two identical requests arrive at the same time before the key is written? You need a distributed lock or an atomic 'set-if-absent' operation in your cache. You also need a strategy for garbage collecting old keys and deciding on an appropriate TTL based on business requirements."
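
The race described here can be sketched with threads. The `KeyStore` class below is a hypothetical in-memory stand-in for the cache, with a lock providing the atomicity that Redis offers through `SET key value NX EX seconds`. The important detail is that the key is claimed before any work is done, so only one of the concurrent duplicates proceeds.

```python
import threading


class KeyStore:
    """In-memory stand-in for a cache with an atomic set-if-absent (assumption)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._keys = {}

    def set_if_absent(self, key: str, value: str) -> bool:
        """Return True only for the single caller that claims the key first."""
        with self._lock:
            if key in self._keys:
                return False
            self._keys[key] = value
            return True


store = KeyStore()
charges = []


def handle_request(key: str) -> None:
    # Claim the key BEFORE doing any work, so concurrent duplicates lose the race.
    if store.set_if_absent(key, "in_progress"):
        charges.append(key)  # stands in for the one real charge


threads = [threading.Thread(target=handle_request, args=("key-abc",)) for _ in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(charges))  # 1
```

A fuller design would also need to decide what the losing requests return while the winner is still in progress, for example a conflict response or a short wait for the saved result.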

If you lack direct payment experience: Apply it to another domain. "I haven't built a payment system, but I faced a similar issue with a job processing queue. We had to ensure that if a worker picked up a job and died before acknowledging it, another worker could safely re-process it without causing duplicate side effects. We solved it by making each job processor idempotent based on the unique job ID."
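
The job-queue version of the same idea might look like the sketch below. The job shape (`id`, `to`), the `process_job` function, and the in-memory set are assumptions for illustration; a real worker would keep processed IDs in durable storage.

```python
processed_job_ids = set()   # stands in for a durable record of completed jobs
emails_sent = []            # stands in for the job's side effect


def process_job(job: dict) -> bool:
    """Run the side effect once per job ID; return False for a redelivery."""
    if job["id"] in processed_job_ids:
        return False
    emails_sent.append(job["to"])       # the side effect
    processed_job_ids.add(job["id"])    # record completion
    return True


job = {"id": "job-42", "to": "customer@example.com"}
first_run = process_job(job)    # True: side effect performed
redelivery = process_job(job)   # False: queue redelivered the job, nothing happens
print(first_run, redelivery, len(emails_sent))  # True False 1
```

One caveat worth raising in the interview: if the side effect and the completion record are not written atomically, a crash between the two still leaves a small window for a duplicate.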