The Domino Chain: Decoding Production Fires Before They Burn You

Senior/Staff Engineer Asked at: FAANG, Unicorns, Startups

Q: Walk me through how you’d debug a production incident where you see a high CPU spike on your service, the primary database disk is rapidly filling up, database query latency is through the roof, and dependent APIs are timing out.

Why this matters: This isn't four separate problems; it's a cascading failure. They're testing your ability to think systemically under pressure, not just your knowledge of Linux commands. Can you find the first domino that fell, or will you just chase the falling ones?

Interview frequency: Almost guaranteed in any Senior+ SRE or Backend interview.

❌ The Death Trap

Most candidates dive straight into tactical, symptom-based fixes. They treat the scenario as a checklist of four independent issues, which signals a lack of systemic thinking.

"Most people say: 'First, I'd SSH into the server and run top to see the high CPU process. Then I'd check the disk with df -h and try to delete old logs. For the DB, I'd restart it. For the API, I'd increase the timeout value.'"

This is like treating a heart attack with a band-aid, an ice pack, and a painkiller. You're addressing the symptoms, not the disease, and you might kill the patient.

🔄 The Reframe

What they're really asking: "Describe your mental model for diagnosing a complex, cascading system failure while minimizing customer impact."

This reveals your composure, your grasp of cause-and-effect in distributed systems, and whether you prioritize stabilizing the system over blindly "fixing" things.

🧠 The Mental Model

I call this the **"Triage, Trace, and Treat"** framework. It moves from managing impact to finding the cause.

1. Triage (Stop the Bleeding): What is the fastest, safest action to reduce customer impact *right now*? This is about mitigation, not diagnosis. Think feature flags, routing traffic away, or scaling up a stateless service.
2. Trace (Find the First Domino): The symptoms (CPU, disk, latency, timeouts) are a chain reaction. Start with the symptom closest to the user (API timeouts) and work backward, asking "Why?" at each step until you find the source.
3. Treat (Apply the Real Cure): Address the root cause you discovered. This should be a precise, targeted fix. Then, validate that all symptoms have subsided and the system is stable.

📖 The War Story

Situation: "At my previous company, a B2C e-commerce platform, we were in the middle of a massive Black Friday flash sale. I was the on-call lead for the 'Order Processing' domain."

Challenge: "At 2:05 PM, PagerDuty lit up like a Christmas tree. Our primary order service CPU utilization was pegged at 99%. The main PostgreSQL database disk usage was climbing 1GB per minute. p99 query latency shot up from 50ms to 5000ms. And the most critical metric: the public-facing /v1/checkout API was timing out for over 60% of users."

Stakes: "We were effectively down during our highest revenue hour of the year. Every minute of downtime meant tens of thousands in lost sales and a massive hit to customer trust."

✅ The Answer

My Thinking Process (Applying the Framework):

"Okay, this is a cascading failure. The four alerts aren't four problems; they're four data points telling one story. The API timeouts are the customer pain, and the DB latency is almost certainly causing them. The high CPU and disk usage are symptoms of a struggling database. I won't touch the servers yet. First, I need to buy us breathing room."

What I Did (Triage, Trace, Treat):

1. Triage: "The absolute first thing I did was communicate in the incident Slack channel: 'Acknowledged, investigating.' Then I asked my team to enable a pre-built feature flag that disabled a non-critical but database-intensive feature on the checkout page: the 'real-time inventory check across all warehouses.' This immediately shed about 20% of the load. It wasn't a fix, but it bought us time."
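If flags happen to live in a database table, the kill switch can be a single row flip. A hypothetical sketch (the `feature_flags` table and flag name are illustrative, not from the incident; most teams would do this through a flag service instead):

```sql
-- Hypothetical kill switch: disable the DB-heavy inventory check.
-- Assumes flags are stored in a feature_flags table; in practice this
-- is usually done through a flag service (LaunchDarkly, Unleash, etc.).
UPDATE feature_flags
SET enabled = false, updated_at = now()
WHERE name = 'realtime_inventory_check';
```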

2. Trace: "Now I followed the thread backward from the user."

  • "Why are APIs timing out? Our dashboards showed the timeouts were from requests waiting on our primary PostgreSQL database."
  • "Why is the DB slow? I checked the database monitoring tools. The slow query log was filled with thousands of entries for one specific query related to fetching a user's order history for the loyalty program."
  • "Why is *this* query slow? I ran an EXPLAIN ANALYZE on it. It revealed a full sequential scan on our massive `orders` table. A new developer had recently added a WHERE clause on a `jsonb` field that wasn't indexed."
  • "Why the high CPU and full disk? The full table scan on millions of rows was consuming all the database CPU. Worse, PostgreSQL was creating huge temporary files on disk to sort and process the unindexed query, which explained why the disk was filling up at an alarming rate. That was the first domino: a single, unindexed query."
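The backward trace above maps to a few standard PostgreSQL queries. A sketch, assuming `pg_stat_statements` is enabled (column names per PostgreSQL 13+) and with an illustrative filter shape for the suspect query:

```sql
-- 1. Which queries are eating the database? (requires pg_stat_statements)
SELECT query, calls, mean_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 5;

-- 2. Why is the suspect query slow? A "Seq Scan on orders" node in this
--    output is the smoking gun.
EXPLAIN ANALYZE
SELECT * FROM orders WHERE loyalty_data @> '{"tier": "gold"}';

-- 3. Why is the disk filling? Temp files spilled by the unindexed
--    sort/scan show up here.
SELECT datname, temp_files, pg_size_pretty(temp_bytes) AS temp_spill
FROM pg_stat_database
ORDER BY temp_bytes DESC;
```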

3. Treat: "The long-term fix was to refactor the feature, but we needed an immediate cure. I crafted a migration to add a GIN index to that specific `jsonb` field:

```sql
CREATE INDEX CONCURRENTLY orders_loyalty_data_idx ON orders USING GIN (loyalty_data);
```

Using CONCURRENTLY was critical to avoid locking the table during the sale. The team peer-reviewed it in 2 minutes. We deployed it. The moment the index build finished, the results were instantaneous."
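One caveat worth naming in the interview: a default GIN index on `jsonb` accelerates containment and existence operators (`@>`, `?`), not plain `->>` equality, so the fix only lands if the query's operator matches the index. A quick post-deploy sanity check (query shape illustrative):

```sql
-- Re-run the plan; a Bitmap Index Scan on the new index confirms the cure.
EXPLAIN
SELECT * FROM orders WHERE loyalty_data @> '{"tier": "gold"}';

-- Also confirm the CONCURRENTLY build actually finished: a failed
-- concurrent build leaves an INVALID index behind.
SELECT indexrelid::regclass, indisvalid
FROM pg_index
WHERE indexrelid = 'orders_loyalty_data_idx'::regclass;
```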

The Outcome:

"Within 90 seconds of the deploy, database CPU dropped to a normal 15%, disk usage flatlined, p99 latency went back to under 50ms, and the checkout API success rate hit 99.99%. We salvaged the rest of the sales event. The key takeaway in our post-mortem was to implement automated query analysis in our CI/CD pipeline to prevent unindexed queries from ever reaching production again."

What I Learned:

"I learned that in a crisis, the most valuable currency is calm thinking. Anyone can read metrics; the skill is in synthesizing them into a single narrative of cause and effect. And the most powerful tool isn't a command-line utility, it's a feature flag that lets you gracefully degrade the service to buy yourself time to think."

🎯 The Memorable Hook

This shows you're not just a technician; you're a diagnostician. You seek understanding, not just a quick fix. It demonstrates a level of maturity that separates senior engineers from the rest.

💭 Inevitable Follow-ups

Q: "What if adding the index hadn't worked? What was your plan B?"

Be ready: Talk about rolling back the recent deployment that introduced the query, failing over to a standby if the primary itself was damaged, or, in a worst-case scenario, disabling the feature issuing the query entirely to protect the core checkout flow.

Q: "How could this have been prevented?"

Be ready: Discuss proactive measures: load testing new features, automated query plan analysis in CI, better monitoring and alerting thresholds, and creating a culture of thorough code reviews for database-interacting code.
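The "automated query plan analysis" idea can be sketched in plain SQL: CI runs `EXPLAIN (FORMAT JSON)` for new or changed queries against a staging database and fails the build on bad plans (the query and threshold here are illustrative):

```sql
-- Emit the plan as JSON so a CI script can inspect it mechanically.
EXPLAIN (FORMAT JSON)
SELECT * FROM orders WHERE loyalty_data @> '{"tier": "gold"}';
-- A CI step then parses this output and fails the build if any plan node
-- has "Node Type": "Seq Scan" on a table above a row-count threshold.
```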

🔄 Adapt This Framework

If you're junior: You may not have been the lead. Frame it as "I was on the incident response team, and I was tasked with...". Focus on your specific role, like analyzing the logs or dashboards, while showing you understood the larger systemic picture the lead was piecing together.

If you're senior: Emphasize the communication and coordination aspects. "My first priority was to establish a clear line of communication... I delegated the dashboard monitoring to a junior engineer while I focused on the database... I kept stakeholders updated every 15 minutes." Show leadership beyond the technical fix.

If you lack this exact experience: Use a similar story about a complex bug. "This reminds me of a time we had a severe memory leak..." Then, apply the Triage, Trace, and Treat model to that scenario. The framework is universal, even if the symptoms change.

Written by Benito J D