Debugging the Domino Effect: From CPU Spike to API Timeout

Senior/Staff Engineer Asked at: Netflix, Uber, Stripe, AWS

Q: Walk me through how you’d debug a cascading failure: you get an alert for a high CPU spike on a service. As you investigate, you see disk usage climbing, database latency increasing, and dependent services reporting API timeouts. Where do you start?

Why this matters: This isn't a single problem; it's a test of your systems thinking. They want to know if you can navigate a storm of correlated failures to find the root cause, or if you'll just get lost chasing symptoms.

Interview frequency: Extremely high for SRE, Backend, and Platform roles. It separates engineers who fix bugs from those who build resilient systems.

❌ The Death Trap

95% of candidates fall into the "whack-a-mole" trap. They see four problems and try to solve them one by one, without a coherent strategy. This is reactive, not diagnostic.

"Most people say: 'First I'd run `top` to check the CPU. Then I'd see the disk is full and run `df -h` to see which partition. Then I'd check the database slow query log...' This is just a checklist. It shows you know commands, but not how to think under pressure."

🔄 The Reframe

What they're really asking is: "How do you build a narrative of cause and effect from chaotic signals? Can you find the first domino in a chain reaction, or will you just chase the falling ones?"

This reveals your ability to prioritize, form a testable hypothesis, and communicate clearly during a crisis—the core skills of a senior engineer.

🧠 The Mental Model

I call it the **TEHR Framework: Triage, Epicenter, Hypothesis, Remediation.** It moves from containing the blast radius to finding the single point of origin.

1. Triage & Stabilize (Stop the Bleeding): Your first job is not to find the root cause; it's to mitigate user impact. Can you roll back a recent deploy? Disable a feature flag? Divert traffic?
2. Find the Epicenter (Follow the Timeline): Don't treat all signals as equal. Dive into your observability platform. Which metric deviated *first*? The CPU, the disk, or the DB? The first domino tells you where to look.
3. Form a Hypothesis (Tell the Story): Based on the epicenter, construct a plausible story. "I believe a code change is writing massive temp files (disk full), which causes kernel I/O thrashing (CPU spike), which starves the DB connection pool (DB latency), which causes health checks to fail (API timeouts)."
4. Remediate & Learn (Vaccinate the System): Apply a precise fix, then ask the real question: "How do we make this entire *class* of problem impossible in the future?"
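The "first domino" search in step 2 can be sketched as a small script: given timestamped samples for each metric, find the earliest moment any metric deviates from its baseline. Everything here (the metric names, the 3-sigma rule, the baseline window) is an illustrative assumption, not a specific vendor API.

```python
def first_domino(series, window=5, sigmas=3.0):
    """Return the name of the metric whose first deviation is earliest.

    `series` maps metric name -> list of (timestamp, value) samples,
    sorted by time. The baseline is the mean/stddev of the first
    `window` samples; the first point deviating by more than `sigmas`
    standard deviations marks that metric's onset.
    """
    onsets = {}
    for name, points in series.items():
        if len(points) <= window:
            continue  # not enough history to establish a baseline
        baseline = [v for _, v in points[:window]]
        mean = sum(baseline) / len(baseline)
        var = sum((v - mean) ** 2 for v in baseline) / len(baseline)
        std = max(var ** 0.5, 1e-9)  # avoid div-by-zero on flat baselines
        for ts, v in points[window:]:
            if abs(v - mean) / std > sigmas:
                onsets[name] = ts  # record only the FIRST deviation
                break
    return min(onsets, key=onsets.get) if onsets else None


# Synthetic timeline (minutes past 9:00): disk spikes at 9:06, CPU at 9:08.
disk = [(m, 100.0) for m in range(6)] + [(6, 5000.0)]
cpu = [(m, 30.0) for m in range(8)] + [(8, 100.0)]
print(first_domino({"disk_bytes_written": disk, "cpu_util": cpu}))
```

Run against the war-story timeline below, this would point at the disk-write metric, not the louder CPU alarm.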

📖 The War Story

Situation: "At a fintech company, I was on-call for our core Transaction Processing service. It handled all card payments and was directly tied to revenue."

Challenge: "On a Monday morning peak, the on-call pager went off. P99 latency for our `/charge` endpoint had jumped from a stable 150ms to over 45 seconds. Our Grafana dashboard lit up: CPU on the transaction service nodes was pinned at 100%, disk usage was climbing at 1% per minute, and our Aurora PostgreSQL cluster was reporting critical replication lag."

Stakes: "Every minute of this outage meant thousands in failed payments and a catastrophic erosion of trust with our merchants. The war room was already open."

✅ The Answer

"Here's how I would apply my framework to that exact situation."

My Thinking Process (Triage & Stabilize):

"My absolute first thought isn't about debugging; it's about stopping the financial bleeding. A major new 'Risk Score Enrichment' feature, controlled by a feature flag, had been enabled an hour earlier. This is my prime suspect. The fastest path to stability is containment. I immediately announced in the war room, 'I'm disabling the Risk Enrichment feature flag to see if the system recovers. This is the least invasive, fastest-to-revert change.'"

What I Did (Find the Epicenter):

"We disabled the flag, and within 5 minutes, latency started to recover. Now that the immediate fire was out, we could investigate. I pulled up our Datadog dashboards, focusing on the timeline. At 9:05 AM, the feature flag was enabled. At 9:06 AM, the 'Disk Bytes Written' metric on the transaction service VMs spiked violently. CPU utilization started climbing at 9:08 AM. DB replication lag alerts fired at 9:12 AM. The API timeout alarms were last, at 9:15 AM. The disk was the first domino."

My Thinking Process (Form a Hypothesis):

"The story became clear: The new risk feature pulls a large dataset from S3, processes it, and is likely writing large temporary files to disk for each transaction. My hypothesis: a bug in the new code path is creating these temp files but failing to clean them up. This filled the disk, which caused the OS to spend all its time on disk I/O, spiking the CPU. With the server's I/O saturated, it couldn't write to the database fast enough, causing the DB connections to stall and the replication to fall behind."

What I Did (Validate & Isolate):

"To prove it, I SSH'd into an affected instance and ran `lsof | grep '(deleted)'`. The output was a flood of open file descriptors pointing to massive, deleted temporary files, all held by our Python application process. A quick look at the commit for the new feature revealed the culprit: a developer was using a temp file library but wasn't closing the file stream in a `finally` block. If any exception occurred during processing, the file handle was leaked. Under production load, exceptions were happening, and the disk filled with orphaned files."
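The class of bug described above, a temp-file handle leaked on the exception path, and its fix can be sketched in Python. The function names and the `NamedTemporaryFile` usage are illustrative assumptions; the story doesn't show the actual code.

```python
import os
import tempfile

def enrich_risk_score_buggy(payload, process):
    # Buggy shape: cleanup only runs on the happy path. If process()
    # raises, the descriptor leaks; even after the file is unlinked,
    # the kernel keeps its blocks allocated until the fd closes --
    # exactly what `lsof | grep '(deleted)'` surfaces.
    tmp = tempfile.NamedTemporaryFile(mode="w+b", delete=False)
    tmp.write(payload)
    result = process(tmp)   # an exception here skips the cleanup below
    tmp.close()
    os.unlink(tmp.name)
    return result

def enrich_risk_score_fixed(payload, process):
    # Fixed shape: try/finally (or a context manager) guarantees the
    # handle is closed and the file removed on every exit path.
    tmp = tempfile.NamedTemporaryFile(mode="w+b", delete=False)
    try:
        tmp.write(payload)
        return process(tmp)
    finally:
        tmp.close()
        os.unlink(tmp.name)
```

The one-line difference is the whole incident: under load, a small exception rate turned into thousands of orphaned files per hour.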

The Outcome (Remediate & Learn):

"The immediate fix was a one-line hotfix to wrap the file operation in a `try...finally` block. But that's not the end of the story. The real value came from the post-mortem. We implemented three systemic improvements:
1. Better Alerting: We lowered our disk space alert threshold from 90% to 70%, with a 'rate of change' alert to catch rapid increases.
2. Better Observability: We added monitoring for open file descriptors on all critical services.
3. Better Guardrails: We added a static analysis linter to our CI pipeline to automatically detect unclosed resources in new code."
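The "rate of change" alert from improvement #1 can be sketched as a simple evaluation over disk-usage samples. The thresholds and rule names here are illustrative defaults, not values from the incident.

```python
def disk_alerts(samples, static_pct=70.0, rate_pct_per_min=0.5):
    """Evaluate disk-usage samples [(minute, used_pct), ...] against
    two rules: a static threshold, and a rate-of-change threshold that
    catches a fast leak long before the static line is crossed."""
    alerts = []
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        rate = (p1 - p0) / (t1 - t0)  # percentage points per minute
        if rate > rate_pct_per_min:
            alerts.append(("rate", t1, rate))
        if p1 >= static_pct:
            alerts.append(("static", t1, p1))
    return alerts


# Disk climbing at 1%/min from only 40% full: the rate rule fires
# immediately, ~30 minutes before the static rule ever would.
print(disk_alerts([(0, 40.0), (1, 41.0), (2, 42.0)]))
```

The design point: the war story's disk was climbing "1% per minute" from a healthy absolute level, so a static threshold alone would have stayed silent through most of the outage.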

🎯 The Memorable Hook

This perspective shifts the focus from a frantic, technical checklist to a methodical, narrative-driven investigation. It shows you think in terms of cause and effect, which is the hallmark of a senior engineer.

💭 Inevitable Follow-ups

Q: "What if disabling the feature flag hadn't worked? What would be your next step?"

Be ready: "My next step would be to isolate a single node. I'd take one VM out of the load balancer's rotation and stop traffic to it. This creates a safe environment for more invasive debugging—like running `strace` or attaching a profiler—without risking the entire service. It contains the blast radius of my investigation."

Q: "How could you have prevented this from ever reaching production?"

Be ready: "This failure highlights a gap in our pre-production testing. Our staging environment didn't have a realistic load profile. A proper stress test that mimicked Monday morning traffic spikes against this specific feature path would have exposed the resource leak long before it saw a single customer. It's a lesson in making your testing environment a true reflection of production reality."

🔄 Adapt This Framework

If you're junior: Focus on one piece of the puzzle. "I haven't troubleshot a full cascading failure, but I have debugged a disk-full issue. I would start there, using tools like `df`, `du`, and `lsof` to find exactly what's consuming space and which process is responsible. I'd then pass that specific finding to the senior engineer leading the incident."

If you're senior: Expand your scope to communication and delegation. "As the incident lead, my first step is creating a dedicated Slack channel, starting a Zoom bridge, and establishing a single source of truth document. I would delegate investigation streams: one person on the application logs, one on the database, one on infrastructure metrics. My job is to synthesize their findings to build the hypothesis, not to run every command myself."

If you lack this specific experience: Use an analogy from a smaller-scale project. "I haven't seen this in a large distributed system, but the principle is the same one I used when my personal web server kept crashing. A WordPress plugin was generating uncapped cache files, filling the disk. I used the same logic: check the timeline (when did I update the plugin?), find the epicenter (disk filled with cache files), form a hypothesis, and remediate. The core diagnostic process is scale-invariant."