The Silent Crash: When Traffic Drops But Your Alerts Don't
Q: You deploy on Friday afternoon and traffic drops 30% with no error alerts. Your CEO is asking questions. What’s your investigation process?
Why this matters: This is a senior-level crucible. It tests your entire stack as an engineer: technical depth, communication under pressure, and your ability to reason about a system that is failing *silently*. The "no alerts" part is the key; it probes whether your observability goes deeper than 5xx error rates.
Interview frequency: High. A classic behavioral scenario for SRE and senior backend roles.
❌ The Death Trap
The candidate jumps into a frantic, unstructured debugging session. They list random commands they would run without a clear, prioritized framework.
"I'd immediately SSH into the boxes and check the logs. I'd look at `top` to see if CPU is high. Maybe it's a database issue, so I'd check the connection pool. I'd probably also look at the network traffic..."
This is the panic response. It shows a lack of a systematic process and, critically, it ignores the most important first step: stopping the bleeding.
🔄 The Reframe
What they're really asking: "Do you have a repeatable, iron-clad incident response framework? Can you differentiate between stabilizing a system and diagnosing it? And how do you diagnose a failure that your monitoring system was blind to?"
This reframes the question from a technical pop quiz to a test of your leadership and systemic thinking during a crisis. It's about demonstrating poise, process, and a deep understanding of what *truly* matters: Mean Time To Recovery (MTTR).
🧠 The Mental Model
The "Silent Heart Attack" model. The system looks healthy on the surface (no errors, normal vitals), but a critical business function is failing. Your job is to be the ER doctor: stabilize the patient first, diagnose second.
📖 The War Story
Situation: "I was leading a team for a major e-commerce platform. It was a Friday afternoon, and we'd just rolled out a seemingly minor update to our product recommendation engine."
Challenge: "Fifteen minutes post-deploy, our primary business dashboard lit up like a Christmas tree. Revenue per minute was down 30%. Critically, our technical dashboards were all green. API error rates were zero, CPU and memory were nominal. The system was, by all technical measures, perfectly healthy, yet it was failing disastrously."
Stakes: "We were in the middle of a sales event. The traffic drop translated to a six-figure revenue loss for every hour the problem persisted. The CEO was in the war room chat. The pressure was immense."
✅ The Answer
My Thinking Process:
"My first thought was, 'The absence of alerts is not the presence of health.' The business metric is the ultimate source of truth. My entire focus was on one thing: restoring the business function as fast as humanly possible. The root cause analysis could wait; the revenue could not."
What I Did: Executing the Framework
Phase 1: Stabilize (The First 5 Minutes):
- I declared a SEV-1 incident and established an incident commander. My first message to the CEO and stakeholders was: "We see the impact. We are rolling back the last change now to restore service. Stand by for updates."
- I executed an immediate rollback of the recommendation engine deployment. Rolling back the most recent change is the highest-leverage action in an incident like this.
- Within 7 minutes of the initial alert, the rollback was complete. We watched the business dashboard, and revenue per minute snapped back to normal. The bleeding was stopped.
Phase 2: Diagnose (The Next 30 Minutes):
- With the site stable, we now had the breathing room to understand the 'why'. We re-deployed the faulty code to a staging environment that mirrored production traffic.
- The absence of alerts was the key clue. I hypothesized that the API wasn't erroring; it was returning syntactically valid but semantically wrong data. We inspected the payloads. Sure enough, the recommendation API was returning a `200 OK` status, but with an empty `[]` list of products for a large segment of our users.
- The logic bug was subtle: a new database query in the engine failed to handle users with no prior purchase history, but instead of throwing an exception, it gracefully returned an empty set. The product page, expecting a list of recommendations, simply rendered an empty component. No crash, no error, just... nothing.
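The failure mode above can be sketched in a few lines. This is a hypothetical reconstruction (the function and field names are illustrative, not from the real system): the engine quietly returns an empty set for users with no purchase history, and the handler still reports success.

```python
# Hypothetical sketch of the silent failure: a query path that returns
# an empty result instead of raising for users it can't handle.

def recommendations_for(user_purchases: list[str]) -> list[str]:
    """New engine logic: recommendations seeded by purchase history."""
    if not user_purchases:
        # Bug: no purchase history -> no seed -> empty result.
        # No exception is raised, so error-rate monitoring sees nothing.
        return []
    return [f"related-to-{item}" for item in user_purchases]

def handle_request(user_purchases: list[str]) -> tuple[int, list[str]]:
    # The API handler wraps the engine and returns 200 OK either way.
    return 200, recommendations_for(user_purchases)

status, body = handle_request([])  # new user: status 200, empty body
```

Every dashboard built on status codes stays green, while the product page renders an empty component for every new user.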
Phase 3: Cure & Vaccinate (The Postmortem):
- The Cure: The code fix was simple: add a fallback to show 'most popular products' if the new recommendation logic returned an empty set.
- The Vaccine: This was the most important part. We identified two systemic failures:
  - First, our monitoring was blind to business logic failures. We created a new, high-priority alert: `"Recommendation_API_Zero_Result_Rate > 5%"`, turning the blind spot into a monitored signal.
  - Second, our deployment process was flawed. We enhanced our canary analysis pipeline to monitor not only technical metrics (CPU, errors) but also key business metrics. Now, a canary deploy that causes a drop in 'add-to-cart' events is automatically rolled back before it ever reaches 100% of traffic.
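Both vaccines can be sketched in a few lines. These are illustrative sketches, not the real pipeline: the class, function names, and thresholds are assumptions, chosen to match the 5% zero-result threshold mentioned above.

```python
from collections import deque

# 1. Business-logic alert (hypothetical): track the fraction of empty
#    recommendation responses over a sliding window, and fire when it
#    crosses the 5% threshold from the postmortem.
class ZeroResultMonitor:
    def __init__(self, window: int = 1000, threshold: float = 0.05):
        self.samples = deque(maxlen=window)  # True = empty response
        self.threshold = threshold

    def record(self, items: list) -> None:
        self.samples.append(len(items) == 0)

    def should_alert(self) -> bool:
        if not self.samples:
            return False
        return sum(self.samples) / len(self.samples) > self.threshold

# 2. Canary analysis on a business metric (hypothetical): roll back if
#    the canary's add-to-cart rate drops meaningfully below baseline.
def canary_should_rollback(baseline_rate: float, canary_rate: float,
                           max_relative_drop: float = 0.10) -> bool:
    if baseline_rate == 0:
        return False  # no baseline signal to compare against
    return (baseline_rate - canary_rate) / baseline_rate > max_relative_drop
```

The point of both checks is the same: they watch what the system *produces*, not just whether it crashes.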
What I Learned:
"This incident taught me that SRE isn't just about system health; it's about business health. A system that is 'up' but not creating value is just as broken as one that is down. It solidified my belief that the most important monitors are often the ones closest to the money, and that the goal of incident response isn't just to fix the problem, but to evolve the system so that entire classes of problems can never happen again."
🎯 The Memorable Hook
"An error alert tells you your system is broken. A business metric alert tells you your system is useless. A senior engineer knows the difference and builds systems to detect both."
This draws a sharp, insightful distinction that demonstrates a deep, business-oriented understanding of reliability.
💭 Inevitable Follow-ups
Q: "What if the rollback didn't work?"
Be ready: "That would be my immediate signal that my initial hypothesis ('the deploy caused it') was wrong. The next step in my framework would be to investigate external dependencies. Is a payment provider having an issue? A CDN? A third-party inventory API? A rollback failing immediately widens the scope of the investigation from 'us' to 'our entire ecosystem'."
Q: "How do you justify a 'no Friday deploys' policy? Isn't that a sign of a lack of confidence in your systems?"
Be ready: "It's a pragmatic trade-off. While a mature organization should be able to deploy confidently at any time, a 'no Friday deploys' rule is a cheap, effective insurance policy. It recognizes the human element—that incident response is harder when your team is not at full strength over the weekend. The long-term goal is to make this policy obsolete by building such a robust canary and rollback system that the risk of a Friday deploy becomes negligible. But until then, it's about managing risk intelligently."
