The Reliability Dial: How to Answer 'How Did You Improve System Reliability?'

Level: Senior/Staff Engineer

Asked at: Google, Netflix, Uber, startups at scale

Q: "Tell me about a time you significantly improved the reliability of a production system."

Why this matters: This isn't a question about bug-fixing. It's a test of your business acumen. They want to know if you can translate the fuzzy concept of "reliability" into a precise, data-driven conversation about trade-offs between innovation and stability. They're hiring a business owner, not just a system maintainer.

Interview frequency: Extremely high for any role that touches production.

❌ The Death Trap

The average engineer describes fixing a bug or adding more servers. That is table stakes. The real trap is failing to show a systematic, repeatable way of thinking: the answer covers the "what" but completely misses the "how" and "why." It describes an event, not a process.

"Most people say: 'There was a service that kept crashing. I found a memory leak and patched it. The service became more stable.'"

🔄 The Reframe

What they're really asking: "How do you make reliability a rational, economic decision? Show me you can build a system that allows business, product, and engineering to have an objective conversation about risk."

This reveals: Your ability to quantify user pain, negotiate with product on what's "good enough," and create a self-regulating system that balances feature velocity with stability. It separates engineers who react to problems from those who design systems that prevent them.

🧠 The Mental Model

This isn't about one-off fixes. It's about installing an operating system for reliability. Use the SRE-inspired **"Promise, Budget, Policy"** framework. It turns emotional debates into data-driven decisions.

1. Define Reality (The SLI): Quantify the user experience.
2. Make a Promise (The SLO): Define "good enough" with product.
3. Create a Budget (The Error Budget): Calculate your allowance for failure.
4. Enforce a Policy: Use the budget to automate the build-vs-fix decision.

📖 The War Story

Situation: "At my last e-commerce company, we had a critical image processing service for new product uploads. It was the gateway for all new inventory to appear on the site."

Challenge: "The service was 'flaky.' Some uploads failed, some were slow, but we had no numbers. The product team wanted new features like video support, while the on-call engineers were burning out from constant alerts. The two teams were in a cold war, fueled by anecdotes."

Stakes: "Merchant satisfaction was plummeting because their products weren't showing up. The business couldn't quantify the impact, and engineering couldn't justify prioritizing reliability work over features. We were paralyzed by a lack of data."

✅ The Answer

My Thinking Process:

"The core problem wasn't technical; it was a lack of a shared language. To solve the conflict, I needed to replace opinions with math. I decided to introduce the 'Promise, Budget, Policy' framework to make the invisible user pain visible and actionable."

What I Did:

"First, I **Defined Reality**. I instrumented the service to create an SLI (Service Level Indicator): the percentage of image uploads that succeeded in under 30 seconds. We finally had a real number that reflected the user experience.

Second, I **Made a Promise**. I went to the product manager and asked, 'What percentage of successful uploads is good enough to make our merchants happy?' We looked at support tickets and user feedback and landed on 99.5%. This became our SLO (Service Level Objective). This wasn't a technical goal; it was a product promise.

Third, I **Created a Budget**. Our SLO of 99.5% meant we had a 0.5% Error Budget. For every 10,000 uploads, we could afford 50 that failed or took 30 seconds or longer. This budget was our currency for taking risks.

Finally, I **Enforced a Policy**. I built a dashboard visible to everyone showing our SLO compliance and our remaining error budget for the month. The new rule was simple and automatic: If more than 50% of the month's error budget remains, product can ship features. If it drops to 50% or below, a feature freeze is triggered and all engineering effort goes to reliability. This wasn't my decision; it was the policy we all agreed on."
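
The four steps in this answer can be sketched in a few lines of Python. This is a minimal illustration, not the service's actual code: the `Upload` record, the function names, and the choice to measure the budget over all uploads seen so far in the month are assumptions made for the example. Only the 30-second limit, the 99.5% target, and the 50% freeze threshold come from the story.

```python
from dataclasses import dataclass

# Values taken from the story above.
SLO_TARGET = 0.995        # promise: 99.5% of uploads are "good"
LATENCY_LIMIT_S = 30.0    # a good upload finishes in under 30 seconds
FREEZE_THRESHOLD = 0.50   # freeze features unless more than half the budget remains


@dataclass
class Upload:
    """Hypothetical record of one image upload attempt."""
    succeeded: bool
    duration_s: float


def is_good(upload: Upload) -> bool:
    """SLI event definition: the upload succeeded and was fast enough."""
    return upload.succeeded and upload.duration_s < LATENCY_LIMIT_S


def sli(uploads: list[Upload]) -> float:
    """Fraction of good uploads; an empty window counts as fully healthy."""
    if not uploads:
        return 1.0
    return sum(is_good(u) for u in uploads) / len(uploads)


def budget_remaining(uploads: list[Upload]) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    if not uploads:
        return 1.0
    allowed_bad = (1 - SLO_TARGET) * len(uploads)
    actual_bad = sum(not is_good(u) for u in uploads)
    return 1 - actual_bad / allowed_bad


def can_ship_features(uploads: list[Upload]) -> bool:
    """Policy: ship only while more than half of the budget remains."""
    return budget_remaining(uploads) > FREEZE_THRESHOLD


if __name__ == "__main__":
    # Synthetic month: 10,000 uploads, 10 outright failures, 10 too slow.
    month = (
        [Upload(True, 4.0)] * 9_980
        + [Upload(False, 2.0)] * 10
        + [Upload(True, 45.0)] * 10
    )
    print(f"SLI: {sli(month):.2%}")
    print(f"Budget remaining: {budget_remaining(month):.0%}")
    print("Ship features" if can_ship_features(month) else "Feature freeze")
```

In this synthetic month, 20 of the 50 allowed bad uploads are spent, so roughly 60% of the budget remains and features can ship. At 30 bad uploads the same check would trigger the freeze.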

The Outcome:

"The transformation was immediate. The arguments stopped. The conversation shifted from 'Is it reliable enough?' to 'Do we have the budget to ship this?' Within a month, we were consistently meeting our 99.5% SLO. On-call pages dropped by over 80%. Six weeks later, the product team launched their video support feature with confidence, because they knew we had the error budget to cover the risk of a new deployment."

What I Learned:

"Reliability isn't a feature you build; it's a data-driven contract you make with your users. An error budget is the mechanism that turns that contract into a productive, unemotional negotiation between innovation and stability. It aligns the entire company around the only thing that matters: the user's experience."

🎯 The Memorable Hook

This elevates a technical concept into a business and product philosophy. It shows you understand that engineering decisions are economic decisions.

💭 Inevitable Follow-ups

Q: "How did you pick the right SLI? Couldn't you have just picked availability?"

Be ready: "Availability is a poor proxy for user happiness. A service can be 'up' but so slow it's unusable. I chose latency and success rate of the core user journey because that's what our merchants actually cared about."
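
A toy comparison makes the gap concrete. The numbers below are invented for illustration (a hypothetical window in which every health check passes while half the uploads take 90 seconds); they are not measurements from the story.

```python
LATENCY_LIMIT_S = 30.0  # same 30-second limit used in the SLI

# Hypothetical observations over the same window.
health_checks = [True] * 100                         # every probe got a response
uploads = [(True, 5.0)] * 50 + [(True, 90.0)] * 50   # (succeeded, duration in seconds)

# Availability only asks "did the service answer?"
availability = sum(health_checks) / len(health_checks)

# The SLI asks "did the merchant get a successful upload fast enough?"
good = sum(ok and secs < LATENCY_LIMIT_S for ok, secs in uploads)
user_sli = good / len(uploads)

print(f"availability: {availability:.0%}, SLI: {user_sli:.0%}")
```

The service looks perfect by availability while half of the merchants in this made-up window have a bad experience, which is the argument for measuring the user journey directly.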

Q: "What if the product manager had pushed for a 99.99% SLO?"

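Be ready: "I would have framed it as a cost-benefit analysis. Moving from 99.5% to 99.99% shrinks the error budget from 50 bad uploads per 10,000 to just 1. I'd ask, 'Are we willing to delay the next three features to gain that extra 0.49 percentage points of reliability?' An error budget makes trade-offs explicit."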
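
The arithmetic behind that trade-off is short enough to show in the interview. A small sketch, where `allowed_failures` is a hypothetical helper and 10,000 uploads is the same illustrative volume used in the answer:

```python
from fractions import Fraction


def allowed_failures(slo_percent: str, total_uploads: int) -> Fraction:
    """Error budget expressed as a count of uploads allowed to miss the SLO."""
    # Fractions keep the percentages exact and avoid floating-point noise.
    return (100 - Fraction(slo_percent)) / 100 * total_uploads


current = allowed_failures("99.5", 10_000)    # 50 uploads
proposed = allowed_failures("99.99", 10_000)  # 1 upload

print(f"99.5% SLO:  {current} bad uploads allowed per 10,000")
print(f"99.99% SLO: {proposed} bad upload allowed per 10,000")
print(f"The tighter SLO leaves {current / proposed}x less room for risk")
```

Framed this way, the question for product is no longer whether 99.99% sounds better, but whether a budget 50 times smaller is worth the features it would delay.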

🔄 Adapt This Framework

If you're junior: Focus on step 1. Talk about a time you added the critical instrumentation (the SLI) that allowed the team to see a problem clearly for the first time. Show that you think in terms of measurement.

If you're senior: The story above is ideal. Emphasize the cross-functional negotiation, the business impact, and how you turned this success into a template for other teams at the company.

If you lack this experience: Analyze a past outage using this framework. "We had a major outage on our checkout service. In retrospect, it happened because we had no shared definition of reliability. If I were to tackle it today, I would start by defining an SLI for checkout success..."

Written by Benito J D