Beyond the Acronyms: How SLAs, SLOs, and SLIs Create Engineering Freedom
Q: Explain the difference between SLA, SLO, and SLI. How would you use them to design an alerting strategy that isn't noisy, incorporating the concept of an error budget? Mention some tools you might use.
Why this matters: This question separates engineers who just build features from those who build reliable products. It tests your ability to connect business promises to technical reality. Your answer reveals whether you view reliability as a burden or as a managed risk that enables speed and innovation.
Interview frequency: Core knowledge for SRE, DevOps, and senior backend roles. Extremely common.
❌ The Death Trap
The common mistake is to give dry, academic definitions learned from a textbook. Candidates recite the acronyms correctly but fail to connect them to a coherent philosophy of running a service.
"Most people say: 'SLI is a Service Level Indicator, a metric. SLO is a Service Level Objective, a target for that metric. And an SLA is a Service Level Agreement, a contract. For alerting, I'd set an alert if CPU is over 80%. I'd use Prometheus and Grafana.'"
This answer is technically correct but strategically empty. It shows you can memorize, not that you can think. Alerting on raw CPU is a classic sign of a noisy, ineffective strategy that leads to alert fatigue.
🔄 The Reframe
What they're really asking: "How do you create a shared language of reliability between business, product, and engineering that allows you to ship features confidently without burning out your team?"
This reveals your ability to think in terms of systems and incentives. It shows you understand that reliability is a feature that competes for resources, and you have a rational framework for making trade-offs.
🧠 The Mental Model
I think of it as a pyramid of promises, with reality at the base. Let's use an analogy: a pizza delivery service. The SLI is what you actually measure — the delivery time of every order. The SLO is your internal target — "95% of pizzas delivered within 30 minutes." The SLA is the customer-facing contract with consequences — "if your pizza takes more than 45 minutes, it's free." Reality (the SLI) sits at the base, your internal goal (the SLO) sits above it and is deliberately stricter than the contract, and the external promise (the SLA) sits at the top, where breaking it costs money.
📖 The War Story
Situation: "At a previous company, I worked on a critical 'Image Upload' service for our main social media application. This service was used by millions of users every hour."
Challenge: "The engineering team was paralyzed by fear. We had constant, noisy alerts firing for high CPU, memory pressure, etc. The product team wanted to ship new features like video processing, but engineering was afraid of breaking things. There was no agreement on what 'reliable enough' meant."
Stakes: "Developer velocity was near zero, and the on-call engineers were burning out from alert fatigue. We couldn't innovate, and our existing service was becoming a source of stress, not pride."
✅ The Answer
My Thinking Process:
"The core problem wasn't technical; it was a lack of shared language. We needed a system to make rational, data-driven decisions about risk. I proposed we adopt an SLO-driven approach."
What I Did:
1. Defined a User-Centric SLI: "First, I argued we should stop measuring machine health (like CPU) and start measuring user happiness. We agreed on a simple SLI: the percentage of image uploads that completed successfully (HTTP 201) and took less than 2 seconds. This directly reflected the user experience."
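That SLI can be computed directly from request records. A minimal sketch, assuming each request is recorded as an (HTTP status, latency) pair; the function name and data shape are illustrative, not from a real codebase:

```python
def compute_sli(requests, max_latency=2.0):
    """Fraction of uploads that succeeded (HTTP 201) within the latency target."""
    if not requests:
        return 1.0  # no traffic: treat the objective as met
    good = sum(1 for status, latency in requests
               if status == 201 and latency < max_latency)
    return good / len(requests)

# Four uploads: two good, one too slow, one failed outright.
sample = [(201, 0.4), (201, 1.9), (201, 2.5), (500, 0.3)]
print(compute_sli(sample))  # 0.5
```

Note that "good" requires *both* conditions: a fast error and a slow success are equally bad from the user's perspective.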
2. Negotiated a Realistic SLO: "I got product, business, and engineering in a room. I asked, 'Is 100% reliability our goal?' They said yes. I said, 'Great, that means we can never deploy new code again.' That reframing opened a real discussion. We settled on an SLO of 99.9% of uploads succeeding in under 2 seconds, measured over a rolling 28-day window. Everyone agreed that 1 in 1,000 uploads failing was an acceptable trade-off for being able to innovate."
3. Introduced the Error Budget: "This was the key that unlocked everything. I framed the 0.1% of acceptable failures as our 'Error Budget.' If we handled 100 million uploads in 28 days, our budget was 100,000 allowed failures. This budget became our currency for risk. Want to deploy a risky new feature? Let's spend some budget. Did a bad deploy cause a spike in failures? We've spent budget and need to freeze deploys and focus on reliability."
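The budget arithmetic is trivial, which is part of its power: anyone on the team can do it. A sketch using the numbers from the story (99.9% target, 100 million uploads per 28-day window):

```python
slo_target = 0.999
window_requests = 100_000_000

# Allowed failures in the window: everything the SLO does not require to succeed.
error_budget = round((1 - slo_target) * window_requests)
print(error_budget)  # 100000

# After a bad deploy burns 40,000 failures, 60% of the budget remains.
failures_so_far = 40_000
budget_remaining = 1 - failures_so_far / error_budget
print(budget_remaining)  # 0.6
```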
4. Designed SLO-Based Alerts: "With the error budget, we could finally create intelligent alerts:"
- The Old, Noisy Way: Alert if CPU > 80% for 5 minutes. (Wakes you up at 3 AM for a temporary, harmless spike).
- The New, Smart Way: We set up alerts based on the *burn rate* of our error budget.
- Low Urgency (Slack message): 'Warning: We have consumed 2% of our 28-day error budget in the last 6 hours. At this rate, we will exhaust the budget in about 12 days.' This is an early warning, not a crisis.
- High Urgency (PagerDuty): 'Critical: We have consumed 10% of our 28-day error budget in the last hour. At this rate, we will exhaust the budget in less than 24 hours.' This is the *only* alert that wakes someone up, because it signals sustained, user-impacting failure that threatens our SLO.
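The logic behind those two alerts can be sketched in a few lines. "Burn rate" here means the multiple of the sustainable consumption rate (1.0 = spending the budget exactly evenly across the window); the thresholds are the illustrative ones from the text, not universal constants:

```python
WINDOW_HOURS = 28 * 24  # 672-hour rolling window

def burn_rate(budget_fraction_consumed, over_hours):
    """Multiple of the 'even spend' rate. 1.0 means exactly on pace."""
    even_rate = over_hours / WINDOW_HOURS
    return budget_fraction_consumed / even_rate

def alert_level(budget_fraction_consumed, over_hours):
    """Decide urgency from budget consumed over a lookback window."""
    if over_hours <= 1 and budget_fraction_consumed >= 0.10:
        return "page"    # high urgency: wakes someone up
    if over_hours <= 6 and budget_fraction_consumed >= 0.02:
        return "ticket"  # low urgency: Slack message
    return "none"

print(burn_rate(0.02, 6))    # 2.24 -- burning 2.24x faster than sustainable
print(alert_level(0.10, 1))  # page
```

Using two lookback windows like this is a simplified form of the multi-window, multi-burn-rate approach: short windows catch fast outages, long windows catch slow leaks, and neither fires on a brief, self-healing blip.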
5. Implemented with Tooling: "Conceptually, we implemented this using a standard observability stack:
- Prometheus: Our service exposed a /metrics endpoint with counters for successful and failed uploads. Prometheus scraped this data continuously.
- Grafana: We built a single, prominent dashboard showing our current SLO attainment, the percentage of error budget remaining, and the burn rate. This became the team's heartbeat.
- (If using AWS): You could achieve the same with CloudWatch Metrics from application logs or an ALB, and create CloudWatch Alarms based on metric math that calculates your SLI and triggers SNS notifications."
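For a feel of what that /metrics endpoint exposes, here is a sketch of the Prometheus text exposition format. The metric name and labels are hypothetical, and a real service would use a client library (e.g. prometheus_client) to maintain the counters rather than formatting the text by hand:

```python
def render_metrics(success_count, failure_count):
    """Render upload counters in the Prometheus text exposition format."""
    return (
        "# HELP image_uploads_total Total image upload attempts.\n"
        "# TYPE image_uploads_total counter\n"
        f'image_uploads_total{{result="success"}} {success_count}\n'
        f'image_uploads_total{{result="failure"}} {failure_count}\n'
    )

print(render_metrics(99_940, 60))
```

From counters like these, the SLI is just a ratio of rates: successful uploads divided by total uploads over the window, which is exactly what the Grafana dashboard would plot.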
The Outcome:
"The culture changed completely. Alert fatigue vanished. Product and engineering had a shared, quantitative language for discussing risk. We started shipping features again, and when our error budget got low, everyone on the team understood it was time to prioritize reliability work. We had turned reliability from a source of fear into a manageable engineering problem."
What I Learned:
"I learned that reliability isn't about preventing all failures—it's about agreeing on an acceptable number of them. An error budget isn't just a technical tool; it's a social contract that aligns the entire team and gives you the freedom to build."
🎯 The Memorable Hook
"Noisy alerts are a tax on your team's attention, one of its most valuable and finite resources. An error budget is how you stop paying that tax and invest that attention back into your product."
This reframes the problem from technical to economic. It shows you think about second-order effects like team productivity and focus, which is a hallmark of a senior-level mindset.
💭 Inevitable Follow-ups
Q: "How do you choose a good SLI? What if availability isn't the only thing that matters?"
Be ready: A good SLI is user-centric, easy to understand, and reliably measurable. You can have multiple SLIs for different aspects of your service, like availability, latency, and data correctness. For example, a video streaming service might have an SLI for 'playback start time' and another for 'rebuffering percentage'.
Q: "What happens if you're about to breach your SLA?"
Be ready: An SLA breach should be a rare, all-hands-on-deck emergency. Your SLO is your guardrail to prevent this. If you are about to breach your *SLO*, policy might dictate a code freeze. If you are about to breach your *SLA*, you might consider extreme measures like failing over to another region, disabling non-critical features, or even proactively communicating with customers.
🔄 Adapt This Framework
If you're junior: You likely didn't lead this initiative. Frame it as "I was on a team that adopted SLOs..." and talk about your role: "I was responsible for instrumenting the code to expose the SLI metrics" or "I helped build the Grafana dashboard that tracked our error budget." Show you understood the 'why' behind your tasks.
If you're senior: Emphasize the cross-functional negotiation and the strategic impact. "The biggest challenge wasn't technical; it was getting buy-in from product..." Talk about how this initiative improved planning, reduced friction, and increased developer velocity.
If you lack this experience: Talk about it hypothetically but ground it in your past work. "In my previous role, we didn't have formal SLOs, but we constantly struggled with alert noise from our payment processor. If I were to design a system for that today, I would start by defining an SLI for successful payment transactions..." This shows you can apply the concepts, even without direct experience.
