SLA, SLO, SLI: The Language of Promises, Goals, and Reality

Senior Engineer · Asked at: Google, FAANG, Stripe

Q: Can you explain the relationship between Service Level Agreements (SLAs), Objectives (SLOs), and Indicators (SLIs)?

Why this matters: This is the foundational question of Site Reliability Engineering. The interviewer is testing if you can connect engineering work to business promises. Your answer reveals whether you think about reliability as a vague ideal or as a quantifiable product feature that must be engineered and budgeted for.

Interview frequency: Guaranteed in any SRE interview. Very high in senior backend and platform roles.

❌ The Death Trap

The candidate gives dry, academic definitions that they've memorized from the Google SRE book without any real-world context or strategic insight.

"An SLI is a metric, like availability. An SLO is a target for that metric, like 99.9%. And an SLA is a contract with a customer that says what will happen if you miss the SLO."

This is correct but shallow. It shows you can define the terms, but not that you understand how they form a powerful system for making rational business and engineering trade-offs.

🔄 The Reframe

What they're really asking: "How do you create a shared language between engineering, product, and the business to make intelligent decisions about risk and velocity? How do you quantify reliability so it's no longer an emotional debate but a data-driven one?"

This elevates the conversation from definitions to strategy. It's about building a common model of reality that allows the entire organization to move in a coherent direction.

🧠 The Mental Model

The "Personal Finance" model. Managing reliability is like managing your money. It's a system of promises, goals, and cold, hard facts.

1. The SLA is a Loan Agreement. It's a legally binding contract you make with an external party (the customer, the bank). If you fail to meet its terms (miss a payment), there are severe, real-world consequences (financial penalties, loss of trust). You absolutely cannot breach it.
2. The SLO is your Personal Budget. It's a strict, internal goal you set for yourself. You budget to save more money than your loan payment requires. This gives you a safety buffer. Breaching your budget is a serious warning sign that requires you to change your behavior (e.g., code freeze), but it doesn't immediately trigger external penalties.
3. The SLI is your Bank Statement. It's the raw, objective truth. It's the data that tells you, at any given moment, how you are performing against your budget (SLO) and your loan agreement (SLA). It's the feedback from reality.
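
As a rough illustration of how the three terms relate, the sketch below compares a measured SLI against an internal SLO and an external SLA. The function name and the sample thresholds are hypothetical, chosen only for illustration. The one idea it takes from the model above is that the SLO is stricter than the SLA, which leaves a buffer between the two.

```python
def reliability_status(sli: float, slo: float, sla: float) -> str:
    """Classify a measured SLI (bank statement) against the SLO (budget) and SLA (loan)."""
    if slo < sla:
        # The internal goal must be at least as strict as the external promise.
        raise ValueError("SLO should be at least as strict as the SLA")
    if sli < sla:
        return "SLA breached: external penalties apply"
    if sli < slo:
        return "SLO missed: internal warning, change behavior"
    return "within SLO"


# Hypothetical numbers: SLO 99.9%, SLA 99.5%.
print(reliability_status(99.95, slo=99.9, sla=99.5))  # within SLO
print(reliability_status(99.7, slo=99.9, sla=99.5))   # SLO missed: internal warning, change behavior
print(reliability_status(99.2, slo=99.9, sla=99.5))   # SLA breached: external penalties apply
```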

📖 The War Story

Situation: "I joined a B2B SaaS company where the engineering and sales teams were in a state of low-grade conflict. We were a reliability-driven business."

Challenge: "The sales team was selling aggressive 99.95% uptime SLAs to close big enterprise deals. Meanwhile, the engineering team was 'flying blind.' We had no real-time data on our actual reliability and were constantly terrified of breaching these promises we had no say in. 'Reliability' was an emotional argument, not a data-driven conversation."

Stakes: "We were frequently paying out SLA penalties, which directly hit our margins. Worse, the engineering team was in a constant state of firefighting and fear, which stifled innovation. We couldn't take risks because we didn't know how reliable we actually were."

✅ The Answer

My Thinking Process:

"The core problem was that we lacked a shared language and a shared model of reality. My mission was to introduce the SRE framework to turn our emotional debates into rational, economic decisions."

What I Did: Building the Framework

1. First, We Found Reality (The SLI): Before we could set goals or make promises, we had to know where we stood. I led the effort to instrument our application to expose a clear SLI. We defined it as 'the percentage of successful API requests served in under 500ms over a rolling 28-day window.' We built a dashboard that showed this number, our 'bank statement,' in real time. We discovered our actual availability was hovering around 99.8%.

2. Then, We Set Our Own Bar (The SLO): With the data in hand, we could have a real conversation. We set an internal SLO of 99.9% availability. This was our 'personal budget': aspirational but achievable. Crucially, this gave us an 'error budget' of 0.1%. This budget was the share of requests we were *allowed* to fail over the measurement window. It gave us permission to take calculated risks.

3. Finally, We Informed the Promise (The SLA): Armed with our SLI data and our SLO, I went back to the sales and product teams. I could say, with data, 'We are currently operating at 99.8% and are committed to a 99.9% target. We can confidently sign SLAs for 99.8%, but the 99.95% promise is not something we can keep without significant investment.' The conversation changed overnight from 'engineering isn't working hard enough' to 'what is the right reliability target for our business?'
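
The SLI from step 1 can be sketched in a few lines. This is a minimal illustration, not the instrumentation described in the story: the `Request` record, the `sli_percent` function, and the rule that "successful" means an HTTP status below 500 are all assumptions made for the example. Only the 500ms limit and the 28-day window come from the definition above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Request:
    timestamp: datetime
    status: int          # HTTP status code
    latency_ms: float


def sli_percent(requests: list, now: datetime,
                window_days: int = 28,
                latency_limit_ms: float = 500.0) -> Optional[float]:
    """Percentage of requests in the rolling window that succeeded under the latency limit."""
    cutoff = now - timedelta(days=window_days)
    in_window = [r for r in requests if cutoff <= r.timestamp <= now]
    if not in_window:
        return None  # no traffic in the window, so the SLI is undefined
    good = sum(
        1 for r in in_window
        # Assumption: any status below 500 counts as a successful request.
        if r.status < 500 and r.latency_ms < latency_limit_ms
    )
    return 100.0 * good / len(in_window)


now = datetime(2024, 1, 29)
sample = [
    Request(now - timedelta(days=1), 200, 120.0),   # good
    Request(now - timedelta(days=2), 200, 80.0),    # good
    Request(now - timedelta(days=3), 201, 300.0),   # good
    Request(now - timedelta(days=4), 200, 900.0),   # too slow
    Request(now - timedelta(days=5), 503, 50.0),    # server error
    Request(now - timedelta(days=40), 500, 50.0),   # outside the 28-day window
]
print(sli_percent(sample, now))  # 60.0
```

A production system would more likely compute this from aggregated counters than from raw request records, but the ratio of good events to valid events is the same.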

The Outcome:

"The impact was profound. Sales started selling realistic SLAs based on our proven capabilities. Our SLA penalty payouts dropped to zero in the next quarter. Most importantly, engineering was empowered. We used our error budget to ship features faster. If the budget was healthy, we could approve risky deploys. If it was running low, we'd freeze features and focus on reliability work. It turned reliability into a simple, quantitative dial that the entire business could understand."

What I Learned:

"I learned that reliability isn't an absolute; it's a product feature with a cost. 100% reliability is infinitely expensive and not what our customers were actually asking for. The SLA/SLO/SLI framework is the tool that allows you to find the perfect balance between innovation and stability, turning a source of conflict into a source of strategic alignment."

🎯 The Memorable Hook

This makes the concepts distinct, memorable, and connects them to a deeper truth about promises and reality.

💭 Inevitable Follow-ups

Q: "You mentioned an 'error budget.' Can you explain what that is and how you use it to make decisions?"

Be ready: "The error budget is simply 100% minus your SLO. For a 99.9% SLO, our error budget was 0.1% of requests per month. This is the acceptable amount of failure. We treated it like a financial budget. If the budget was full, the product teams were free to 'spend' it on risky feature launches that might cause some instability. If a bad deploy caused us to 'spend' the entire budget, an automatic rule kicked in: a feature freeze for the rest of the month, with all engineering effort focused on reliability. It's a self-correcting system that balances velocity with stability."
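
The arithmetic behind that answer can be sketched as follows. The function name, the traffic figures, and the returned field names are hypothetical. The 99.9% SLO and the rule that an exhausted budget triggers a feature freeze come from the answer above. `Fraction` is used only so that 100 minus 99.9 stays exact.

```python
from fractions import Fraction


def error_budget(total_requests: int, failed_requests: int,
                 slo_percent: str = "99.9") -> dict:
    """Return the allowed failures, what is left, and whether the freeze rule applies."""
    # Error budget = 100% minus the SLO, expressed as a fraction of all requests.
    budget_fraction = (100 - Fraction(slo_percent)) / 100
    allowed_failures = total_requests * budget_fraction
    remaining_failures = allowed_failures - failed_requests
    return {
        "allowed_failures": float(allowed_failures),
        "remaining_failures": float(remaining_failures),
        # Budget fully spent: stop feature work and focus on reliability.
        "feature_freeze": remaining_failures <= 0,
    }


# Hypothetical traffic: 10 million requests in the window, 4,000 of them failed.
print(error_budget(total_requests=10_000_000, failed_requests=4_000))
# {'allowed_failures': 10000.0, 'remaining_failures': 6000.0, 'feature_freeze': False}
```

With these numbers, 40% of the budget is spent, so risky launches are still affordable. A bad deploy that pushed failures past 10,000 would flip `feature_freeze` to `True`.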

Q: "How do you choose a good SLI? What makes a metric a good indicator?"

Be ready: "A good SLI is one that accurately reflects whether users are happy with the service. A simple 'server is up' metric is a bad SLI, because the server can be up while serving 100% errors. A good SLI is user-centric. For our API, we chose a combination of availability (was the request successful?) and latency (was it fast enough to be useful?). It has to be a direct proxy for customer pain."
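
To make the contrast concrete, the sketch below scores the same hypothetical outage with both kinds of metric. The function names and sample data are invented for illustration, and treating a status below 500 as "successful" is an assumption. The 500ms limit is borrowed from the SLI definition in the war story.

```python
def uptime_sli(health_checks: list) -> float:
    """Server-centric: percentage of health probes that got an answer."""
    return 100.0 * sum(health_checks) / len(health_checks)


def request_sli(requests: list, latency_limit_ms: float = 500.0) -> float:
    """User-centric: percentage of requests that succeeded and were fast enough."""
    good = sum(
        1 for status, latency_ms in requests
        if status < 500 and latency_ms < latency_limit_ms
    )
    return 100.0 * good / len(requests)


# Hypothetical outage: the process answers every probe,
# but every user request comes back as a fast HTTP 500.
health_checks = [True] * 10
requests = [(500, 40.0)] * 10

print(uptime_sli(health_checks))  # 100.0
print(request_sli(requests))      # 0.0
```

The uptime metric reports a perfect score while every user is failing, which is why the request-based indicator is the better proxy for customer pain.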

Written by Benito J D