DevOps is a Peace Treaty. SRE is the Constitution.

Senior Engineer Asked at: Google, Netflix, Meta, Stripe

Q: "What's the difference between SRE and DevOps?"

Why this matters: This is a litmus test for strategic thinking. They don't want a dictionary definition. They want to know if you understand the fundamental tension between building new things and keeping existing things running, and how to resolve it with systems, not just meetings.

Interview frequency: Almost guaranteed in any modern backend, platform, or infrastructure role.

❌ The Death Trap

The candidate recites the most common, surface-level take: "DevOps is a culture, and SRE is an implementation of that culture."

"Most people say: 'Well, DevOps is a broad philosophy about breaking down silos between developers and operations. SRE is Google's specific, opinionated way of doing DevOps, treating operations as a software problem.'"

This answer isn't wrong, but it's weak. It's a memorized fact from a blog post. It demonstrates awareness, not deep understanding of the underlying problem.

🔄 The Reframe

What they're really asking: "How do you use data to make objective trade-offs between innovation and reliability? How do you resolve the inherent conflict between the team that wants to go fast and the team that wants to stay safe?"

This reveals if you can think in systems. It shows you understand that you can't solve human incentive problems with slogans. You need a framework with rules, metrics, and consequences.

🧠 The Mental Model

I see it as the "Race Team" model. A business is a race, and you need to go as fast as possible without crashing.

Developers The Driver Their incentive is speed. They want to ship features, accelerate, and win the race. They feel the pressure to go faster.

Traditional Ops The Brakes Their incentive is safety. They want to avoid crashes at all costs. Their default answer to "Can we go faster?" is "No, it's too risky."

DevOps The Peace Treaty The driver and the brake mechanic agree to talk to each other. They promise to share goals and work together. It's a cultural agreement, a handshake.

SRE The Race Engineer with Telemetry SRE installs sensors all over the car (SLOs). They give the driver a "risk budget" for the race (Error Budget). The engineer tells the driver: "You can push as hard as you want in the corners, as long as you don't burn through our tire budget. If you do, you have to pit for new tires (focus on reliability)."

📖 The War Story

Situation: "In my previous role at an e-commerce startup, we had a classic 'wall of confusion'. The product team was pushing hard for new features before the holiday season. The dev team was shipping code multiple times a day."

Challenge: "The operations team, whose bonuses were tied to uptime, became a massive bottleneck. They were terrified of the rapid changes and started demanding weeks of soak time and manual QA for every release. The relationship was toxic. Devs saw Ops as dinosaurs; Ops saw Devs as cowboys."

Stakes: "We were gridlocked. Feature velocity ground to a halt weeks before Black Friday, our most critical sales period. We were going to miss our biggest revenue opportunity of the year because of internal friction."

✅ The Answer

My Thinking Process:

"The problem wasn't bad people; it was bad incentives. They were fighting because they had conflicting goals. We had a 'DevOps philosophy'—we had stand-ups together, we were in the same Slack channels—but we had no shared, objective truth. We needed a constitution, not just a peace treaty."

What I Did:

1. Established an SLO (The Law): I got Product, Dev, and Ops in a room. I didn't ask "What's the ideal uptime?" I asked, "What level of availability is good enough that customers are happy and don't notice minor blips?" We agreed that the checkout service being successful 99.9% of the time was the goal. That became our Service Level Objective (SLO).

2. Calculated the Error Budget (The Currency): A 99.9% SLO means we have 0.1% of the month—about 43 minutes—as a budget for acceptable failure. This 'error budget' became our shared currency for risk. It was no longer an emotional debate; it was math.

3. Enforced the Budget (The Consequence): We made a simple deal. As long as we were within our error budget for the last 28 days, the dev team had a green light to deploy as they wished. The moment we burned through the budget, an automatic policy kicked in: all new feature deployments were frozen. The next development sprint would be 100% focused on reliability improvements until the service was stable again.

The Outcome:

"The change was immediate and profound. The fights stopped. Devs started monitoring the error budget themselves, sometimes holding back a risky release near the end of the month to preserve it. Ops moved from being gatekeepers to being advisors, helping dev teams build more reliable features. We shipped all our key features and sailed through Black Friday with 99.95% availability."

What I Learned:

"DevOps provides the 'why'—collaboration. SRE provides the 'how'—a data-driven framework. An error budget is the most powerful tool I've seen for turning an adversarial relationship into a partnership, because it aligns everyone around a single, measurable goal: spend your risk budget wisely to maximize velocity."

🎯 The Memorable Hook

"DevOps is a handshake. SRE is the legally binding contract that specifies what happens when one side squeezes too hard. It replaces opinions with data."

This framing demonstrates that you understand the pragmatic, enforcement-oriented nature of SRE. It's not just a "nice idea"; it's a system with teeth designed to solve a real business problem.

💭 Inevitable Follow-ups

Q: "How do you convince a product manager to 'spend' development time on reliability?"

Be ready: "You don't. The error budget policy does it for you. It's a pre-negotiated agreement. When the budget is burned, the contract says reliability work is the priority. This removes the emotion and makes it a clear, predictable business process."

Q: "Isn't SRE just a new title for the Ops team?"

Be ready: "Fundamentally, no. SREs are software engineers who solve operational problems with code. They have a mandate that at least 50% of their time should be spent on engineering projects—automation, tooling, performance improvements—not on operational toil. If a team is 100% reactive, it's Ops, not SRE."

🔄 Adapt This Framework

If you're junior: You may not have implemented an SRE program, but you've felt the pain. "I haven't set SLOs myself, but on my last team, we constantly debated with Ops about releases. I see now that a tool like an error budget would have given us a clear framework to make those decisions without fighting."

If you're senior: Your story should be about leading this change. Discuss how you got buy-in from leadership, how you trained teams on the concepts, and how you evolved the SLOs over time as you gathered more data on what users actually cared about.

If you lack this experience: Explain the framework as your philosophy. "My approach is grounded in the SRE principle of data-driven decision making. To resolve the conflict between speed and stability, the first step is always to define what 'good enough' reliability looks like, which is the SLO..."