The Busy Machine Fallacy: Why Your CPU Alerts Are Lying to You
Q: Write a Python script to check CPU usage, and if it’s above 80%, print an alert.
Why this matters: This question is a smoke test for your operational maturity. It's not about Python; it's about signal vs. noise. Can you distinguish between a server that's working hard and a server that's failing? Your answer shows whether you create tools that help on-call engineers or tools that burn them out.
Interview frequency: High. A fundamental task in system monitoring and SRE work.
❌ The Death Trap
The candidate writes a simple script that checks the CPU usage *once* and alerts. This is the most common and most wrong answer. It completely ignores the nature of time and state in computer systems.
A naive candidate would write this:
```python
import psutil

# Checks once and alerts -- no history, no context
if psutil.cpu_percent(interval=1) > 80:
    print("ALERT: CPU is over 80%!")
```
This script is actively harmful. A CPU spike is often normal—a web server handling a large request, a database running a query. This alert will fire constantly for legitimate work, creating a firehose of noise. Soon, the on-call engineer will learn to ignore it, and when a real problem occurs, it will be missed. This script creates alert fatigue and destroys trust in the monitoring system.
🔄 The Reframe
What they're really asking: "How do you build a reliable signal from a noisy metric? Show me that you understand the difference between a momentary spike and a sustained problem."
This reveals if you think in terms of state over time. It shows you respect the most valuable resource in an engineering organization: the focused attention of the on-call engineer. You're not just writing code; you're designing a system to communicate critical information to a human under pressure.
🧠 The Mental Model
I call this the **"Is the fever real?"** model. A single high temperature reading doesn't mean you're sick; you might have just run up the stairs. A doctor looks for a *sustained* fever over time.
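To make the fever analogy concrete, here is a minimal sketch (the sample values are invented for illustration) of how little a single spike moves a rolling average:

```python
from collections import deque

# 30 samples = 5 minutes of history at one sample every 10 seconds
readings = deque([30.0] * 30, maxlen=30)  # a steadily ~30%-busy server
readings.append(95.0)                     # one momentary spike arrives

average = sum(readings) / len(readings)
print(f"Average after the spike: {average:.1f}%")  # ~32.2% -- nowhere near 80%
```

One spike among thirty samples barely registers; only a *sustained* fever can drag the average past the threshold.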
📖 The War Story
Situation: "At a video processing company, we had a fleet of worker instances that would transcode video files. This was a very CPU-intensive process, by design."
Challenge: "An early version of our monitoring system used the naive CPU check. As a result, PagerDuty would scream every single time a new video job started, because the CPU would legitimately and correctly jump to 100%. The SRE team was drowning in false positives."
Stakes: "The real danger was a bug where a job could get stuck in a loop, pegging the CPU at 100% but making no progress. Because the team was so used to seeing high CPU alerts, a stuck worker could go unnoticed for hours, wasting thousands of dollars in compute costs and violating our processing time SLAs with customers."
✅ The Answer
"My approach is to build a script that understands the difference between a spike and a trend. A single data point is noise; a pattern is a signal. Here's a robust script that implements this philosophy."
The Robust Monitoring Script
```python
import psutil
import time
from collections import deque

# --- Configuration ---
CPU_THRESHOLD = 80.0         # Percent
CHECK_INTERVAL_SECONDS = 10  # How often to sample CPU
DURATION_MINUTES = 5         # How long the average must be high to trigger
NUM_SAMPLES = (DURATION_MINUTES * 60) // CHECK_INTERVAL_SECONDS

def monitor_cpu():
    """Monitors CPU usage and alerts if the average stays above a threshold."""
    print(f"Monitoring CPU... Alerting if average over {DURATION_MINUTES} mins > {CPU_THRESHOLD}%")
    # A deque with maxlen drops the oldest sample automatically
    cpu_readings = deque(maxlen=NUM_SAMPLES)
    while True:
        # psutil.cpu_percent(interval=1) is a blocking call.
        # It compares system-wide CPU times over a 1-second interval.
        current_usage = psutil.cpu_percent(interval=1)
        cpu_readings.append(current_usage)
        # Only check the average once we have a full window of samples
        if len(cpu_readings) == NUM_SAMPLES:
            average_usage = sum(cpu_readings) / NUM_SAMPLES
            print(f"Current Usage: {current_usage:.1f}%, {DURATION_MINUTES}-min Avg: {average_usage:.1f}%", end='\r')
            if average_usage > CPU_THRESHOLD:
                # In a real system, this would call a pager, send a Slack message, etc.
                print(f"\nALERT: Sustained high CPU usage! Average over last {DURATION_MINUTES} mins is {average_usage:.1f}%")
                # Potentially add a cooldown here to avoid spamming alerts
        # Wait for the next check interval (minus the 1s spent in cpu_percent)
        time.sleep(CHECK_INTERVAL_SECONDS - 1)

if __name__ == "__main__":
    monitor_cpu()
```
Why This Is Better:
- It's stateful: It uses a `deque` to remember recent history, turning a stateless check into a stateful monitor.
- It's robust to spikes: A single 100% reading will barely move the 5-minute average. It requires a genuine, sustained problem to trigger the alert.
- It's configurable: The threshold and duration aren't hardcoded magic numbers. They are explicit configuration values at the top, which is critical for tuning alerts for different services.
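The cooldown mentioned in the script's comments deserves a sketch. One common pattern (the function name and the 15-minute window here are my own illustrative choices, not part of the script above) is to suppress repeat alerts for a fixed period after one fires:

```python
import time

ALERT_COOLDOWN_SECONDS = 15 * 60  # illustrative: don't re-page within 15 minutes
last_alert_time = float("-inf")   # so the very first breach always fires

def maybe_alert(average_usage, threshold=80.0):
    """Fire only if the threshold is breached and the cooldown has expired."""
    global last_alert_time
    now = time.monotonic()
    if average_usage > threshold and now - last_alert_time >= ALERT_COOLDOWN_SECONDS:
        print(f"ALERT: sustained CPU average {average_usage:.1f}%")
        last_alert_time = now
        return True
    return False
```

Inside the monitoring loop, the alerting `print` would become a call to `maybe_alert(average_usage)`, so a sustained incident pages once instead of every ten seconds.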
🎯 The Memorable Hook
"The goal of monitoring is not to measure the health of a machine. It's to measure the quality of the user's experience. High CPU is just a clue, not a conclusion."
This connects a technical task to the ultimate business purpose. It shows you think from the user backward, which is the foundation of good product and reliability engineering.
💭 Inevitable Follow-ups
Q: "This script monitors one machine. How would you scale this to a fleet of 1000 servers?"
Be ready: "This script is a good conceptual model, but it doesn't scale. For a fleet, you'd use a standard observability stack. Each server would run a lightweight agent like the Prometheus Node Exporter to expose metrics. A central Prometheus server would scrape these metrics, and an Alertmanager would handle the alerting logic, which mirrors our script's: something like `100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80` (since `node_cpu_seconds_total` is a counter, you take the `rate` of idle time and invert it rather than averaging the raw series). The script teaches the principle; Prometheus operationalizes it at scale."
Q: "What other metrics might you look at to confirm a high CPU alert is a real problem?"
Be ready: "I'd immediately correlate CPU with user-facing metrics. Is p99 request latency also increasing? Is the application error rate going up? I'd also check system-level metrics like Load Average (to see if tasks are waiting for CPU) and context switching (to see if the CPU is thrashing). A high CPU with flat latency might just be a busy but healthy server."
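As a rough sketch of that correlation step on a single host (the thresholds and return strings are illustrative assumptions, not fixed rules), `psutil` also exposes the load average, which you can compare against the core count to tell queuing from mere busyness:

```python
import os
import psutil

def cpu_saturation_hint():
    """Classify high CPU as 'saturated' only when runnable tasks exceed cores."""
    cores = os.cpu_count() or 1
    _load_1m, load_5m, _load_15m = psutil.getloadavg()  # 1-, 5-, 15-minute load
    cpu = psutil.cpu_percent(interval=1)
    if cpu > 80 and load_5m > cores:
        return "saturated: tasks are queuing for CPU"
    if cpu > 80:
        return "busy but keeping up"
    return "healthy"

print(cpu_saturation_hint())
```

A load average persistently above the core count means runnable tasks are waiting; high CPU with load at or below the core count is often just a server doing its job.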
🔄 Adapt This Framework
If you're junior: Writing the robust script and explaining the "fever" model is a fantastic answer that puts you ahead of most candidates. You demonstrate an understanding of the core problem.
If you're senior: You should dismiss the naive approach in one sentence and spend your time on the robust solution. You should proactively bring up the follow-up questions about scaling with Prometheus and correlating with application-level metrics as part of your initial answer.
