The Busy Machine Fallacy: Why Your CPU Alerts Are Lying to You

Level: Junior/Mid Engineer · Asked at: Amazon, Meta, Google, Startups

Q: Write a Python script to check CPU usage, and if it’s above 80%, print an alert.

Why this matters: This question is a smoke test for your operational maturity. It's not about Python; it's about signal vs. noise. Can you distinguish between a server that's working hard and a server that's failing? Your answer shows whether you create tools that help on-call engineers or tools that burn them out.

Interview frequency: High. A fundamental task in system monitoring and SRE work.

❌ The Death Trap

The candidate writes a simple script that checks the CPU usage *once* and alerts. This is the most common and most wrong answer. It completely ignores the nature of time and state in computer systems.

"A naive candidate would write this:"

import psutil

if psutil.cpu_percent(interval=1) > 80:
    print("ALERT: CPU is over 80%!")

This script is actively harmful. A CPU spike is often normal—a web server handling a large request, a database running a query. This alert will fire constantly for legitimate work, creating a firehose of noise. Soon, the on-call engineer will learn to ignore it, and when a real problem occurs, it will be missed. This script creates alert fatigue and destroys trust in the monitoring system.

🔄 The Reframe

What they're really asking: "How do you build a reliable signal from a noisy metric? Show me that you understand the difference between a momentary spike and a sustained problem."

This reveals if you think in terms of state over time. It shows you respect the most valuable resource in an engineering organization: the focused attention of the on-call engineer. You're not just writing code; you're designing a system to communicate critical information to a human under pressure.

🧠 The Mental Model

I call this the **"Is the fever real?"** model. A single high temperature reading doesn't mean you're sick; you might have just run up the stairs. A doctor looks for a *sustained* fever over time.

1. Sample Continuously: Don't just take one temperature reading. Take a reading every few seconds.
2. Look at the Trend: Don't react to one high number. Collect the last N readings (e.g., the last 5 minutes).
3. Act on the Pattern: Only raise an alarm if the *average* of those readings is consistently high. This is a real fever, a real problem.
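The three steps above can be sketched in a few lines of Python (illustrative only; `is_fever` is a made-up helper name, and the sample values are simulated rather than real CPU readings):

```python
from collections import deque

def is_fever(readings, window_size, threshold):
    """Alert only when a full window of samples averages above the threshold."""
    if len(readings) < window_size:
        return False  # not enough history yet to judge
    return sum(readings) / len(readings) > threshold

readings = deque(maxlen=5)
for sample in [20, 95, 20, 20, 20]:   # one spike among normal readings
    readings.append(sample)
print(is_fever(readings, 5, 80))      # False: the average absorbs the spike

for sample in [90, 92, 95, 91, 94]:   # sustained high load
    readings.append(sample)
print(is_fever(readings, 5, 80))      # True: the pattern, not the point, triggers
```

The spike at 95% barely moves the window's average, while five consecutive 90%+ readings push it decisively over the threshold. That asymmetry is the whole model.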

📖 The War Story

Situation: "At a video processing company, we had a fleet of worker instances that would transcode video files. This was a very CPU-intensive process, by design."

Challenge: "An early version of our monitoring system used the naive CPU check. As a result, PagerDuty would scream every single time a new video job started, because the CPU would legitimately and correctly jump to 100%. The SRE team was drowning in false positives."

Stakes: "The real danger was a bug where a job could get stuck in a loop, pegging the CPU at 100% but making no progress. Because the team was so used to seeing high CPU alerts, a stuck worker could go unnoticed for hours, wasting thousands of dollars in compute costs and violating our processing time SLAs with customers."

✅ The Answer

"My approach is to build a script that understands the difference between a spike and a trend. A single data point is noise; a pattern is a signal. Here's a robust script that implements this philosophy."

The Robust Monitoring Script

import psutil
import time
from collections import deque

# --- Configuration ---
CPU_THRESHOLD = 80.0  # Percent
CHECK_INTERVAL_SECONDS = 10 # How often to sample CPU
DURATION_MINUTES = 5      # How long the average must be high to trigger
NUM_SAMPLES = (DURATION_MINUTES * 60) // CHECK_INTERVAL_SECONDS

def monitor_cpu():
    """Monitors CPU usage and alerts if the average is sustained above a threshold."""
    print(f"Monitoring CPU... Alerting if average over {DURATION_MINUTES} mins > {CPU_THRESHOLD}%")
    
    # Use a deque for efficient adding/removing from both ends
    cpu_readings = deque(maxlen=NUM_SAMPLES)
    
    while True:
        # psutil.cpu_percent(interval=1) is a blocking call. 
        # It compares system-wide CPU times over a 1-second interval.
        current_usage = psutil.cpu_percent(interval=1)
        cpu_readings.append(current_usage)
        
        # Only check the average if we have enough samples to be meaningful
        if len(cpu_readings) == NUM_SAMPLES:
            average_usage = sum(cpu_readings) / NUM_SAMPLES
            
            print(f"Current Usage: {current_usage:.1f}%, 5-min Avg: {average_usage:.1f}%", end='\r')
            
            if average_usage > CPU_THRESHOLD:
                # In a real system, this would call a pager, send a Slack message, etc.
                print(f"\nALERT: Sustained high CPU usage! Average over last {DURATION_MINUTES} mins is {average_usage:.1f}%")
                # Potentially add a cooldown here to avoid spamming alerts
        
        # Wait for the next check interval (minus the 1s spent in cpu_percent)
        time.sleep(CHECK_INTERVAL_SECONDS - 1)

if __name__ == "__main__":
    monitor_cpu()
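The script's "add a cooldown" comment deserves fleshing out, since a sustained problem would otherwise trigger the alert on every subsequent sample. Here is one hedged sketch; `AlertThrottle` is a hypothetical helper, with an injectable clock so the logic can be exercised deterministically:

```python
import time

class AlertThrottle:
    """Suppress repeat alerts within a cooldown window (hypothetical helper)."""

    def __init__(self, cooldown_seconds, clock=time.monotonic):
        self.cooldown = cooldown_seconds
        self.clock = clock          # injectable for testing
        self.last_fired = None

    def should_fire(self):
        now = self.clock()
        if self.last_fired is None or now - self.last_fired >= self.cooldown:
            self.last_fired = now   # record this firing and allow the alert
            return True
        return False                # still inside the cooldown window

# Usage: with a 15-minute cooldown, repeated breaches page at most once per window
fake_now = [0.0]
throttle = AlertThrottle(900, clock=lambda: fake_now[0])
print(throttle.should_fire())  # True: the first breach always fires
fake_now[0] = 60
print(throttle.should_fire())  # False: still inside the cooldown
fake_now[0] = 1000
print(throttle.should_fire())  # True: the cooldown has elapsed
```

In the monitor loop, the alert `print` would be wrapped in `if throttle.should_fire():`, so a stuck worker pages once per cooldown window instead of once per sample.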

Why This Is Better:

  • It's stateful: It uses a `deque` to remember recent history, turning a stateless check into a stateful monitor.
  • It's robust to spikes: A single 100% reading will barely move the 5-minute average. It requires a genuine, sustained problem to trigger the alert.
  • It's configurable: The threshold and duration aren't hardcoded magic numbers. They are explicit configuration values at the top, which is critical for tuning alerts for different services.
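The configurability point can be taken one step further by letting the environment override the defaults, which is how you'd tune the same script per service without editing it. A minimal sketch (the variable names are hypothetical, not a standard):

```python
import os

# Hypothetical environment overrides; each falls back to the in-file default
CPU_THRESHOLD = float(os.environ.get("CPU_ALERT_THRESHOLD", "80.0"))
DURATION_MINUTES = int(os.environ.get("CPU_ALERT_DURATION_MIN", "5"))
CHECK_INTERVAL_SECONDS = int(os.environ.get("CPU_CHECK_INTERVAL_SEC", "10"))

# Derived value stays consistent with whatever the operator configured
NUM_SAMPLES = (DURATION_MINUTES * 60) // CHECK_INTERVAL_SECONDS
print(NUM_SAMPLES)
```

With the defaults, `NUM_SAMPLES` works out to 30 samples over 5 minutes, matching the script above.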

🎯 The Memorable Hook

**"A single data point is noise; a pattern is a signal."** This hook connects a technical task to the ultimate business purpose: protecting the focused attention of the on-call engineer. It shows you think from the user backward, which is the foundation of good product and reliability engineering.

💭 Inevitable Follow-ups

Q: "This script monitors one machine. How would you scale this to a fleet of 1000 servers?"

Be ready: "This script is a good conceptual model, but it doesn't scale. For a fleet, you'd use a standard observability stack. Each server would run a lightweight agent like the Prometheus Node Exporter to expose metrics. A central Prometheus server would scrape these metrics, and an Alertmanager would handle the alerting, with a rule encoding the same logic as our script. Since `node_cpu_seconds_total` is a counter, utilization is derived from the rate of its idle mode: `(1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 0.8`. The script teaches the principle; Prometheus operationalizes it at scale."

Q: "What other metrics might you look at to confirm a high CPU alert is a real problem?"

Be ready: "I'd immediately correlate CPU with user-facing metrics. Is p99 request latency also increasing? Is the application error rate going up? I'd also check system-level metrics like Load Average (to see if tasks are waiting for CPU) and context switching (to see if the CPU is thrashing). A high CPU with flat latency might just be a busy but healthy server."
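Conveniently, psutil itself exposes the system-level signals mentioned above: `psutil.getloadavg()` (available on psutil >= 5.6.2; emulated on Windows) and `psutil.cpu_stats()`, whose `ctx_switches` field is a cumulative counter. A sketch of gathering them alongside the raw percentage; the `cpu_context` helper and the interpretation thresholds are illustrative:

```python
import psutil

def cpu_context():
    """Gather corroborating signals alongside raw CPU percent (a sketch)."""
    load1, _load5, _load15 = psutil.getloadavg()  # tasks running or waiting for CPU
    cores = psutil.cpu_count(logical=True) or 1
    stats = psutil.cpu_stats()                    # cumulative counters since boot
    return {
        "cpu_percent": psutil.cpu_percent(interval=1),
        "load_per_core": load1 / cores,           # > 1.0 suggests tasks are queueing
        "ctx_switches": stats.ctx_switches,       # sample twice and diff for a rate
    }

snapshot = cpu_context()
# Busy but healthy: high cpu_percent with load_per_core at or below ~1.0.
# Likely thrashing: load_per_core well above 1.0 plus a fast-growing
# context-switch rate, especially if request latency is climbing too.
print(snapshot)
```

The point of the helper is that no single field is decisive; it is the combination, plus the application-level latency and error metrics, that tells you whether the high CPU is work or trouble.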

🔄 Adapt This Framework

If you're junior: Writing the robust script and explaining the "fever" model is a fantastic answer that puts you ahead of most candidates. You demonstrate an understanding of the core problem.

If you're senior: Dismiss the naive approach in one sentence and spend your time on the robust solution. Proactively fold the follow-ups into your initial answer: scaling with Prometheus, and correlating with application-level metrics.

Written by Benito J D