The Busy Machine Fallacy: Why Your CPU Alerts Are Lying to You

Level: Junior/Mid Engineer · Asked at: Amazon, Meta, Google, Startups

Q: Write a Python script to check CPU usage, and if it’s above 80%, print an alert.

Why this matters: This question is a smoke test for your operational maturity. It's not about Python; it's about signal vs. noise. Can you distinguish between a server that's working hard and a server that's failing? Your answer shows whether you create tools that help on-call engineers or tools that burn them out.

Interview frequency: High. A fundamental task in system monitoring and SRE work.

❌ The Death Trap

The candidate writes a simple script that checks the CPU usage *once* and alerts. This is the most common and most wrong answer. It completely ignores the nature of time and state in computer systems.

"A naive candidate would write this:"

import psutil

if psutil.cpu_percent(interval=1) > 80:
    print("ALERT: CPU is over 80%!")

This script is actively harmful. A CPU spike is often normal—a web server handling a large request, a database running a query. This alert will fire constantly for legitimate work, creating a firehose of noise. Soon, the on-call engineer will learn to ignore it, and when a real problem occurs, it will be missed. This script creates alert fatigue and destroys trust in the monitoring system.

🔄 The Reframe

What they're really asking: "How do you build a reliable signal from a noisy metric? Show me that you understand the difference between a momentary spike and a sustained problem."

This reveals if you think in terms of state over time. It shows you respect the most valuable resource in an engineering organization: the focused attention of the on-call engineer. You're not just writing code; you're designing a system to communicate critical information to a human under pressure.

🧠 The Mental Model

I call this the **"Is the fever real?"** model. A single high temperature reading doesn't mean you're sick; you might have just run up the stairs. A doctor looks for a *sustained* fever over time.

1. Sample Continuously: Don't just take one temperature reading. Take a reading every few seconds.
2. Look at the Trend: Don't react to one high number. Collect the last N readings (e.g., the last 5 minutes).
3. Act on the Pattern: Only raise an alarm if the *average* of those readings is consistently high. This is a real fever, a real problem.
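The three steps above can be sketched in a few lines of Python (illustrative only; `is_fever` is a made-up helper name, and the sample values are simulated rather than real CPU readings):

```python
from collections import deque

def is_fever(readings, window_size, threshold):
    """Alert only when a full window of samples averages above the threshold."""
    if len(readings) < window_size:
        return False  # not enough history yet to judge
    return sum(readings) / len(readings) > threshold

readings = deque(maxlen=5)
for sample in [20, 95, 20, 20, 20]:   # one spike among normal readings
    readings.append(sample)
print(is_fever(readings, 5, 80))      # False: the average absorbs the spike

for sample in [90, 92, 95, 91, 94]:   # sustained high load
    readings.append(sample)
print(is_fever(readings, 5, 80))      # True: the pattern, not the point, triggers
```

The spike at 95% barely moves the window's average, while five consecutive 90%+ readings push it decisively over the threshold. That asymmetry is the whole model.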

📖 The War Story

Situation: "At a video processing company, we had a fleet of worker instances that would transcode video files. This was a very CPU-intensive process, by design."

Challenge: "An early version of our monitoring system used the naive CPU check. As a result, PagerDuty would scream every single time a new video job started, because the CPU would legitimately and correctly jump to 100%. The SRE team was drowning in false positives."

Stakes: "The real danger was a bug where a job could get stuck in a loop, pegging the CPU at 100% but making no progress. Because the team was so used to seeing high CPU alerts, a stuck worker could go unnoticed for hours, wasting thousands of dollars in compute costs and violating our processing time SLAs with customers."

✅ The Answer

"My approach is to build a script that understands the difference between a spike and a trend. A single data point is noise; a pattern is a signal. Here's a robust script that implements this philosophy."

The Robust Monitoring Script

import psutil
import time
from collections import deque

# --- Configuration ---
CPU_THRESHOLD = 80.0  # Percent
CHECK_INTERVAL_SECONDS = 10 # How often to sample CPU
DURATION_MINUTES = 5      # How long the average must be high to trigger
NUM_SAMPLES = (DURATION_MINUTES * 60) // CHECK_INTERVAL_SECONDS

def monitor_cpu():
    """Monitors CPU usage and alerts if the average is sustained above a threshold."""
    print(f"Monitoring CPU... Alerting if average over {DURATION_MINUTES} mins > {CPU_THRESHOLD}%")
    
    # Use a deque for efficient adding/removing from both ends
    cpu_readings = deque(maxlen=NUM_SAMPLES)
    
    while True:
        # psutil.cpu_percent(interval=1) is a blocking call. 
        # It compares system-wide CPU times over a 1-second interval.
        current_usage = psutil.cpu_percent(interval=1)
        cpu_readings.append(current_usage)
        
        # Only check the average if we have enough samples to be meaningful
        if len(cpu_readings) == NUM_SAMPLES:
            average_usage = sum(cpu_readings) / NUM_SAMPLES
            
            print(f"Current Usage: {current_usage:.1f}%, 5-min Avg: {average_usage:.1f}%", end='\r')
            
            if average_usage > CPU_THRESHOLD:
                # In a real system, this would call a pager, send a Slack message, etc.
                print(f"\nALERT: Sustained high CPU usage! Average over last {DURATION_MINUTES} mins is {average_usage:.1f}%")
                # Potentially add a cooldown here to avoid spamming alerts
        
        # Wait for the next check interval (minus the 1s spent in cpu_percent)
        time.sleep(CHECK_INTERVAL_SECONDS - 1)

if __name__ == "__main__":
    monitor_cpu()
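The script's "add a cooldown" comment deserves fleshing out, since a sustained problem would otherwise trigger the alert on every subsequent sample. Here is one hedged sketch; `AlertThrottle` is a hypothetical helper, with an injectable clock so the logic can be exercised deterministically:

```python
import time

class AlertThrottle:
    """Suppress repeat alerts within a cooldown window (hypothetical helper)."""

    def __init__(self, cooldown_seconds, clock=time.monotonic):
        self.cooldown = cooldown_seconds
        self.clock = clock          # injectable for testing
        self.last_fired = None

    def should_fire(self):
        now = self.clock()
        if self.last_fired is None or now - self.last_fired >= self.cooldown:
            self.last_fired = now   # record this firing and allow the alert
            return True
        return False                # still inside the cooldown window

# Usage: with a 15-minute cooldown, repeated breaches page at most once per window
fake_now = [0.0]
throttle = AlertThrottle(900, clock=lambda: fake_now[0])
print(throttle.should_fire())  # True: the first breach always fires
fake_now[0] = 60
print(throttle.should_fire())  # False: still inside the cooldown
fake_now[0] = 1000
print(throttle.should_fire())  # True: the cooldown has elapsed
```

In the monitor loop, the alert `print` would be wrapped in `if throttle.should_fire():`, so a stuck worker pages once per cooldown window instead of once per sample.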

Why This Is Better:

  • It's stateful: It uses a `deque` to remember recent history, turning a stateless check into a stateful monitor.
  • It's robust to spikes: A single 100% reading will barely move the 5-minute average. It requires a genuine, sustained problem to trigger the alert.
  • It's configurable: The threshold and duration aren't hardcoded magic numbers. They are explicit configuration values at the top, which is critical for tuning alerts for different services.
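The configurability point can be taken one step further by letting the environment override the defaults, which is how you'd tune the same script per service without editing it. A minimal sketch (the variable names are hypothetical, not a standard):

```python
import os

# Hypothetical environment overrides; each falls back to the in-file default
CPU_THRESHOLD = float(os.environ.get("CPU_ALERT_THRESHOLD", "80.0"))
DURATION_MINUTES = int(os.environ.get("CPU_ALERT_DURATION_MIN", "5"))
CHECK_INTERVAL_SECONDS = int(os.environ.get("CPU_CHECK_INTERVAL_SEC", "10"))

# Derived value stays consistent with whatever the operator configured
NUM_SAMPLES = (DURATION_MINUTES * 60) // CHECK_INTERVAL_SECONDS
print(NUM_SAMPLES)
```

With the defaults, `NUM_SAMPLES` works out to 30 samples over 5 minutes, matching the script above.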

🎯 The Memorable Hook

**"A single data point is noise; a pattern is a signal."** This hook connects a technical task to the ultimate business purpose: protecting the focused attention of the on-call engineer. It shows you think from the user backward, which is the foundation of good product and reliability engineering.

💭 Inevitable Follow-ups

Q: "This script monitors one machine. How would you scale this to a fleet of 1000 servers?"

Be ready: "This script is a good conceptual model, but it doesn't scale. For a fleet, you'd use a standard observability stack. Each server would run a lightweight agent like the Prometheus Node Exporter to expose metrics. A central Prometheus server would scrape these metrics, and an Alertmanager would handle the alerting, with a rule encoding the same logic as our script. Since `node_cpu_seconds_total` is a counter, utilization is derived from the rate of its idle mode: `(1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 0.8`. The script teaches the principle; Prometheus operationalizes it at scale."

Q: "What other metrics might you look at to confirm a high CPU alert is a real problem?"

Be ready: "I'd immediately correlate CPU with user-facing metrics. Is p99 request latency also increasing? Is the application error rate going up? I'd also check system-level metrics like Load Average (to see if tasks are waiting for CPU) and context switching (to see if the CPU is thrashing). A high CPU with flat latency might just be a busy but healthy server."
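Conveniently, psutil itself exposes the system-level signals mentioned above: `psutil.getloadavg()` (available on psutil >= 5.6.2; emulated on Windows) and `psutil.cpu_stats()`, whose `ctx_switches` field is a cumulative counter. A sketch of gathering them alongside the raw percentage; the `cpu_context` helper and the interpretation thresholds are illustrative:

```python
import psutil

def cpu_context():
    """Gather corroborating signals alongside raw CPU percent (a sketch)."""
    load1, _load5, _load15 = psutil.getloadavg()  # tasks running or waiting for CPU
    cores = psutil.cpu_count(logical=True) or 1
    stats = psutil.cpu_stats()                    # cumulative counters since boot
    return {
        "cpu_percent": psutil.cpu_percent(interval=1),
        "load_per_core": load1 / cores,           # > 1.0 suggests tasks are queueing
        "ctx_switches": stats.ctx_switches,       # sample twice and diff for a rate
    }

snapshot = cpu_context()
# Busy but healthy: high cpu_percent with load_per_core at or below ~1.0.
# Likely thrashing: load_per_core well above 1.0 plus a fast-growing
# context-switch rate, especially if request latency is climbing too.
print(snapshot)
```

The point of the helper is that no single field is decisive; it is the combination, plus the application-level latency and error metrics, that tells you whether the high CPU is work or trouble.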

🔄 Adapt This Framework

If you're junior: Writing the robust script and explaining the "fever" model is a fantastic answer that puts you ahead of most candidates. You demonstrate an understanding of the core problem.

If you're senior: Dismiss the naive approach in one sentence and spend your time on the robust solution. Proactively fold the follow-ups into your initial answer: scaling with Prometheus, and correlating with application-level metrics.

Written by Benito J D