The Watchful Guardian: A Script to Tame Failing Services
Q: You have a service that sometimes enters a bad state and logs a fatal error without crashing. Write a script to monitor the service's log file, detect a specific error message, and automatically restart the service. Discuss your choice of language (e.g., Python vs. Bash).
Why this matters: This is a practical test of craftsmanship. Can you write a small, useful piece of automation? Do you understand the tools of your trade (shell, scripting languages)? Do you think about failure modes and robustness, even in a simple script?
Interview frequency: Very high for SRE, DevOps, and backend roles where you're expected to manage the services you build.
❌ The Death Trap
The trap is to write a "clever" but brittle one-liner, or a script that lacks basic safety checks. The candidate focuses only on the happy path and doesn't consider what could go wrong with their automation.
"Most people write something like: `tail -f /var/log/app.log | grep 'FATAL' | xargs systemctl restart app`. This is deeply flawed, both mechanically and conceptually. Mechanically, `xargs` appends the matched log lines as extra arguments to `systemctl`, and `grep` buffers its output when writing to a pipe, so the restart can fire late or with a mangled command. Conceptually: what if 'FATAL' appears in a non-error context? What if the service starts flapping (restarting every second)? This script is a loaded gun pointed at your production environment."
🔄 The Reframe
What they're really asking: "Can I trust you to build a small, automated system? Show me you can think through a problem, choose the right tool for the job, and build in the necessary safeguards."
This reveals your operational maturity. It's not about knowing the most obscure Bash commands; it's about demonstrating a philosophy of building simple, reliable, and understandable automation.
🧠 The Mental Model
I approach this with a simple control loop: **Observe, Decide, Act (Safely).** It's a fundamental pattern for any automated system.
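The pattern can be sketched as a generic skeleton, independent of any particular service (the function and parameter names here are illustrative, not from a real library):

```python
import time

def control_loop(observe, decide, act, interval=1.0, max_iterations=None):
    """Generic Observe-Decide-Act loop.

    observe() gathers state (e.g. new log lines), decide() maps it to an
    action or None, and act() carries the action out. max_iterations is
    only there so the loop can be bounded in tests; a real watcher runs
    forever.
    """
    n = 0
    while max_iterations is None or n < max_iterations:
        observation = observe()       # Observe: e.g. read new log lines
        action = decide(observation)  # Decide: e.g. "restart" or None
        if action is not None:
            act(action)               # Act: e.g. restart, with safeguards
        time.sleep(interval)
        n += 1
```

Every concrete script below is an instance of this loop; the "(Safely)" part lives in `decide` and `act`, where rate limits and sanity checks belong.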
📖 The War Story
Situation: "At a previous company, we had a legacy third-party inventory service. It was a black box, but critical. Occasionally, it would exhaust its database connection pool but wouldn't crash. It would just stop processing inventory updates and start logging `FATAL: Connection Pool Exhausted`."
Challenge: "A proper fix required a vendor patch that was months away. Manually restarting it was tedious and required someone to be watching dashboards 24/7. We needed a tactical, automated solution to keep the service healthy."
Stakes: "Every minute the service was in this zombie state, our website would show incorrect stock levels, leading to overselling and angry customers. We needed a reliable, automated guardian."
✅ The Answer
"My approach would be to start with the simplest tool that works (Bash) and then explain why and when I'd upgrade to a more robust tool (Python). The key is to start simple but think about the failure modes from the beginning."
Option 1: The Bash Approach (Quick & Simple)
For a quick fix, Bash is fantastic. Its power is in the pipeline.
```bash
#!/bin/bash
LOG_FILE="/var/log/inventory_service.log"
SERVICE_NAME="inventory.service"
ERROR_PATTERN="FATAL: Connection Pool Exhausted"

# -n 0: start from the end of the file
# -F: follow the file by name, robust against log rotation
tail -n 0 -F "$LOG_FILE" | while read -r line; do
    # grep -q is silent and exits 0 on a match, so the if-body runs only then
    if echo "$line" | grep -q "$ERROR_PATTERN"; then
        echo "Detected error: '$ERROR_PATTERN'. Restarting $SERVICE_NAME..."
        # IMPORTANT: In a real scenario, add rate-limiting here!
        systemctl restart "$SERVICE_NAME"
        echo "$SERVICE_NAME restarted."
    fi
done
```
Strength: Simple, fast, uses standard Unix tools. It gets the job done for a basic case. Using tail -F is a key detail that handles log rotation correctly.
Weakness: This script is naive. It has no protection against a "restart loop" or "flapping." If the service fails immediately on startup, this script will happily restart it every second, hammering the system. This is a significant risk.
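If the Bash version has to stay in service, a minimal cooldown can be bolted on with a timestamp check. A sketch of the idea, with the helper factored out so the logic is visible (`should_restart`, `COOLDOWN`, and `last_restart` are illustrative names, not part of the original script):

```shell
#!/bin/bash
COOLDOWN=300       # minimum seconds between restarts
last_restart=0     # epoch seconds of the previous restart (0 = never)

# Succeeds (exit 0) only if the cooldown has elapsed since the last restart.
should_restart() {
    local now=$1
    (( now - last_restart > COOLDOWN ))
}

# Inside the tail | while loop, the restart branch would become:
now=$(date +%s)
if should_restart "$now"; then
    echo "restarting"
    last_restart=$now
else
    echo "in cooldown; skipping restart" >&2
fi
```

One subtlety: because the `while` loop runs in the pipeline's subshell, `last_restart` persists across iterations of the loop, which is exactly where it is needed.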
Option 2: The Python Approach (Robust & Extensible)
When I need more control and safety, I reach for Python. It's more verbose, but that verbosity buys us robustness.
```python
#!/usr/bin/env python3
import subprocess
import time

LOG_FILE = "/var/log/inventory_service.log"
SERVICE_NAME = "inventory.service"
ERROR_PATTERN = "FATAL: Connection Pool Exhausted"
RESTART_COOLDOWN_SECONDS = 300  # 5 minutes

last_restart_time = 0
p = None
try:
    # Tail the file by name as a subprocess; -F survives log rotation
    p = subprocess.Popen(['tail', '-n', '0', '-F', LOG_FILE],
                         stdout=subprocess.PIPE,
                         stderr=subprocess.DEVNULL)  # unread PIPE could block tail
    print(f"Watching {LOG_FILE} for errors...")
    for line in iter(p.stdout.readline, b''):
        decoded_line = line.decode('utf-8', errors='replace').strip()
        if ERROR_PATTERN in decoded_line:
            current_time = time.time()
            if (current_time - last_restart_time) > RESTART_COOLDOWN_SECONDS:
                print(f"Error detected. Restarting {SERVICE_NAME}...")
                subprocess.run(['systemctl', 'restart', SERVICE_NAME], check=True)
                last_restart_time = current_time
                print("Restart complete. Cooldown initiated.")
            else:
                print("Error detected, but in cooldown period. Skipping.")
except Exception as e:
    print(f"An error occurred: {e}")
finally:
    if p is not None:
        p.terminate()  # don't leave an orphaned tail process behind
```
Strength: This is a proper piece of automation. It explicitly handles the most critical failure mode: service flapping. By tracking the `last_restart_time`, it creates a cooldown period. It's also easily extensible—we could add a function to send a Slack notification or log to a central service with just a few more lines.
🎯 The Memorable Hook
"The purpose of automation isn't to save you keystrokes; it's to save you from making a critical decision at 3 AM. A good script doesn't just act, it acts wisely."
This elevates the answer from a simple coding exercise to a discussion on the philosophy of reliable automation. It demonstrates that you think about the human factors and risks involved in letting machines make decisions.
💭 Inevitable Follow-ups
Q: "Your script is great, but how would you run this in production so it survives a reboot?"
Be ready: "I'd never just run it in the background with `&`. The correct way is to manage it with a process supervisor. I would write a simple `systemd` unit file for this script, setting `Restart=always`. This ensures the OS manages its lifecycle, handles logging, and brings it back up if the host machine reboots."
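A minimal unit file for the watcher might look like this (the unit name, script path, and description are illustrative):

```ini
# /etc/systemd/system/log-watcher.service  (illustrative name and path)
[Unit]
Description=Restart inventory service on fatal log errors
After=network.target

[Service]
ExecStart=/usr/local/bin/log_watcher.py
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
```

Enable it with `systemctl enable --now log-watcher.service`; the script's stdout then lands in the journal, viewable with `journalctl -u log-watcher`.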
Q: "What if the error message spans multiple lines?"
Be ready: "That's a great point and a limitation of line-by-line processing. Bash would struggle here significantly. In Python, you could implement a small state machine. You'd read lines into a buffer, and if you see the start of an error (e.g., a line with 'FATAL'), you'd keep reading subsequent lines until you find the end of the error pattern or a timeout is reached. This is a perfect example of where Python's expressiveness clearly beats a simple shell pipeline."
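A sketch of that buffering state machine in Python. It assumes an error block starts at a line containing a marker (e.g. `FATAL`) and continues through subsequent indented lines, a common stack-trace layout; both the function name and the continuation rule are illustrative assumptions:

```python
def collect_error_blocks(lines, start_marker="FATAL"):
    """Group a stream of log lines into multi-line error blocks.

    A block begins at a line containing start_marker and absorbs the
    indented continuation lines that follow it. Any non-indented line
    (or a new marker line) closes the current block.
    """
    block = []
    for line in lines:
        if start_marker in line:
            if block:                          # new error starts: flush the old one
                yield block
            block = [line]
        elif block and line.startswith((' ', '\t')):
            block.append(line)                 # continuation of the current error
        elif block:
            yield block                        # non-indented line ends the block
            block = []
    if block:                                  # flush a block still open at EOF
        yield block
```

In the real watcher, each yielded block would be matched against the full multi-line error pattern before deciding to restart; a production version would also add a timeout so a never-closed block can't stall detection.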
🔄 Adapt This Framework
If you're junior: Providing the working Bash script and explaining its basic function is a solid answer. Acknowledging its limitations (like flapping) shows you're thinking ahead, even if you don't write the Python version on the spot.
If you're senior: You should lead with the Python version or immediately mention the flapping problem with the Bash version. Your answer should proactively address the follow-up questions about daemonization and multi-line errors. You're expected to think about the entire lifecycle and robustness of the automation.
If you're asked to pseudo-code: Use the Python example's logic. Don't worry about exact syntax. Just write out the steps: `start tail process`, `loop forever`, `read a line`, `if pattern in line`, `check if cooldown period has passed`, `if yes, restart service and update last_restart_time`.
