The Watchful Guardian: A Script to Tame Failing Services
Q: You have a service that sometimes enters a bad state and logs a fatal error without crashing. Write a script to monitor the service's log file, detect a specific error message, and automatically restart the service. Discuss your choice of language (e.g., Python vs. Bash).
Why this matters: This is a practical test of craftsmanship. Can you write a small, useful piece of automation? Do you understand the tools of your trade (shell, scripting languages)? Do you think about failure modes and robustness, even in a simple script?
Interview frequency: Very high for SRE, DevOps, and backend roles where you're expected to manage the services you build.
❌ The Death Trap
The trap is to write a "clever" but brittle one-liner, or a script that lacks basic safety checks. The candidate focuses only on the happy path and doesn't consider what could go wrong with their automation.
"Most people write something like: `tail -f /var/log/app.log | grep 'FATAL' | xargs systemctl restart app`. This is deeply flawed, both mechanically and conceptually. Mechanically, `xargs` appends the matched log lines as extra arguments to `systemctl`, and `grep` buffers its output when writing to a pipe, so the restart can fire late or with a mangled command. Conceptually: what if 'FATAL' appears in a non-error context? What if the service starts flapping (restarting every second)? This script is a loaded gun pointed at your production environment."
🔄 The Reframe
What they're really asking: "Can I trust you to build a small, automated system? Show me you can think through a problem, choose the right tool for the job, and build in the necessary safeguards."
This reveals your operational maturity. It's not about knowing the most obscure Bash commands; it's about demonstrating a philosophy of building simple, reliable, and understandable automation.
🧠 The Mental Model
I approach this with a simple control loop: **Observe, Decide, Act (Safely).** It's a fundamental pattern for any automated system.
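The pattern can be sketched as a generic skeleton, independent of any particular service (the function and parameter names here are illustrative, not from a real library):

```python
import time

def control_loop(observe, decide, act, interval=1.0, max_iterations=None):
    """Generic Observe-Decide-Act loop.

    observe() gathers state (e.g. new log lines), decide() maps it to an
    action or None, and act() carries the action out. max_iterations is
    only there so the loop can be bounded in tests; a real watcher runs
    forever.
    """
    n = 0
    while max_iterations is None or n < max_iterations:
        observation = observe()       # Observe: e.g. read new log lines
        action = decide(observation)  # Decide: e.g. "restart" or None
        if action is not None:
            act(action)               # Act: e.g. restart, with safeguards
        time.sleep(interval)
        n += 1
```

Every concrete script below is an instance of this loop; the "(Safely)" part lives in `decide` and `act`, where rate limits and sanity checks belong.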
📖 The War Story
Situation: "At a previous company, we had a legacy third-party inventory service. It was a black box, but critical. Occasionally, it would exhaust its database connection pool but wouldn't crash. It would just stop processing inventory updates and start logging `FATAL: Connection Pool Exhausted`."
Challenge: "A proper fix required a vendor patch that was months away. Manually restarting it was tedious and required someone to be watching dashboards 24/7. We needed a tactical, automated solution to keep the service healthy."
Stakes: "Every minute the service was in this zombie state, our website would show incorrect stock levels, leading to overselling and angry customers. We needed a reliable, automated guardian."
✅ The Answer
"My approach would be to start with the simplest tool that works (Bash) and then explain why and when I'd upgrade to a more robust tool (Python). The key is to start simple but think about the failure modes from the beginning."
Option 1: The Bash Approach (Quick & Simple)
For a quick fix, Bash is fantastic. Its power is in the pipeline.
```bash
#!/bin/bash
LOG_FILE="/var/log/inventory_service.log"
SERVICE_NAME="inventory.service"
ERROR_PATTERN="FATAL: Connection Pool Exhausted"

# -n 0: start from the end of the file
# -F: follow the file by name, robust against log rotation
tail -n 0 -F "$LOG_FILE" | while read -r line; do
    # grep -q is silent and exits 0 on a match, so the if-body runs only then
    if echo "$line" | grep -q "$ERROR_PATTERN"; then
        echo "Detected error: '$ERROR_PATTERN'. Restarting $SERVICE_NAME..."
        # IMPORTANT: In a real scenario, add rate-limiting here!
        systemctl restart "$SERVICE_NAME"
        echo "$SERVICE_NAME restarted."
    fi
done
```
Strength: Simple, fast, uses standard Unix tools. It gets the job done for a basic case. Using tail -F is a key detail that handles log rotation correctly.
Weakness: This script is naive. It has no protection against a "restart loop" or "flapping." If the service fails immediately on startup, this script will happily restart it every second, hammering the system. This is a significant risk.
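If the Bash version has to stay in service, a minimal cooldown can be bolted on with a timestamp check. A sketch of the idea, with the helper factored out so the logic is visible (`should_restart`, `COOLDOWN`, and `last_restart` are illustrative names, not part of the original script):

```shell
#!/bin/bash
COOLDOWN=300       # minimum seconds between restarts
last_restart=0     # epoch seconds of the previous restart (0 = never)

# Succeeds (exit 0) only if the cooldown has elapsed since the last restart.
should_restart() {
    local now=$1
    (( now - last_restart > COOLDOWN ))
}

# Inside the tail | while loop, the restart branch would become:
now=$(date +%s)
if should_restart "$now"; then
    echo "restarting"
    last_restart=$now
else
    echo "in cooldown; skipping restart" >&2
fi
```

One subtlety: because the `while` loop runs in the pipeline's subshell, `last_restart` persists across iterations of the loop, which is exactly where it is needed.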
Option 2: The Python Approach (Robust & Extensible)
When I need more control and safety, I reach for Python. It's more verbose, but that verbosity buys us robustness.
```python
#!/usr/bin/env python3
import subprocess
import time

LOG_FILE = "/var/log/inventory_service.log"
SERVICE_NAME = "inventory.service"
ERROR_PATTERN = "FATAL: Connection Pool Exhausted"
RESTART_COOLDOWN_SECONDS = 300  # 5 minutes

last_restart_time = 0
p = None
try:
    # Tail the file by name as a subprocess; -F survives log rotation
    p = subprocess.Popen(['tail', '-n', '0', '-F', LOG_FILE],
                         stdout=subprocess.PIPE,
                         stderr=subprocess.DEVNULL)  # unread PIPE could block tail
    print(f"Watching {LOG_FILE} for errors...")
    for line in iter(p.stdout.readline, b''):
        decoded_line = line.decode('utf-8', errors='replace').strip()
        if ERROR_PATTERN in decoded_line:
            current_time = time.time()
            if (current_time - last_restart_time) > RESTART_COOLDOWN_SECONDS:
                print(f"Error detected. Restarting {SERVICE_NAME}...")
                subprocess.run(['systemctl', 'restart', SERVICE_NAME], check=True)
                last_restart_time = current_time
                print("Restart complete. Cooldown initiated.")
            else:
                print("Error detected, but in cooldown period. Skipping.")
except Exception as e:
    print(f"An error occurred: {e}")
finally:
    if p is not None:
        p.terminate()  # don't leave an orphaned tail process behind
```
Strength: This is a proper piece of automation. It explicitly handles the most critical failure mode: service flapping. By tracking the `last_restart_time`, it creates a cooldown period. It's also easily extensible—we could add a function to send a Slack notification or log to a central service with just a few more lines.
🎯 The Memorable Hook
"The purpose of automation isn't to save you keystrokes; it's to save you from making a critical decision at 3 AM. A good script doesn't just act, it acts wisely."
This elevates the answer from a simple coding exercise to a discussion on the philosophy of reliable automation. It demonstrates that you think about the human factors and risks involved in letting machines make decisions.
💭 Inevitable Follow-ups
Q: "Your script is great, but how would you run this in production so it survives a reboot?"
Be ready: "I'd never just run it in the background with `&`. The correct way is to manage it with a process supervisor. I would write a simple `systemd` unit file for this script, setting `Restart=always`. This ensures the OS manages its lifecycle, handles logging, and brings it back up if the host machine reboots."
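A minimal unit file for the watcher might look like this (the unit name, script path, and description are illustrative):

```ini
# /etc/systemd/system/log-watcher.service  (illustrative name and path)
[Unit]
Description=Restart inventory service on fatal log errors
After=network.target

[Service]
ExecStart=/usr/local/bin/log_watcher.py
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
```

Enable it with `systemctl enable --now log-watcher.service`; the script's stdout then lands in the journal, viewable with `journalctl -u log-watcher`.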
Q: "What if the error message spans multiple lines?"
Be ready: "That's a great point and a limitation of line-by-line processing. Bash would struggle here significantly. In Python, you could implement a small state machine. You'd read lines into a buffer, and if you see the start of an error (e.g., a line with 'FATAL'), you'd keep reading subsequent lines until you find the end of the error pattern or a timeout is reached. This is a perfect example of where Python's expressiveness clearly beats a simple shell pipeline."
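A sketch of that buffering state machine in Python. It assumes an error block starts at a line containing a marker (e.g. `FATAL`) and continues through subsequent indented lines, a common stack-trace layout; both the function name and the continuation rule are illustrative assumptions:

```python
def collect_error_blocks(lines, start_marker="FATAL"):
    """Group a stream of log lines into multi-line error blocks.

    A block begins at a line containing start_marker and absorbs the
    indented continuation lines that follow it. Any non-indented line
    (or a new marker line) closes the current block.
    """
    block = []
    for line in lines:
        if start_marker in line:
            if block:                          # new error starts: flush the old one
                yield block
            block = [line]
        elif block and line.startswith((' ', '\t')):
            block.append(line)                 # continuation of the current error
        elif block:
            yield block                        # non-indented line ends the block
            block = []
    if block:                                  # flush a block still open at EOF
        yield block
```

In the real watcher, each yielded block would be matched against the full multi-line error pattern before deciding to restart; a production version would also add a timeout so a never-closed block can't stall detection.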
🔄 Adapt This Framework
If you're junior: Providing the working Bash script and explaining its basic function is a solid answer. Acknowledging its limitations (like flapping) shows you're thinking ahead, even if you don't write the Python version on the spot.
If you're senior: You should lead with the Python version or immediately mention the flapping problem with the Bash version. Your answer should proactively address the follow-up questions about daemonization and multi-line errors. You're expected to think about the entire lifecycle and robustness of the automation.
If you're asked to pseudo-code: Use the Python example's logic. Don't worry about exact syntax. Just write out the steps: `start tail process`, `loop forever`, `read a line`, `if pattern in line`, `check if cooldown period has passed`, `if yes, restart service and update last_restart_time`.
