Beyond Cron Jobs: Crafting a Bulletproof Server Cleanup Script
Q: "Write a shell script to automate log rotation and system cleanup, including error handling and notifications."
Why this matters: This question isn't about your bash wizardry. It's a test of your operational maturity. They want to know if you build systems that are reliable and observable, or if you create silent liabilities that will page an SRE at 3 AM.
Interview frequency: Very high. A staple for DevOps, SRE, and Platform Engineering roles.
❌ The Death Trap
The rookie mistake is to dive straight into a simple, linear script: a few `find`, `rm`, and `mv` commands that prove the candidate can solve the problem in a perfect world, but also that they have never experienced the chaos of production.
"Most people say: 'I'd write a script that does `find /var/log -mtime +7 -delete` to clean old logs, and `find /tmp -mtime +1 -delete` for temp files, then put it in a nightly cron job.'"
This answer is a red flag. It's brittle, it has no safety checks, it fails silently, and it lacks any form of observability. A cron job that fails without anyone knowing isn't automation; it's a time bomb.
🔄 The Reframe
What they're really asking: "Can you create automation that we can trust? Show us how you would build a small, reliable system that is defensive, idempotent, and tells us when it's in trouble."
This reveals your ability to think about failure modes, race conditions, and the human element of operations. It separates a scripter from a systems engineer.
🧠 The Mental Model
I call my approach "The Guardian Script". A guardian is proactive, cautious, transparent, and always cleans up after itself. It operates on four core principles:

1. **Safety first** — never run concurrently, and never run when the system is already in a bad state.
2. **Idempotency** — running it twice is as safe as running it once.
3. **Observability** — log every action, and shout loudly the moment something fails.
4. **Guaranteed cleanup** — always release locks and temporary state on exit, even after an error.
📖 The War Story
Situation: "At a previous e-commerce company, our application servers' disks would fill up at random, always during peak hours. This was caused by developers leaving verbose debug logging enabled."
Challenge: "The existing 'cleanup' cron job was a simple `find-and-delete` script. It would often fail silently on permission errors or if a log file was in use. The team would only know there was a problem when the entire site went down."
Stakes: "Each outage meant tens of thousands in lost revenue and a massive hit to customer trust. The engineering team was constantly in firefighting mode, unable to work on new features."
✅ The Answer
My Thinking Process:
"The core problem wasn't just 'full disks,' it was 'unpredictable, silent failure.' The solution needed to be a trustworthy system, not just a script. I decided to build a 'Guardian Script' that was robust, observable, and safe by design."
What I Did:
"First, I implemented a **lock file mechanism** using `flock`. This ensured that if a previous run was stuck, a new one wouldn't start and cause chaos. Safety first.
Second, I externalized all configuration. Paths, retention days, and disk usage thresholds were moved to a separate `/etc/janitor.conf` file. This meant we could adjust behavior without ever touching the script's logic.
Third, I structured the script with **modular functions**: `rotate_app_logs`, `cleanup_tmp_dirs`, `prune_old_docker_images`. Each function was responsible for one task and logged its start and end. I also added a master `health_check` function that aborted if disk space was already critically low.
Fourth, and most critically, I added **robust error handling and reporting**. The script started with `set -euo pipefail`. I created a `notify_error` function that would be called with the line number and error message upon failure, sending a detailed alert to our team's Slack channel. I wrapped every critical command in a block that would call this function on failure.
Finally, I added a `trap cleanup EXIT` at the top of the script. The `cleanup` function would always run on exit, removing the lock file. This guaranteed the script would never leave itself in a locked state."
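The five pieces above fit together into a skeleton like the following. This is a minimal sketch, not the original production code: the config keys, the `MAX_DISK_PCT` threshold, the lock-file path, and the commented-out Slack call are illustrative assumptions (only `/etc/janitor.conf` and the function names come from the story above), and all paths are overridable via environment variables.

```shell
#!/usr/bin/env bash
# Sketch of the "Guardian Script" skeleton. Paths, config keys, and the
# notification hook are illustrative assumptions, not production code.
set -euo pipefail

CONF="${JANITOR_CONF:-/etc/janitor.conf}"     # e.g. RETENTION_DAYS=7
LOCK_FILE="${LOCK_FILE:-/tmp/janitor.lock}"

if [ -f "$CONF" ]; then
    # shellcheck source=/dev/null
    . "$CONF"
fi

notify_error() {   # called by the ERR trap with the failing line number
    echo "janitor failed at line $1" >&2
    # e.g. curl -fsS -X POST "$SLACK_WEBHOOK" -d "{\"text\":\"janitor failed at line $1\"}"
}
trap 'notify_error $LINENO' ERR

# Refuse to run concurrently: take an exclusive, non-blocking lock.
exec 9>"$LOCK_FILE"
flock -n 9 || { echo "another run is in progress, exiting" >&2; exit 0; }

cleanup() { rm -f "$LOCK_FILE"; }   # always release the lock we now own
trap cleanup EXIT

health_check() {   # abort if the disk is already critically full
    local used
    used=$(df -P / | awk 'NR==2 {gsub(/%/, ""); print $5}')
    if [ "$used" -ge "${MAX_DISK_PCT:-98}" ]; then
        echo "disk already at ${used}%, refusing to run" >&2
        exit 2
    fi
}

rotate_app_logs() {
    local dir="${APP_LOG_DIR:-/var/log/app}"
    [ -d "$dir" ] || { echo "rotate_app_logs: $dir missing, skipping"; return 0; }
    echo "rotate_app_logs: start"
    find "$dir" -name '*.log' -mtime +"${RETENTION_DAYS:-7}" -exec gzip {} \;
    echo "rotate_app_logs: done"
}

cleanup_tmp_dirs() {
    local dir="${TMP_DIR:-/tmp/app-scratch}"
    [ -d "$dir" ] || { echo "cleanup_tmp_dirs: $dir missing, skipping"; return 0; }
    echo "cleanup_tmp_dirs: start"
    find "$dir" -mindepth 1 -mtime +1 -delete
    echo "cleanup_tmp_dirs: done"
}

health_check
rotate_app_logs
cleanup_tmp_dirs
```

Note one subtlety: the `trap cleanup EXIT` is installed only after the lock is acquired, so a run that loses the `flock` race never deletes a lock file it does not own.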
The Outcome:
"Disk-related production outages dropped to zero. We started getting actionable alerts like 'Cleanup failed in `rotate_app_logs` at line 94: permission denied on /var/log/app/payment.log'. This allowed us to fix underlying permission issues proactively. The script gave us confidence, turning an unpredictable liability into a reliable utility."
What I Learned:
"I learned that the most important feature of any automation is trustworthiness. A simple script that runs reliably and screams when it's broken is infinitely more valuable than a complex one that fails silently. The goal isn't just to automate a task; it's to automate peace of mind."
🎯 The Memorable Hook
"A script without locks and logs is just a bug waiting for a cron schedule."
This one-liner shows you understand the inherent dangers of "fire-and-forget" automation and that you prioritize safety and observability in the systems you build.
💭 Inevitable Follow-ups
Q: "Why not just use the built-in `logrotate` utility?"
Be ready: "`logrotate` is a fantastic tool, and we'd absolutely use it for standard system logs like syslog or nginx access logs. However, this 'Guardian Script' serves a broader purpose. It acts as a centralized janitor for application-specific cleanup that goes beyond simple rotation—like clearing temporary directories, pruning old build artifacts, or even running `docker system prune`. It provides a single, observable, and customizable point of control for overall system hygiene."
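For the standard-logs half of that answer, a typical `logrotate` stanza looks like this (the nginx path and retention values are illustrative, not from the original setup):

```
# /etc/logrotate.d/nginx — illustrative example
/var/log/nginx/*.log {
    daily
    rotate 14
    compress
    delaycompress
    missingok
    notifempty
    sharedscripts
    postrotate
        # signal nginx to reopen its log files after rotation
        [ -f /var/run/nginx.pid ] && kill -USR1 "$(cat /var/run/nginx.pid)"
    endscript
}
```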
Q: "How would you test a script like this?"
Be ready: "I'd use a two-pronged approach. For unit testing, a framework like `bats` (Bash Automated Testing System) is perfect. I'd create test cases for each function, using temporary directories and mock files to assert that the correct files are deleted or compressed. For integration testing, I'd build a Docker container with the script and a test configuration, run it, and then inspect the container's filesystem to verify the end state is what we expect. This entire process would be part of a CI pipeline."
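The arrange/act/assert pattern behind those `bats` test cases can be sketched in plain shell as well; `bats` wraps the same steps in `@test` blocks with nicer reporting. The function under test here is a hypothetical stand-in:

```shell
#!/usr/bin/env bash
# Plain-shell version of the unit-test pattern described above.
# `cleanup_tmp_dirs` is an illustrative stand-in for the real function.
set -euo pipefail

cleanup_tmp_dirs() {
    find "$1" -mindepth 1 -mtime +1 -delete
}

# Arrange: a sandbox with one stale file and one fresh file.
sandbox=$(mktemp -d)
touch -d '2 days ago' "$sandbox/stale.tmp"
touch "$sandbox/fresh.tmp"

# Act: run the function against the sandbox.
cleanup_tmp_dirs "$sandbox"

# Assert: the stale file is gone, the fresh one survives.
[ ! -e "$sandbox/stale.tmp" ] && echo "PASS: stale removed"
[ -e "$sandbox/fresh.tmp" ] && echo "PASS: fresh kept"
```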
🔄 Adapt This Framework
If you're junior/mid-level: Focus on explaining the 'why' behind `set -e`, the importance of logging your actions, and the basic concept of a lock file to prevent chaos. Show that you are thinking about what could go wrong, even if you haven't built a complex system yet.
If you're senior: Expand on making the script even more robust. Discuss emitting metrics to Prometheus (e.g., number of bytes cleaned, script duration) for long-term trend analysis. Talk about how you would deploy and manage the configuration file using Ansible or SaltStack to ensure consistency across a fleet of servers.
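One common way to emit those metrics from a shell script is node_exporter's textfile collector: the script renders a `.prom` file that the exporter scrapes. A minimal sketch — the metric names are assumptions, and the demo writes to a throwaway directory so it runs anywhere; production would target the collector directory (typically `/var/lib/node_exporter/textfile_collector`):

```shell
#!/usr/bin/env bash
# Sketch: expose janitor metrics via node_exporter's textfile collector.
# Metric names are illustrative; the demo directory is a stand-in for the
# real collector directory.
set -euo pipefail

TEXTFILE_DIR="${TEXTFILE_DIR:-$(mktemp -d)}"
start=$(date +%s)
bytes_cleaned=0   # the real script would accumulate this as it deletes

# ... cleanup work would happen here ...

duration=$(( $(date +%s) - start ))

# Write atomically: render to a temp file, then rename into place, so the
# exporter never scrapes a half-written file.
tmp=$(mktemp)
cat > "$tmp" <<EOF
# HELP janitor_bytes_cleaned Bytes removed by the last janitor run.
# TYPE janitor_bytes_cleaned gauge
janitor_bytes_cleaned $bytes_cleaned
# HELP janitor_duration_seconds Wall-clock duration of the last run.
# TYPE janitor_duration_seconds gauge
janitor_duration_seconds $duration
EOF
mv "$tmp" "$TEXTFILE_DIR/janitor.prom"
```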
If you lack this experience: Bridge from a related area. "While I haven't built this exact script for a production fleet, I applied the same defensive principles to the deployment scripts in my personal CI/CD pipeline. I implemented locks to prevent concurrent deployments and added detailed Slack notifications so I would know immediately if a build artifact failed to upload..."
