The Sentinel's Logbook: A Python Script That Listens to Reality
Q: Can you write a Python script that monitors a directory for changes—like file creation, modification, or deletion—and logs those changes?
Why this matters: This seems like a simple coding challenge, but it's a test of your systems thinking. The interviewer wants to know if you can build a reliable agent that creates a feedback loop from a fundamental part of the operating system. Can you turn a simple requirement into a production-ready tool?
Interview frequency: High for SRE, DevOps, and Platform Engineering roles.
❌ The Death Trap
The candidate provides a quick, fragile script using a low-level library or a naive `while True` loop with `os.listdir`. They present the code without context or a story.
"Sure, I'd use the `watchdog` library. You just create an event handler and an observer, and then start it. Here's the code..."
This answer is technically functional but lacks any senior-level thinking. It solves the immediate problem but ignores reliability, robustness, and the *purpose* behind the request.
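To make the failure mode concrete, here is a minimal sketch of that naive polling approach (the watch directory and 15-minute interval are illustrative placeholders, not values from the original story):

```python
import os
import time


def snapshot(path):
    """One-shot listing of a directory's contents."""
    return set(os.listdir(path))


def diff(before, after):
    """Return (created, deleted) names between two snapshots."""
    return sorted(after - before), sorted(before - after)


def poll_forever(path, interval=900):
    """The fragile loop: sleep, rescan, diff.

    Blind spots: modifications are invisible, a create-then-delete
    inside one interval is never seen, and a "created" file may
    still be half-written when we act on it.
    """
    seen = snapshot(path)
    while True:
        time.sleep(interval)
        current = snapshot(path)
        created, deleted = diff(seen, current)
        for name in created:
            print("new file:", name)
        for name in deleted:
            print("deleted:", name)
        seen = current
```

Everything between two scans is invisible, which is exactly the blindness the war story below describes.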
🔄 The Reframe
What they're really asking: "Can you build a reliable, event-driven agent that acts as a nervous system for a critical part of our infrastructure? Can you explain how this simple tool is actually a fundamental building block for automation, security, and data pipelines?"
This reframes the task from writing a script to architecting a solution. It's about demonstrating your ability to see a simple tool as a point of high leverage in a complex system.
🧠 The Mental Model
The "Digital Sentinel" model. This script isn't a passive "watcher"; it's a sentinel guarding a critical asset: it never sleeps, it reacts the instant something changes, and it reports everything it sees in a form the rest of the system can act on.
📖 The War Story
Situation: "We had a critical data ingestion pipeline that relied on partners dropping CSV files into a specific SFTP directory. This was the entry point for our entire analytics workflow."
Challenge: "Our old system was a fragile cron job that scanned the directory every 15 minutes. It was unreliable: if the cron job failed, data was delayed for hours, and if a file was only half-written when the scan ran, we ingested corrupted data. We were blind to everything happening in this critical directory between scans."
Stakes: "Data delays meant our business intelligence dashboards were out of date, leading to bad decisions. Corrupted data ingestion would poison our entire data lake, requiring days of manual cleanup. The business was losing trust in our data."
✅ The Answer
My Thinking Process:
"The core problem was our polling-based, blind approach. We needed to move to an event-driven model. We needed a 'sentinel' that would react instantly and intelligently to any change in the directory, creating a reliable feedback loop. This isn't just a script; it's the first step in an intelligent automation pipeline."
What I Did: The Sentinel's Code
"Here's the robust, production-minded Python script I built to solve this. It isn't a one-off utility; it's designed to run as a reliable, long-lived service."
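A minimal sketch of such a sentinel, built on the `watchdog` library the question invites. The watch path, stability thresholds, and logger name are illustrative assumptions, and real ingestion logic would replace the log calls:

```python
import logging
import os
import time

from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
log = logging.getLogger("sentinel")


def wait_for_stable(path, checks=3, interval=1.0):
    """Treat a file as fully written once its size stops changing.

    Guards against ingesting half-written files, the failure mode
    the old cron-based scan was blind to.
    """
    last = -1
    for _ in range(checks):
        try:
            size = os.path.getsize(path)
        except OSError:
            return False  # file vanished mid-check
        if size == last:
            return True
        last = size
        time.sleep(interval)
    return False


class SentinelHandler(FileSystemEventHandler):
    """Logs every filesystem event in the watched directory."""

    def on_created(self, event):
        if event.is_directory:
            return
        log.info("created %s", event.src_path)
        if wait_for_stable(event.src_path):
            log.info("stable, ready to ingest: %s", event.src_path)
        else:
            log.warning("unstable or vanished: %s", event.src_path)

    def on_modified(self, event):
        if not event.is_directory:
            log.info("modified %s", event.src_path)

    def on_deleted(self, event):
        # Unexpected deletions are the events we most want alerts on.
        log.warning("deleted %s", event.src_path)


def main(watch_dir):
    observer = Observer()
    observer.schedule(SentinelHandler(), watch_dir, recursive=False)
    observer.start()
    log.info("sentinel watching %s", watch_dir)
    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        observer.stop()
    observer.join()


if __name__ == "__main__":
    import sys
    if len(sys.argv) > 1:
        main(sys.argv[1])
```

Run it as `python sentinel.py /path/to/drop-dir`. The structured log lines are what downstream monitoring consumes, and the size-stability check is what turns "a file appeared" into "a file is safe to ingest."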
The Outcome:
"By replacing the cron job with this event-driven sentinel, we eliminated the 15-minute data lag. Our pipeline became real-time. The structured logs fed directly into our monitoring system, giving us an instant audit trail of every file delivery and alerting us to any unexpected deletions. We went from a reactive, failure-prone system to a proactive, highly reliable one."
What I Learned:
"I learned that the most robust systems are built on simple, efficient feedback loops. This script isn't complex, but its value is immense because it creates a direct, real-time connection between the state of the filesystem and our business logic. It's a point of high leverage."
🎯 The Memorable Hook
"A file system isn't just a place to store data; it's a dynamic event stream representing reality. A great engineer doesn't poll reality every 15 minutes; they subscribe to its updates. This script is a subscription to reality."
This connects a simple technical implementation to a profound, first-principles view of systems, information, and reality.
💭 Inevitable Follow-ups
Q: "How would you make this script production-ready to run as a long-lived service?"
Be ready: "I'd containerize it with Docker for portability. I'd manage it with `systemd` to ensure it restarts on failure. I'd add more robust error handling and metrics to track the number of events processed and any errors in the handler itself. The logs would be shipped to a central logging platform like Elasticsearch or Loki, not just printed to stdout."
Q: "What are the limitations of this approach, especially on a very busy filesystem?"
Be ready: "The underlying OS mechanisms like `inotify` have limits on the number of watches. On an extremely busy system, the event queue in the kernel can overflow, causing you to miss events. For that scale, you might need a more distributed queuing system. However, for 99% of use cases like configuration or data drops, this approach is the most efficient and reliable."
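Those kernel limits are inspectable at runtime on Linux. A small helper like this hypothetical one (the limit names come from `/proc/sys/fs/inotify/`) lets the service log its headroom at startup:

```python
def read_inotify_limit(name="max_user_watches"):
    """Read an inotify limit from /proc (Linux only).

    Typical names: max_user_watches, max_user_instances,
    max_queued_events. Returns None on non-Linux systems or
    if the entry is unreadable.
    """
    try:
        with open(f"/proc/sys/fs/inotify/{name}") as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None
```

Logging these values when the watcher starts makes an overflowed kernel event queue a diagnosable condition rather than a silent loss of events.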
