The Accountant vs. The Detective: How to Count What Actually Matters in Logs

Junior/Mid Engineer · Asked at: FAANG, Startups, Cloud Providers

Q: Count how many times the word "failed" appears in syslog.

Why this matters: This isn't just a syntax quiz. It's a test of precision. In system monitoring and incident response, metrics are everything. An inaccurate count can lead to a false sense of security or a panic-fueled overreaction. Your ability to generate an *accurate* number from noisy data is a core engineering skill.

Interview frequency: Very high. A foundational command-line task.

❌ The Death Trap

The candidate provides the most common, but subtly wrong, answer: `grep -c "failed" /var/log/syslog`. They mistake a line count for a word count and fail to consider edge cases. This is the accountant who just counts transactions without verifying them.

"The easy answer is `grep -c "failed" /var/log/syslog`. But what if a line says: `Login failed for user 'root'. This incident has been reported as a failed attempt.` The `-c` flag counts this line once, but the word 'failed' appears twice. This is how metrics become lies."
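The divergence is easy to reproduce. A minimal sketch, using `printf` to stand in for the log file:

```shell
# The problem line from above: the word "failed" appears twice on one line
line="Login failed for user 'root'. This incident has been reported as a failed attempt."

# -c counts matching *lines*: prints 1
printf '%s\n' "$line" | grep -c "failed"

# -o prints each match on its own line; wc -l counts them: prints 2
printf '%s\n' "$line" | grep -o "failed" | wc -l
```

One line, two occurrences: the two commands disagree by a factor of two on a single log entry.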

🔄 The Reframe

What they're really asking: "I need an accurate metric to make a decision. Are you going to give me a quick, dirty number, or are you going to give me the *correct* number? Do you account for the messiness of real-world data?"

This reveals your intellectual rigor. Do you take the question literally, or do you think about the *intent* behind the question? They want a detective who verifies the evidence, not just an accountant who tallies it.

🧠 The Mental Model

I use a simple three-step process I call "Define, Refine, Count." It ensures I'm always measuring what's actually intended.

1. Define the Target: Are we counting a string or a word? Is it case-sensitive? Clarify the exact entity. ("failed" vs "FAILED" vs "failure").
2. Refine the Search: Use `grep` flags to enforce the definition. Use `-w` for whole words, `-i` for case-insensitivity, and `-o` to isolate just the matches.
3. Count the Results: Once the search is precise, pipe the isolated matches to a counting tool like `wc -l`.
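As a sketch, the three steps map directly onto `grep` flags. The sample file and its contents below are hypothetical, standing in for a real syslog:

```shell
# Hypothetical sample log (stands in for /var/log/syslog)
printf 'Auth FAILED for bob\njob unfailed_cleanup ok\nlogin failed; retry failed\n' > /tmp/demo.log

# Define:  the whole word "failed", in any case
# Refine:  -i case-insensitive, -w whole words only, -o one match per output line
# Count:   wc -l tallies the isolated matches
grep -oiw "failed" /tmp/demo.log | wc -l   # 3 (FAILED + failed + failed; unfailed_cleanup excluded)
```

Note that `-w` correctly rejects `unfailed_cleanup`, and `-o` catches both occurrences on the last line.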

📖 The War Story

Situation: "We had a security alert trigger for an unusually high number of 'failed' SSH login attempts on a bastion host. The on-call engineer used `grep -c "failed" /var/log/auth.log` and the number was over 10,000 in an hour."

Challenge: "Panic set in. The standard procedure for a suspected brute-force attack of this magnitude was to firewall the subnet, which would cut off access for our entire engineering team in that region."

Stakes: "We were faced with a terrible choice: risk a massive security breach, or execute a scorched-earth response that would halt all development and deployment in the middle of the workday, costing thousands in lost productivity."

✅ The Answer

My Thinking Process:

"My first instinct was that the number felt wrong. An attacker that noisy would have been caught by other systems. The metric itself was likely flawed. I needed to apply the 'Define, Refine, Count' model to get the true picture."

What I Did:

"The naive approach was causing the panic:"

# Incorrect: Counts lines, not occurrences. Case-sensitive. Matches substrings.
grep -c "failed" /var/log/auth.log

"I took a more precise approach. I wanted to count the exact, case-insensitive word 'failed', and I wanted to count every single occurrence, not just the lines containing it."

This is the command that gets the true number:

# Correct: Counts every individual occurrence of the whole word "failed", case-insensitively.
grep -oiw "failed" /var/log/auth.log | wc -l

"Here's the breakdown for the interviewer:

  • grep: The tool to find the text.
  • -o: The detective's flag. It prints only the matched text, with each match on its own line, isolating the evidence. This is the key to moving from a line count to an occurrence count.
  • -i: Makes the search case-insensitive, matching 'failed', 'Failed', and 'FAILED'.
  • -w: Ensures we match the whole word, ignoring something like 'unfailed_job'.
  • /var/log/auth.log: The log file we're investigating.
  • | wc -l: We pipe the output to `wc` (word count) and use the `-l` flag to count the number of lines. Since `-o` put each match on its own line, this now gives us a perfect count of occurrences.

The Outcome:

"The new, accurate count was just 212. It turned out a misconfigured monitoring script was logging a verbose message that contained the string 'detailed_check_unfailed' on every run, which the naive `grep` was counting. We didn't have a brute-force attack; we had a noisy script. We avoided a costly, disruptive firewall change and fixed the actual problem."

What I Learned:

"Precision in measurement is a form of risk management. A bad metric is more dangerous than no metric because it gives you the confidence to make the wrong decision. I learned to never trust a number until I understand exactly how it was generated."

🎯 The Memorable Hook

The business runs on dashboards and metrics. As engineers, we are the source of truth for those metrics. Providing a precise, well-understood number is a foundational responsibility. Providing a sloppy one is an act of sabotage, intentional or not.

💭 Inevitable Follow-ups

Q: "How would you count the unique IP addresses that had a failed login?"

Be ready: "This is about composing tools. `grep 'failed' auth.log | grep -oE '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' | sort | uniq -c | sort -nr`. Explain the pipeline: find the failed lines, extract IPs with the regex, sort so duplicates are adjacent (which `uniq` requires), count occurrences per unique IP with `uniq -c`, and sort numerically so the worst offenders come first."
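A quick way to sanity-check that pipeline is a tiny fabricated log. The IPs and messages below are made up for illustration:

```shell
# Hypothetical sample; real data would be /var/log/auth.log
printf 'sshd: failed password from 10.0.0.5\nsshd: failed password from 10.0.0.5\nsshd: failed password from 192.168.1.9\n' > /tmp/auth_sample.log

# Find failed lines, extract IPs, tally per unique IP, worst offenders first
grep 'failed' /tmp/auth_sample.log \
  | grep -oE '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' \
  | sort | uniq -c | sort -nr
#   2 10.0.0.5
#   1 192.168.1.9
```

The repeated IP floats to the top with its count, which is exactly the shape you want when hunting a brute-force source.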

Q: "The logs are compressed (`.gz`). How would you search them without uncompressing them first?"

Be ready: "You'd use `zgrep`, which is the `grep` equivalent for gzipped files. The syntax is identical: `zgrep -oiw "failed" /var/log/syslog.1.gz | wc -l`. It streams the decompression rather than writing a temporary file to disk, which is efficient."
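A minimal sketch to verify this end to end; the file and its contents are made up for the demo:

```shell
# Create a hypothetical compressed log
printf 'backup failed\nall ok\nrestore failed twice: failed, then Failed\n' | gzip > /tmp/demo.log.gz

# Same flags as plain grep; decompression is streamed, nothing extracted to disk
zgrep -oiw "failed" /tmp/demo.log.gz | wc -l   # 4
```

The count is 4: one on the first line, three on the last (with `-i` picking up `Failed`).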

Written by Benito J D