The Data Hoarder's Fallacy: Architecting Insight, Not Just Ingesting Data

Principal Engineer · Asked at: Netflix, Stripe, Datadog

Q: Your observability costs have ballooned to 40% of your total infrastructure budget. Design a strategy to optimize these costs without losing critical visibility.

Why this matters: This is a question about economic architecture. The "40%" figure is a deliberate exaggeration to signal a deep, systemic problem. The interviewer is testing if you can move beyond simple cost-cutting tactics and re-architect a system based on the principle of *value*. This separates senior engineers from principal-level thinkers.

Interview frequency: High for Principal SRE, Staff, and FinOps-focused roles.

❌ The Death Trap

The candidate lists a few disconnected cost-saving techniques. They see the problem as a technical one to be solved with features, not as a cultural and architectural one.

"We could introduce sampling on our traces. For logs, we can drop the debug-level logs in production and use a cheaper storage tier. For metrics, we can look for high-cardinality labels and try to remove them."

This is a weak, tactical response. While each of these actions is individually valid, they are presented as a disconnected checklist. The answer lacks a unifying philosophy and fails to address the root cause: a culture of indiscriminate data collection.

🔄 The Reframe

What they're really asking: "How do you shift an organization's observability philosophy from 'collect everything just in case' to 'collect data with intent'? How do you build a system and a culture that treats the value of data as a first-class citizen, not an afterthought?"

This reframes the problem from cutting costs to increasing the ROI of every piece of data ingested. It's an economic and cultural transformation, enabled by technology.

🧠 The Mental Model

The "Zen Librarian vs. The Digital Hoarder" model. Your observability platform is a library, not a landfill.

1. The Hoarder (The Problem State): This is the organization spending 40% of its budget. It keeps every scrap of paper, every log line, every metric, believing it might be useful someday. The result is a library so full of noise that finding a valuable book is impossible, and the storage cost is bankrupting the city.
2. The Librarian (The Desired State): The librarian doesn't keep everything. They meticulously curate the collection based on value. They know which books are essential, which can be archived, and which are yesterday's newspapers that should be discarded. Their library is smaller, cheaper, and infinitely more useful.
3. The Core Principle: The goal is not to have more data, but to have better data. We must shift our focus from the *volume* of data to the *value* of the insights it provides.

📖 The War Story

Situation: "I was at a high-growth startup where the mantra was 'move fast.' For years, the unwritten rule for observability was 'when in doubt, log it out.' Our Datadog bill was growing faster than our revenue."

Challenge: "We hit an inflection point where our observability platform, meant to be our source of clarity, became a source of chaos. It was costing us a fortune, yet during incidents, engineers complained about 'drowning in data' and being unable to find the signal. Our MTTR was actually increasing despite our massive spend."

Stakes: "The CFO mandated a 50% reduction in our observability budget. The engineering team panicked, believing this would leave them blind. This created a classic standoff: Finance saw an unacceptable cost center, while Engineering saw an existential threat to reliability."

✅ The Answer

My Thinking Process:

"My first principle was that this wasn't a cost problem; it was a value problem. The 40% budget was the symptom; the disease was a culture of treating data as free. My strategy wasn't to simply cut costs, but to introduce economic thinking into our observability practices. We needed to become Zen Librarians."

What I Did: The Three-Phase Curation Process

1. Triage the Collection (Identify Waste):
We started with the biggest offenders.

  • Metrics Cardinality: I built a dashboard that didn't show system performance, but the 'cost' of our metrics. We found a single metric with a `request_id` label that was responsible for 20% of our entire metrics bill. It was a smoking gun. We worked with the team to remove the unbounded label.
  • Log Levels: We discovered that 30% of our services were shipping verbose `DEBUG` logs to production. A simple, central change to our logging library configurations to enforce `INFO` level in prod had an immediate impact.
  • Trace Sampling: We were using head-based sampling, keeping 10% of all traces. This meant we were throwing away 90% of our data, including potentially 90% of our error traces. We shifted to tail-based sampling, where we analyze 100% of traces and make an intelligent decision at the end: keep 100% of traces with errors, and only 1% of the successful 'happy path' traces. This gave us *better* visibility for *less* cost.
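The cardinality triage above can be sketched as a quick label audit: count the distinct values per (metric, label) pair and sort the worst offenders first. This is a minimal sketch, not the actual dashboard; the metric names and series shape are illustrative assumptions.

```python
from collections import defaultdict

def cardinality_report(series):
    """Count distinct label values per (metric, label) pair to flag
    unbounded labels such as request_id. `series` is an iterable of
    (metric_name, labels_dict) tuples (e.g. scraped from your TSDB's
    series API)."""
    values = defaultdict(set)
    for metric, labels in series:
        for label, value in labels.items():
            values[(metric, label)].add(value)
    # Worst offenders first: labels with the most distinct values.
    return sorted(
        ((metric, label, len(vals)) for (metric, label), vals in values.items()),
        key=lambda row: -row[2],
    )

# Illustrative data: a request_id label explodes cardinality.
sample = [
    ("http_requests_total", {"status": "200", "request_id": f"r{i}"})
    for i in range(1000)
]
report = cardinality_report(sample)
print(report[0])  # ('http_requests_total', 'request_id', 1000)
```

A report like this makes the "smoking gun" obvious: a bounded label such as `status` stays in the single digits while an unbounded one grows with traffic.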
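The tail-based sampling decision can be sketched as a policy function applied once the full trace has been assembled. The "keep all errors, keep 1% of the happy path" rule comes from the story; the trace representation (a list of span dicts with an `error` flag) is an assumed simplification.

```python
import random

KEEP_HAPPY_PATH = 0.01  # retain 1% of successful traces

def keep_trace(spans, rng=random.random):
    """Tail-based sampling: decide after the whole trace is visible.
    Always keep traces containing an error; keep only a small random
    fraction of successful ones."""
    if any(span.get("error") for span in spans):
        return True  # never drop an error trace
    return rng() < KEEP_HAPPY_PATH

# An error anywhere in the trace guarantees retention.
print(keep_trace([{"name": "db.query", "error": True}]))  # True
```

The contrast with head-based sampling is the timing of the decision: here the sampler sees the outcome before choosing, so the retained data set is biased toward exactly the traces you need during an incident.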

2. Define the Curation Policy (Shift Left):
The triage fixed the immediate bleeding. The long-term cure was cultural.

  • Observability as a Feature: I introduced a mandatory 'Observability' section in our design docs. For any new feature, engineers had to define their 'observability contract': What are the 3-5 SLIs that define success? What specific, structured log events are needed to debug this feature? This forced them to think about data with intent.
  • Cost Visibility: We gave teams a dashboard showing the observability cost of *their specific services*. This created accountability. When a team saw their service was costing $50k/month in logs, they became motivated to find efficiencies. It aligned incentives.
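The per-service cost dashboard boils down to attributing ingest volume to owning services and ranking the result. A minimal sketch follows; the per-GB price and volumes are illustrative assumptions chosen so the top entry matches the $50k/month figure above, not real vendor rates.

```python
# Hypothetical per-GB ingest price and per-service monthly log volumes.
PRICE_PER_GB = 0.10  # illustrative, not a real vendor rate

monthly_log_gb = {
    "checkout": 500_000,  # 500 TB/month -> the $50k offender
    "search": 80_000,
    "auth": 12_000,
}

def cost_by_service(volumes_gb, price=PRICE_PER_GB):
    """Turn raw ingest volume into a dollar figure per service,
    sorted so the most expensive service tops the 'leaderboard'."""
    return sorted(
        ((svc, gb * price) for svc, gb in volumes_gb.items()),
        key=lambda row: -row[1],
    )

leaderboard = cost_by_service(monthly_log_gb)
print(leaderboard[0])  # ('checkout', 50000.0)
```

The design point is accountability through attribution: once a team sees its own line item at the top of the list, the incentive to trim waste is built in.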

3. Archive, Don't Hoard (Intelligent Tiering):
Finally, we recognized that not all data has the same value over time. We implemented a multi-tier strategy.

  • Hot Tier: The last 7 days of logs and traces are kept in our primary, expensive, real-time system for active debugging.
  • Warm Tier: Data from 7 to 90 days is compressed (and, for metrics, downsampled) and moved to a cheaper, queryable object storage tier. Queries are slower, but the data remains available for deeper analysis.
  • Cold Tier: Data older than 90 days, required for compliance, is archived to Amazon S3 Glacier Deep Archive. It's incredibly cheap, but retrieval takes hours. This matched the access patterns to the cost of storage.
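The three tiers reduce to an age-based routing rule. A minimal sketch, assuming the 7- and 90-day boundaries from the story; the tier names and function shape are illustrative:

```python
HOT_DAYS = 7    # real-time system, active debugging
WARM_DAYS = 90  # cheaper queryable object storage

def storage_tier(age_days):
    """Route telemetry by age so storage cost tracks access patterns:
    hot (expensive, fast), warm (cheap, slower), cold (archive)."""
    if age_days <= HOT_DAYS:
        return "hot"
    if age_days <= WARM_DAYS:
        return "warm"
    return "cold"

print([storage_tier(d) for d in (1, 30, 365)])  # ['hot', 'warm', 'cold']
```

In practice the same rule is usually expressed declaratively, e.g. as an object-storage lifecycle policy, rather than in application code; the point is that retention follows a value curve, not a single flat window.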

The Outcome:

"By implementing this strategy, we reduced our observability spend by over 60%, taking it from 40% of our infra budget down to a manageable 15%. But the most important result was that our Mean Time To Recovery (MTTR) actually decreased by 25%. By eliminating the noise, engineers could find the signal faster during incidents. We had less data, but more insight."

What I Learned:

"I learned that in observability, you are what you measure. A culture that measures and rewards data volume will become a data hoarder. A culture that measures and rewards the value of insights will become a library. The technical tools are secondary; the primary task is to change the economic incentives that shape how engineers think about the data they produce."

🎯 The Memorable Hook

"Data is not the new oil; uncurated data is the new landfill." This is a classic Naval-style reframing: it takes a common assumption ("data is the new oil") and inverts it, showing a deeper, first-principles understanding of value.

💭 Inevitable Follow-ups

Q: "How do you get buy-in from hundreds of developers to change their logging and metrics habits?"

Be ready: "You don't get buy-in through mandates; you get it by aligning incentives. The key was giving teams ownership and visibility of their own costs. We created a 'Top 10 Most Expensive Metrics' leaderboard. No one wanted to be on it. We also made it easy to do the right thing by providing centralized, well-documented logging libraries with cost-effective defaults. We made the paved road the path of least resistance."

Q: "With aggressive sampling and tiering, how do you protect against losing the one piece of data you'll need for a future, unknown problem?"

Be ready: "That's the fundamental trade-off. You can't eliminate that risk entirely, but you manage it intelligently. Our tail-based sampling is risk-aware—it never throws away an error. Our tiering is about access latency, not deletion; the data is still there, just slower to get. The real answer is that the cost of keeping everything 'just in case' is a guaranteed, massive expense, while the risk of needing a specific log from 6 months ago in an emergency is a low-probability event. We are making a rational, economic bet."

Written by Benito J D