NFS Mounts: Are You a Patient Saint or a Pragmatic Pessimist?

Level: Mid/Senior · Asked at: Google, Netflix, Oracle, Fintech Startups

Q: What is the difference between a soft mount and a hard mount in NFS, and when would you use each?

Why this matters: This isn't a trivia question about an old protocol. It's a deep probe into your understanding of distributed system failure modes. Your answer reveals whether you think in terms of trade-offs, business impact, and designing for resilience.

Interview frequency: High in SRE, DevOps, and Infrastructure roles. Appears surprisingly often in senior backend interviews where systems interact with storage.

❌ The Death Trap

95% of candidates fall into the trap of giving a dry, textbook definition. They recite the `man` page, showing they can memorize but not reason.

"A hard mount will retry an NFS request indefinitely until the server responds. A soft mount will return an error to the calling application after a timeout. You use hard for data integrity and soft when you need responsiveness."

This is correct, but utterly devoid of insight. It signals you've read the definition, not that you've grappled with the consequences.

🔄 The Reframe

What they're really asking is: "Describe your philosophy for handling an unresponsive network dependency. Do you choose to fail-stop the client, or do you pass the failure up to the application to handle?"

This reveals your ability to reason from first principles about failure domains. It shows you understand that technical choices are actually business-risk decisions in disguise.

🧠 The Mental Model

Don't think of them as mount options. Think of them as two competing philosophies of failure.

1. The Patient Saint (Hard Mount): This philosophy believes the server's state is the absolute truth. It will wait with infinite patience for the server to return. It prioritizes data consistency above all else. Its motto: "It is better to wait forever than to proceed with uncertainty." The application's life is secondary to data integrity.
2. The Pragmatic Pessimist (Soft Mount): This philosophy assumes the network is unreliable and servers can die. It has no time for waiting. It prioritizes application availability. Its motto: "An error I can handle is better than an infinite hang." It pushes the burden of dealing with failure onto the application code.
3. The Deciding Question: Ask yourself: "Where is it cheaper and safer to handle the failure? In the operating system kernel, or in my application logic?" The answer to this determines your choice.
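If it helps to make the two philosophies concrete, here is a minimal user-space analogy in Python. It is not how the kernel NFS client is implemented; `hard_call`, `soft_call`, `flaky`, and `ServerUnavailable` are hypothetical names invented for this sketch. The only point is the shape of the two policies: one never returns an error, the other returns one after a bounded number of retries.

```python
import time


class ServerUnavailable(Exception):
    """Raised by the simulated server call while the server is down."""


def hard_call(op, delay=0.0):
    """'Patient Saint': retry forever until the operation succeeds."""
    while True:
        try:
            return op()
        except ServerUnavailable:
            time.sleep(delay)  # the caller stays blocked; no error ever surfaces


def soft_call(op, retries=2, delay=0.0):
    """'Pragmatic Pessimist': give up after a bounded number of retries."""
    for attempt in range(retries + 1):
        try:
            return op()
        except ServerUnavailable:
            if attempt == retries:
                # Hand the failure to the application instead of waiting.
                raise OSError("I/O error: server did not respond in time")
            time.sleep(delay)


def flaky(failures):
    """Build an operation that fails `failures` times, then returns 'data'."""
    remaining = [failures]

    def op():
        if remaining[0] > 0:
            remaining[0] -= 1
            raise ServerUnavailable()
        return "data"

    return op
```

With an outage of five failed attempts, `hard_call(flaky(5))` eventually returns the data, while `soft_call(flaky(5), retries=2)` raises `OSError` and leaves the next move to the caller.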

📖 The War Story

Situation: "At a previous fintech company, we had a critical end-of-day settlement system. A cluster of worker nodes needed to read multi-gigabyte trade reconciliation files from a central storage server and write back signed confirmation files. The files were exposed to the workers via an NFS share."

Challenge: "The storage server, while reliable, was a potential single point of failure. During rare network blips or maintenance on the storage filer, the NFS server could become unresponsive for anywhere from 30 seconds to a few minutes."

Stakes: "The stakes were astronomical. If a worker read a file partially, performed a calculation, and wrote a bad confirmation, it could lead to millions of dollars in settlement errors. A silent data corruption event was the absolute worst-case scenario. A delayed settlement was acceptable; a wrong one was a company-ending event."

✅ The Answer

My Thinking Process:

"My first thought was to apply the 'Patient Saint vs. Pragmatic Pessimist' model. What's the cost of failure here? For a financial settlement, data integrity is non-negotiable. A silent data corruption is the nightmare scenario. A process that simply stops and waits—a 'fail-stop' behavior—is actually a *feature* in this context. It's a loud, obvious failure mode that prevents a much worse, subtle one. So I immediately leaned towards the 'Patient Saint' philosophy."

What I Did:

"We explicitly chose to use a `hard` mount. We configured the mount options with `intr` (on older kernels) to allow the hung process to be killable by a signal, giving us an escape hatch if needed. The entire system's design embraced this choice. We built our monitoring around detecting D-state (uninterruptible sleep) processes on the worker nodes. An alert for a hung process wasn't a sign of a bug; it was the designed-in signal that the NFS server needed immediate attention."
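The story does not say how the D-state monitoring was built, so the following is only a sketch of one way to do it on Linux: read each `/proc/<pid>/stat` file and report processes whose state field is `D`. The function names `parse_stat` and `find_d_state` are made up for this example.

```python
import os


def parse_stat(stat_text):
    """Return (pid, command, state) from the text of /proc/<pid>/stat.

    The command name is wrapped in parentheses and may itself contain
    spaces or ')', so split on the last ')' rather than on whitespace.
    """
    open_paren = stat_text.index("(")
    close_paren = stat_text.rindex(")")
    pid = int(stat_text[:open_paren])
    command = stat_text[open_paren + 1:close_paren]
    state = stat_text[close_paren + 1:].split()[0]
    return pid, command, state


def find_d_state(proc_root="/proc"):
    """List (pid, command) for every process currently in state 'D'."""
    hung = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directories are processes
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, command, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or entry unreadable; skip it
        if state == "D":
            hung.append((pid, command))
    return hung
```

A process can sit in D state briefly during ordinary disk I/O, so a real alert would probably fire only when the same PID shows up across several consecutive samples.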

The Outcome:

"We never experienced a single data corruption incident from this system in two years. We did have three incidents where settlements were delayed by 5-10 minutes because processes hung due to network issues. In each case, our monitoring fired instantly, the on-call SRE fixed the underlying network problem, and the processes resumed exactly where they left off, completing their work correctly. We traded a small amount of availability for absolute data integrity, which was the correct business trade-off."

What I Learned:

"I learned that the most mature system design isn't about preventing all failures—it's about deliberately choosing your failure mode. A `hard` mount allowed us to choose a loud, simple, and safe failure (a hung process) over a quiet, complex, and catastrophic one (data corruption)."

🎯 The Memorable Hook

The choice between `hard` and `soft` is not just a technical flag; it's a decision about where you place the burden of correctness. Do you put it on the developer to write complex error-handling code for every single file I/O? Or do you put it on the operations team to build robust monitoring for a simpler, fail-stop system?

💭 Inevitable Follow-ups

Q: "What are the specific dangers of using a soft mount, especially with write operations?"

Be ready: Explain that with a soft mount, a write operation can time out. The client NFS driver returns an error, but you have no guarantee whether the write reached the server before the timeout. Because the client caches writes, that error may also surface only on a later `fsync()` or `close()`, so an application that ignores those return values can lose data silently. Retrying a non-idempotent write, such as an append, can then corrupt the file.
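To show what "the application must handle it" looks like for writes, here is a minimal sketch that treats a write as confirmed only if `write`, `fsync`, and `close` all succeed. The helper name `write_and_confirm` is hypothetical, and the sketch assumes the caller has its own plan for what to do when the result is `False`.

```python
import os


def write_and_confirm(path, data):
    """Write bytes and return True only if write, fsync, and close all succeed.

    Returns False when any step raises OSError. In that case the caller
    must treat the file's state on the server as unknown, not as unwritten.
    """
    try:
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    except OSError:
        return False
    ok = True
    try:
        view = memoryview(data)
        while view:
            written = os.write(fd, view)  # may write fewer bytes than asked
            view = view[written:]
        os.fsync(fd)  # push cached data out; deferred errors can surface here
    except OSError:
        ok = False
    finally:
        try:
            os.close(fd)  # close can also report a deferred write error
        except OSError:
            ok = False
    return ok
```

Note what a `False` result does not tell you: whether the server holds none, some, or all of the data. That ambiguity is the real cost of a soft mount for writes.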

Q: "You mentioned `intr`. What about `nointr`? And how have things changed in modern kernels?"

Be ready: Show your deep knowledge. Mention that, per the `nfs(5)` man page, `intr` and `nointr` are deprecated after kernel 2.6.25 and are ignored if specified. Explain that on those kernels only SIGKILL can interrupt a pending NFS operation, so a process stuck on a hard mount can still be killed, but it cannot catch the signal and clean up. This demonstrates your knowledge is current, not from a 10-year-old textbook.
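If you want to back that answer with something practical, a small audit of mount option strings (for example, the options column of `/etc/fstab`) can flag the deprecated flags. This is a sketch under assumptions: `audit_nfs_options` is a hypothetical helper, and the wording of its notes is mine, not output from any real tool.

```python
DEPRECATED = {"intr", "nointr"}


def audit_nfs_options(options):
    """Return a list of notes about a comma-separated NFS mount option string."""
    opts = [o.strip() for o in options.split(",") if o.strip()]
    # Keep only the option name, dropping any "=value" part (e.g. timeo=600).
    names = {o.split("=", 1)[0] for o in opts}
    notes = []
    for name in sorted(names & DEPRECATED):
        notes.append(f"'{name}' is deprecated and ignored on kernels after 2.6.25")
    if "soft" in names and "hard" in names:
        notes.append("both 'soft' and 'hard' are set; keep only one")
    elif "soft" in names:
        notes.append("soft mount: I/O can fail with an error after retries are exhausted")
    return notes
```

For example, `audit_nfs_options("rw,hard,intr")` returns a single note about `intr`, while `audit_nfs_options("rw,hard,timeo=600,retrans=2")` returns an empty list.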

Q: "When would you ever use a soft mount then?"

Be ready: Use cases where data loss is acceptable and availability is paramount. Think read-only mounts of non-critical data (e.g., configuration files that have a local default, or a directory of status indicators where a missed read isn't catastrophic). The key is that the application must be written to gracefully handle an I/O error on any read.
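The "configuration file with a local default" case might look like the sketch below. It assumes the config is JSON and that a built-in default is acceptable; `load_config` and the example path are hypothetical.

```python
import json


def load_config(path, default):
    """Read JSON config from `path`, falling back to `default` on failure."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    except OSError:
        # Covers a soft-mount I/O error as well as a missing or unreadable file.
        return default
    except ValueError:
        # Malformed or truncated content should not be trusted either.
        return default
```

Usage would be something like `load_config("/mnt/shared/app.json", {"log_level": "info"})`. The design choice is that the application keeps running on stale or default settings rather than hanging on the share.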

🔄 Adapt This Framework

If you're junior: You may not have a war story. Focus on the mental model. "I haven't had to make this decision in production, but here's how I would reason about it using the 'Patient Saint vs. Pragmatic Pessimist' framework. For a system like a user database, I would choose X because... For a system like a cache warmer, I would choose Y because..."

If you're senior: Expand the scope. Talk about the "blast radius." "The choice also depended on our service architecture. Since these workers were idempotent and could be safely killed and restarted by a scheduler, a hard mount's 'fail-stop' behavior was contained. In a monolithic architecture where a hung thread could cascade and take down the entire application, the risk calculation would be different."

If you lack this experience: Bridge to a concept you know. "This trade-off between fail-stop and fail-fast reminds me of choosing TCP vs. UDP. With TCP, the stack keeps retransmitting on your behalf to give you reliable, ordered delivery, much like a hard mount's integrity promise. With UDP, you get speed but have to build your own reliability in the application layer, much like a soft mount requires application-level error handling."

Written by Benito J D