The Ghost in the Network: What Happens When Kubernetes' Routing Rules Vanish?

Staff/Principal Engineer level · Asked at: Google, Microsoft, Amazon, Cloudflare

Q: Walk me through what happens inside kube-proxy when IPVS rules vanish mid-traffic.

Why this matters: This is not a trivia question. It's a surgical probe into your understanding of the deepest principle of Kubernetes: the reconciliation loop. It separates engineers who *use* Kubernetes from those who *understand* it at a first-principles level. Your answer reveals if you can debug silent, baffling failures in a complex distributed system.

Interview frequency: Rare, but a career-defining question at the highest levels.

❌ The Death Trap

The candidate either has no idea, or gives a vague, hand-wavy answer that shows a surface-level understanding. They focus on the symptom (traffic fails) without explaining the self-healing mechanism.

"Uh, the traffic would fail. New connections to the Service's ClusterIP would be dropped. Then, I guess kube-proxy would eventually notice and add the rules back."

This is a guess, not an explanation. It misses the *how*, the *why*, and the *how fast*. It signals a critical gap in understanding Kubernetes' core control loop philosophy.

🔄 The Reframe

What they're really asking: "Describe the mechanics and philosophy of Kubernetes' self-healing data plane. Can you reason about the behavior of a desired-state reconciliation loop when the actual state is forcibly diverged from the desired state?"

This reframes the question from a specific failure mode to a general principle of distributed systems design. It's a test of your ability to think about resilience, convergence, and the interplay between a control plane and a data plane.

🧠 The Mental Model

The "Self-Healing Road Map" model. Think of a city where the road signs are digital and can be updated centrally.

1. The API Server is the Master Map. It is the single source of truth for where all destinations (Pods) are.
2. `kube-proxy` is the Digital Cartographer. It lives on every street corner (Node). It constantly watches the Master Map for changes.
3. IPVS Rules are the Digital Road Signs. The Cartographer (`kube-proxy`) programs the local intersection (the kernel's IPVS module) with routing rules. "Traffic for 'Service A' should be sent to one of these three addresses."
4. The Question: What happens if a vandal comes along and rips out all the road signs at a single intersection? How does the city not descend into chaos?

📖 The War Story

Situation: "We were running a high-frequency trading platform on a bare-metal Kubernetes cluster. Every microsecond of latency mattered."

Challenge: "We started seeing bizarre, intermittent 'blackouts' lasting between 30 and 45 seconds. A critical pricing service would simply become unreachable from other pods on one specific node. There were no logs, no crashes, no CPU spikes. Packets would just vanish. It was a ghost in the network."

Stakes: "Every second of this blackout meant we were falling behind the market, leading to potentially millions in lost trade opportunities. It was a silent, high-urgency failure, the worst kind of problem to debug."

✅ The Answer

My Thinking Process:

"This isn't just about traffic failing; it's a story about a race condition between reality and a system's belief about reality. The key is to understand that `kube-proxy` doesn't just set rules and forget; it operates in a continuous, relentless reconciliation loop. The duration of the blackout is the answer to 'how fast is that loop?'"

The Blow-by-Blow Account of the Blackout

1. The Disappearance: At time T=0, something—a person running `ipvsadm --clear`, a rogue script, or even a kernel bug—wipes the IPVS ruleset on a single node. The digital road signs are gone.

2. The Immediate Impact: Two things happen simultaneously.
   - For existing, long-lived connections: These may actually *survive* for a time. The kernel's connection tracking (`conntrack`, plus IPVS's own per-connection table) has already established the NAT mapping for these flows. It's like a car that already knows its full route and doesn't need to look at the signs anymore.
   - For new connections: This is where the failure occurs. A pod tries to connect to a Service `ClusterIP`. The packet arrives at the node's kernel. The kernel looks at its (now empty) IPVS ruleset. It has no idea where to forward this packet. The packet is dropped. It is black-holed. This is why there are no error logs; the failure is silent at the network layer.
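The asymmetry above can be shown with a toy model of the kernel's packet path (illustrative only, not real kernel code; the pod names and IPs are hypothetical): established flows bypass the IPVS lookup via their existing connection entry, so only new connections hit the empty ruleset.

```python
# Toy model: conntrack entries survive a rule wipe; new flows do not.
conntrack = {("pod-a", "10.96.0.10:80"): "10.244.1.5:8080"}  # established flow
ipvs_rules = {}  # the ruleset after something like `ipvsadm --clear` wiped it

def route(src, cluster_ip):
    # Established flows already have a NAT mapping; they keep working.
    if (src, cluster_ip) in conntrack:
        return conntrack[(src, cluster_ip)]
    # New flows need an IPVS virtual-server entry; with none, the packet
    # is silently dropped -- no error, no log, a black hole.
    backends = ipvs_rules.get(cluster_ip)
    if not backends:
        return None  # black-holed
    return backends[0]

print(route("pod-a", "10.96.0.10:80"))  # existing flow -> 10.244.1.5:8080
print(route("pod-b", "10.96.0.10:80"))  # new flow -> None (dropped silently)
```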

3. The Reconciliation Loop Kicks In: `kube-proxy` is not watching the IPVS rules in real time; that would be too inefficient. It re-syncs immediately when it sees Service or Endpoint changes from the API server, but nothing changed on the API side here, so recovery waits on the periodic timer: the configurable `--ipvs-sync-period` (default 30 seconds). This timer dictates how often the 'Cartographer' checks its work. `kube-proxy` holds the *desired state* in memory, which it builds by watching the API server for Service and Endpoint changes. On every tick of the sync period, it does the following:

  1. It queries the kernel via the `netlink` interface to get the *actual state* of all current IPVS rules.
  2. It performs a diff between its in-memory desired state and the actual state it just read. In this case, the diff is massive: all the rules are missing.
  3. It then iterates through the diff and issues the necessary netlink calls (the programmatic equivalent of `ipvsadm` commands) to add the missing virtual servers and real servers back into the kernel.

4. Service is Restored: At T=30s (or whatever the `syncPeriod` is), the rules are back. The next new connection that arrives at the kernel is correctly routed to a backend pod. The blackout ends. The ghost vanishes.
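The diff-and-repair step above can be sketched minimally in Python (the real implementation is Go inside kube-proxy and talks netlink directly; the function and variable names here are illustrative, not kube-proxy's actual API):

```python
# Minimal sketch of a desired-state reconciliation step: compute what must
# be added or removed so that the actual ruleset converges to the desired one.

def reconcile(desired, actual):
    """Return (rules to add/update, stale rules to remove)."""
    to_add = {svc: backends for svc, backends in desired.items()
              if actual.get(svc) != backends}
    to_remove = [svc for svc in actual if svc not in desired]
    return to_add, to_remove

# Desired state, built from API-server watches (hypothetical Service/Pod IPs).
desired = {"10.96.0.10:80": ["10.244.1.5:8080", "10.244.2.7:8080"]}
# Actual state, read back from the kernel: everything was wiped at T=0.
actual = {}

to_add, to_remove = reconcile(desired, actual)
print(to_add)     # the diff is "massive": every rule must be re-added
print(to_remove)  # nothing stale to remove: []
```

The key property is that the loop is level-triggered, not edge-triggered: it doesn't need to know *why* the rules vanished, only that the actual state differs from the desired state.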

What I Learned & How We Fixed It:

"We eventually discovered an obscure interaction with a security agent we were running that would flush IPVS rules under certain conditions. The key insight was realizing the duration of our outage perfectly matched our `kube-proxy` sync period. We fixed it by tuning the agent, but the deeper learning was about the system's inherent resilience. The failure wasn't that the rules vanished; the success was that Kubernetes is designed with the core assumption that the data plane will drift, and it relentlessly, unemotionally corrects it."

🎯 The Memorable Hook

"The rules vanishing wasn't the failure; the 30-second blackout was the reconciliation loop's heartbeat made visible. Kubernetes assumes the data plane will drift, and it relentlessly corrects it."

This frames the technical mechanism in a philosophical context, demonstrating a deep understanding of the system's core design principles.

💭 Inevitable Follow-ups

Q: "How is this different in `iptables` mode?"

Be ready: "The principle is identical, but the mechanism is different. `kube-proxy` would re-sync by rewriting the `KUBE-SERVICES` chain and other related chains in the `nat` table. `iptables` rulesets grow with the number of Services and endpoints and are restored by rewriting entire chains, which makes them slower to update at scale; that's a key reason large clusters prefer IPVS. The outage would manifest the same way, but the recovery sync itself could take longer on a rule-heavy node."

Q: "How would you debug this in real-time if you suspected it was happening?"

Be ready: "You need a loop that samples the actual state. I'd run `watch -n 1 'ipvsadm -L -n'` on the suspect node to see the rules disappear in real-time. Simultaneously, I'd run `tcpdump` listening for traffic to the `ClusterIP` to confirm packets are arriving but not being forwarded. Finally, I'd check the `kube-proxy` logs with a high verbosity level (`-v=4` or higher) to see the logs from its sync loop when it detects the discrepancy and re-programs the rules."

Written by Benito J D