Architecting Immortality: A Backup Strategy for a 50TB/Day Distributed Database
Q: Design a backup and recovery strategy for a large-scale, distributed database. It processes 50TB of new data and transactions daily, must maintain strict ACID compliance, and is deployed across multiple geographic regions.
Why this matters: This is a question about your ability to architect for survival. It's not a quiz on backup tools; it's a test of your first-principles thinking about risk, consistency, and business continuity at massive scale. Your answer reveals whether you can build a system that is merely functional, or one that is truly resilient.
Interview frequency: High for Principal SRE, Staff Software Engineer, and Database Architect roles.
❌ The Death Trap
The candidate gives a single, tool-based answer. They focus on one mechanism without considering the different failure modes or the business requirements that drive the strategy.
"I would use disk-level snapshots. We'd take a snapshot of all the database nodes every night during a low-traffic period. For recovery, we would restore the volumes from the last good snapshot."
This answer is dangerously naive. It completely ignores ACID compliance across nodes, the massive data volume (nightly snapshots are too slow), and the different recovery needs (RPO/RTO).
🔄 The Reframe
What they're really asking: "How do you design a layered defense against data loss and downtime? Can you articulate the economic trade-offs between recovery time (RTO), data loss tolerance (RPO), and cost? And how do you solve the distributed consistency problem for backups?"
This reframes the problem from "how to back up" to "how to engineer resilience." It forces you to think about failure as a spectrum and to design a portfolio of solutions, each tailored to a specific risk.
🧠 The Mental Model
The "System's DNA and its Clones" model. A backup strategy for a complex system is like preserving the genetic code of a species: you need the living population (active replicas), preserved specimens (consistent snapshots), and a seed vault (immutable, air-gapped archives). No single layer survives every threat, so you need multiple, independent layers of protection.
📖 The War Story
Situation: "I was the lead SRE for a global payments platform. Our core datastore was a multi-region, sharded CockroachDB cluster, handling tens of thousands of transactions per second."
Challenge: "A new enterprise customer required us to guarantee a Recovery Point Objective (RPO) of less than 5 minutes and a Recovery Time Objective (RTO) of less than 1 hour for any failure scenario, including logical data corruption. Our existing nightly backup strategy had an RPO of 24 hours. It was a complete non-starter."
Stakes: "This was a make-or-break deal worth eight figures in annual revenue. Failure to design a compliant strategy meant losing the deal and signaling to the market that our platform wasn't ready for the enterprise."
✅ The Answer
My Thinking Process:
"The first principle here is that there is no single solution. A 'one size fits all' backup strategy for a system this complex is a recipe for disaster. My approach was to design a layered defense, with each layer optimized for a different RPO/RTO and failure type, all while respecting the sanctity of ACID compliance."
What I Did: Architecting the Layered Defense
Layer 1: Continuous Protection for a 5-Minute RPO.
To meet the tightest RPO, we needed something better than snapshots. We implemented Point-in-Time Recovery (PITR). This works by continuously archiving the database's transaction logs (the Write-Ahead Log or WAL) to cloud storage like S3. In case of a failure, we could restore the last full snapshot and then 'replay' the transaction logs up to the minute before the incident. This gave us a near-continuous backup, easily meeting the <5 minute RPO for scenarios like accidental data deletion.
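The PITR mechanic can be sketched in a few lines: restore the last full snapshot, then replay archived log entries up to a target timestamp. This is a minimal illustration, not a real database API; the `LogEntry` class and key names are hypothetical.

```python
# Hypothetical sketch of point-in-time recovery: start from a base snapshot,
# then replay archived transaction-log (WAL) entries up to a target timestamp.
class LogEntry:
    def __init__(self, ts, key, value):
        self.ts, self.key, self.value = ts, key, value

def restore_to_point_in_time(base_snapshot, archived_log, target_ts):
    """Rebuild state as of target_ts: base snapshot + replayed log entries."""
    state = dict(base_snapshot)          # start from the last full snapshot
    for entry in archived_log:           # WAL entries arrive in commit order
        if entry.ts > target_ts:         # stop just before the incident
            break
        state[entry.key] = entry.value   # apply each committed write
    return state

# Example: snapshot at t=0, writes at t=10/20/30; recover to t=25,
# discarding the corrupting write at t=30.
snapshot = {"balance:alice": 100}
wal = [LogEntry(10, "balance:alice", 90),
       LogEntry(20, "balance:bob", 50),
       LogEntry(30, "balance:alice", 0)]   # the bad write we want to skip
recovered = restore_to_point_in_time(snapshot, wal, target_ts=25)
```

The RPO is simply the lag between a write committing and its log segment landing in cloud storage, which is why continuous archiving beats any snapshot cadence.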
Layer 2: Globally Consistent Snapshots for Logical Corruption.
The hardest problem was obtaining a transactionally consistent snapshot across dozens of nodes in multiple regions. Taking a disk snapshot of each node at the same wall-clock time is not consistent: an in-flight transaction can be captured on some nodes but not others, leaving the backup in a state that never actually existed. We solved this by leveraging the database's internal transaction ordering. Our backup orchestrator would:
- Pick a future global timestamp (e.g., T+60 seconds) to act as our synchronization point.
- Instruct the database to perform a distributed, consistent backup targeting that exact logical timestamp.
- Each node would then, independently, create a snapshot of its data *as of that timestamp* and upload it to regional cloud storage.
This process gave us hourly, ACID-compliant snapshots of the entire global cluster without pausing writes, protecting us against major logical corruption events.
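The idea can be sketched with a toy MVCC model: each node keeps versioned values, and snapshotting "as of" a shared logical timestamp makes the union of per-node snapshots transactionally consistent. The node names and data model here are illustrative, not the database's real storage format.

```python
# Hedged sketch: each node independently materializes its data "as of" a
# shared logical timestamp. MVCC versions are modeled as (timestamp, value)
# lists per key; names are illustrative.
def snapshot_as_of(node_mvcc, ts):
    """Return each key's latest version with version timestamp <= ts."""
    snap = {}
    for key, versions in node_mvcc.items():
        visible = [(vts, val) for vts, val in versions if vts <= ts]
        if visible:
            snap[key] = max(visible)[1]   # newest visible version wins
    return snap

def consistent_backup(nodes, sync_ts):
    """Orchestrator: every node snapshots at the same logical timestamp."""
    return {name: snapshot_as_of(mvcc, sync_ts) for name, mvcc in nodes.items()}

# Two nodes; a single transaction at ts=105 wrote to both. Snapshotting at
# ts=100 excludes it on *both* nodes -- no torn transaction in the backup.
nodes = {
    "node-us": {"k1": [(90, "a"), (105, "a2")]},
    "node-eu": {"k2": [(95, "b"), (105, "b2")]},
}
backup = consistent_backup(nodes, sync_ts=100)
```

In a system like CockroachDB this is what a distributed backup "as of system time" effectively does; the key property is that the cut either includes a transaction on every node or on none.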
Layer 3: Air-Gapped Archives for Disaster Recovery.
To protect against a full regional failure or ransomware, our consistent snapshots were not enough. We configured our cloud storage buckets to automatically replicate their contents to a bucket in a completely different, "disaster recovery" region. We then applied an Object Lock (immutability) policy to these replicated backups for 30 days. This meant even if an attacker gained control of our primary infrastructure, they could not delete or modify our DR backups. This was our seed vault.
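The immutability policy itself is small. Below is a sketch of an S3 Object Lock configuration in the dict shape that boto3's `put_object_lock_configuration` expects; the bucket name is hypothetical and the API call is shown commented since it requires real credentials.

```python
# Hedged sketch of the DR bucket's Object Lock policy. COMPLIANCE mode
# prevents deletion or modification by *anyone*, including account admins,
# until retention expires -- the property that defeats ransomware.
DR_BUCKET = "payments-dr-backups"   # illustrative name

object_lock_config = {
    "ObjectLockEnabled": "Enabled",
    "Rule": {
        "DefaultRetention": {
            "Mode": "COMPLIANCE",   # immutable even to administrators
            "Days": 30,             # matches the 30-day policy above
        }
    },
}

# With real credentials this would be applied roughly as:
# import boto3
# boto3.client("s3").put_object_lock_configuration(
#     Bucket=DR_BUCKET, ObjectLockConfiguration=object_lock_config)
```

Note that Object Lock must be enabled at bucket creation; it cannot be retrofitted onto an existing bucket, which is worth calling out in an interview.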
Layer 4: Automated Recovery Drills for Confidence.
Finally, I argued that a backup strategy we don't test is a liability, not an asset. I built an automated pipeline in our CI/CD system that, every week, would:
- Spin up a completely new, isolated Kubernetes cluster.
- Perform a full restore of our latest consistent snapshot from the DR region.
- Run a suite of data integrity tests and performance benchmarks against the restored cluster.
- Tear the entire environment down and publish a success/failure report.
This didn't just test our backups; it continuously trained our team and proved to our customers and auditors that our 1-hour RTO was not a guess, but a verifiable reality.
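The drill loop itself can be sketched as a timed pipeline. The step functions below are stubs standing in for real tooling (cluster provisioning, the database's restore command, integrity checks); the structure is what matters: every step is timed, teardown always runs, and the drill fails loudly if the measured RTO exceeds the target.

```python
# Illustrative sketch of the weekly recovery drill; step names are hypothetical.
import time

RTO_TARGET_SECONDS = 3600   # the contractual 1-hour RTO

def provision_cluster(): pass     # stand-in: spin up isolated cluster
def restore_snapshot(): pass      # stand-in: parallel restore from DR region
def verify_integrity(): pass      # stand-in: data checks + benchmarks
def teardown(): pass              # stand-in: destroy the environment

def run_drill(steps, rto_target=RTO_TARGET_SECONDS):
    start, report = time.monotonic(), []
    try:
        for name, step in steps:
            t0 = time.monotonic()
            step()                                   # a step raises on failure
            report.append((name, time.monotonic() - t0))
        elapsed = time.monotonic() - start
        return {"passed": elapsed <= rto_target,
                "elapsed_s": elapsed,
                "steps": report}
    finally:
        teardown()                                   # always clean up

result = run_drill([("provision", provision_cluster),
                    ("restore", restore_snapshot),
                    ("verify", verify_integrity)])
```

Publishing `result` from every run is what turns the RTO from a claim into a weekly-refreshed measurement.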
What I Learned:
"This project taught me that a backup strategy is a product, not a task. Its features are defined by the RPO and RTO, and its customers are the future versions of ourselves trying to recover from a disaster. The most important lesson was that the value of a backup is zero until it's successfully restored. The investment in automated verification was the single most important part of the entire architecture."
🎯 The Memorable Hook
"A backup strategy isn't about saving data; it's about buying back time. Your RPO is the price you're willing to pay for your most valuable, non-renewable asset: the past."
This reframes a technical concept into a profound, first-principles insight about business, risk, and the nature of time itself.
💭 Inevitable Follow-ups
Q: "How do you manage the cost of storing petabytes of backup data across multiple regions with long retention periods?"
Be ready: "Through intelligent tiering. We don't keep all backups in hot, expensive storage. Our strategy used storage lifecycle policies. PITR logs were kept in S3 Standard for 7 days. Hourly consistent snapshots were moved to Infrequent Access after 14 days, and then to Glacier Deep Archive after 90 days for long-term compliance. This allowed us to meet our operational RTOs with hot data while dramatically reducing the cost of long-term retention."
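The tiering policy described above maps directly onto an S3 lifecycle configuration. Here is a sketch in the dict shape boto3's `put_bucket_lifecycle_configuration` expects; prefixes and day counts mirror the answer but should be treated as illustrative.

```python
# Hedged sketch of the lifecycle tiering policy; prefixes are hypothetical.
lifecycle_rules = {
    "Rules": [
        {   # PITR WAL segments: hot in S3 Standard, expired after 7 days
            "ID": "pitr-logs",
            "Filter": {"Prefix": "wal/"},
            "Status": "Enabled",
            "Expiration": {"Days": 7},
        },
        {   # hourly consistent snapshots: IA at 14 days, Deep Archive at 90
            "ID": "snapshots",
            "Filter": {"Prefix": "snapshots/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 14, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "DEEP_ARCHIVE"},
            ],
        },
    ]
}
```

The interview point is that retention and RTO are decoupled: only the backups you might restore operationally need hot storage; everything kept for compliance can sit in the cheapest tier.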
Q: "Your RTO is one hour. Restoring 50TB+ of data in under an hour seems incredibly challenging. How do you achieve that?"
Be ready: "You're right, a simple sequential restore would fail. Our recovery process was architected for parallelism. We used tools that could restore data from hundreds of snapshot shards in parallel to hundreds of newly provisioned database nodes simultaneously. The RTO is determined by the throughput of this parallel restore, which we optimized and verified weekly with our automated drills."
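A back-of-the-envelope calculation makes the parallelism argument concrete: restore time is total bytes divided by aggregate throughput, so it scales inversely with the number of concurrent shard-restore streams. The per-stream throughput figure below is an assumption for illustration, not a measured number from the source.

```python
# Sketch: parallel-restore RTO as a function of stream count.
# 100 MB/s per stream is an assumed object-store-to-node rate.
def restore_time_seconds(total_tb, streams, mb_per_sec_per_stream):
    total_mb = total_tb * 1024 * 1024          # TB -> MB
    return total_mb / (streams * mb_per_sec_per_stream)

# 50 TB restored over 200 parallel streams at ~100 MB/s each:
# aggregate throughput 20,000 MB/s -> roughly 44 minutes.
rto_s = restore_time_seconds(50, streams=200, mb_per_sec_per_stream=100)
rto_min = rto_s / 60
```

A single 100 MB/s stream would take over six days for the same data, which is why the restore path, not the backup path, is where the engineering effort concentrates.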
