Disaster Recovery Testing: What We Found When We Actually Tried


We had a disaster recovery plan. Detailed documentation, tested components, regular backups, infrastructure in multiple data centers. Everything looked solid on paper. Our compliance auditors were happy. Leadership felt confident.

Then we actually tried to fail over our production environment to the disaster recovery site. That’s when we discovered how many assumptions were wrong, how much documentation was outdated, and how unprepared we actually were.

This wasn’t a real disaster, fortunately. We scheduled a DR test as part of our annual compliance requirements. But if it had been real, we would have been down for much longer than planned, and some data would have been lost. That’s a sobering realization.

What the Plan Said

Our documented disaster recovery plan said we could fail over to our backup data center with a recovery point objective (RPO) of 15 minutes and a recovery time objective (RTO) of 4 hours. In other words, we’d lose at most 15 minutes of data and be back online within 4 hours.

The plan included detailed runbooks for each major system. Database replication configurations. Network failover procedures. Application deployment steps. Contact lists for key personnel. It was 60 pages of documentation that someone had spent significant time creating.

On paper, it looked comprehensive. In annual reviews, we’d walk through the documentation, verify that backup systems were running, confirm that replication was working. Everything checked out.

Nobody had actually tried to use the disaster recovery site to run production workloads in over two years.

Test Day

We scheduled the DR test for a Saturday to minimize business impact. The plan was to fail over at 8am, verify everything was working, run for a few hours, then fail back to primary systems by 2pm. Six hours total, most of which was supposed to be verification rather than recovery.

We started following the runbooks at 8am. The first problem appeared within 20 minutes. Step 3 of the database failover runbook referenced a script that didn’t exist. Someone had updated the process but not the documentation.

We found the correct script through institutional knowledge. One of the DBAs remembered the change. This worked because we were testing. If we’d been in a real disaster with the primary site unavailable and that DBA unreachable, we would have been stuck.

The Replication Lag Problem

Our databases used streaming replication to the DR site. We monitored replication lag continuously, and it consistently showed near-zero lag, which made us confident that failing over would be clean with minimal data loss.

During the test, we discovered that “near-zero lag” was measuring the wrong thing. The replication stream was current, but there were transactions that hadn’t replicated because of configuration issues we didn’t understand. When we failed over, we were missing about 30 minutes of recent data, not the 15 minutes our RPO promised.

This is the kind of thing you don’t discover until you actually try to use the backup system. The monitoring showed what we expected to see, but the monitoring was measuring replication throughput, not data completeness. The distinction matters enormously.
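
To make the distinction concrete, here’s a sketch of the kind of end-to-end check we should have had. It assumes a PostgreSQL-style streaming replication setup with a hypothetical dr_heartbeat table and placeholder connection strings; the point is to confirm that data written on the primary actually arrives on the replica, rather than trusting the reported lag.

```python
# Sketch: verify replication completeness, not just reported lag.
# Assumes PostgreSQL streaming replication; the dr_heartbeat table and
# connection strings are hypothetical placeholders.
import time
import psycopg2

PRIMARY_DSN = "host=primary-db dbname=app user=dr_check"
REPLICA_DSN = "host=dr-db dbname=app user=dr_check"

def check_replication_completeness(timeout_seconds=60):
    token = f"hb-{int(time.time())}"

    # Write a heartbeat row on the primary (committed when the block exits).
    with psycopg2.connect(PRIMARY_DSN) as primary:
        with primary.cursor() as cur:
            cur.execute(
                "INSERT INTO dr_heartbeat (token, written_at) VALUES (%s, now())",
                (token,),
            )

    # Poll the replica until the heartbeat row shows up, or give up.
    deadline = time.time() + timeout_seconds
    with psycopg2.connect(REPLICA_DSN) as replica:
        while time.time() < deadline:
            with replica.cursor() as cur:
                cur.execute("SELECT 1 FROM dr_heartbeat WHERE token = %s", (token,))
                if cur.fetchone():
                    return True
            time.sleep(2)
    return False  # Data written on the primary never reached the replica.

if __name__ == "__main__":
    ok = check_replication_completeness()
    print("replication complete" if ok else "ALERT: heartbeat missing on replica")
```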

The DNS Problem

Failing over databases and application servers was one thing. Updating DNS to point traffic to the DR site was another. Our plan said to update DNS records with 60-second TTL, and within a few minutes all traffic would route to DR.

What we didn’t account for was that many of our integrations had IP addresses hardcoded or cached in configuration. Applications that resolved hostnames through DNS worked fine. The dozen integrations that cached IP addresses kept trying to connect to the now-inactive primary site and timing out.

We spent two hours identifying all the places where IPs were cached and updating them. This wasn’t in the runbook because nobody realized it was necessary. It was undocumented technical debt that only became visible during the test.
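
The kind of audit that would have surfaced this ahead of time is simple. The sketch below isn’t our actual tooling, and the root path and file extensions are placeholders, but it shows the idea: scan configuration for hardcoded IPv4 literals and review every hit before you ever need to fail over.

```python
# Sketch: flag hardcoded IPv4 literals in config files so they can be
# reviewed before a failover. Globs and the root path are examples only.
# Crude by design: it will also match version strings and the like;
# the point is to force a human review of each hit.
import re
from pathlib import Path

IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
CONFIG_GLOBS = ["**/*.yaml", "**/*.yml", "**/*.conf", "**/*.env", "**/*.properties"]

def find_hardcoded_ips(root="."):
    findings = []
    for pattern in CONFIG_GLOBS:
        for path in Path(root).glob(pattern):
            try:
                text = path.read_text(errors="ignore")
            except OSError:
                continue  # unreadable file or a directory that matched the glob
            for lineno, line in enumerate(text.splitlines(), start=1):
                for ip in IPV4_RE.findall(line):
                    findings.append((str(path), lineno, ip))
    return findings

if __name__ == "__main__":
    for path, lineno, ip in find_hardcoded_ips("configs/"):
        print(f"{path}:{lineno}: hardcoded IP {ip}")
```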

The Third-Party Dependencies

Our disaster recovery plan covered infrastructure we control. It didn’t adequately account for third-party dependencies. We use several SaaS services that integrate with our systems. Our DR site uses different IP addresses than production. Some of these SaaS services had our production IPs allowlisted.

When we failed over, integrations with these services broke because requests were coming from unexpected IPs. We had to contact vendors to update their allowlists. On a Saturday, during a test. Some vendors weren’t responsive. If this had been a real disaster, those integrations would have been down for an unknown duration.

The lesson is that disaster recovery isn’t just about infrastructure you control. It’s about the entire system, including dependencies on external services. Your DR plan needs to account for this.

The Monitoring Gap

When we failed over, our monitoring systems… mostly stopped working. We had Datadog configured to monitor production infrastructure. The DR site had separate hosts and container instances. Monitoring coverage was incomplete.

We could see that servers were running and some basic health checks were passing. But detailed application metrics, error rates, and user experience monitoring weren’t working properly in DR. This meant we couldn’t really verify that the system was functioning correctly.

In a real disaster, this would be terrible. You’d be running on backup systems without confidence that everything was actually working. Users might be experiencing errors that you’re not seeing because monitoring is degraded.

We’ve since improved DR monitoring, but it was a gap we didn’t realize existed until we tested.

The Performance Issues

The DR site is configured with less capacity than production. The theory is that during a disaster, we accept degraded performance rather than paying for full capacity to sit idle. We sized DR for “essential operations” rather than full production load.

During testing, we couldn’t actually put full production load on the DR site to see if it would handle it. But we did some load testing with synthetic traffic. Performance was significantly worse than production. Acceptable for a temporary disaster scenario, but not something we could run on indefinitely.
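
Synthetic load testing doesn’t need to be sophisticated to expose this kind of gap. Something like the sketch below, with a placeholder endpoint and arbitrary request counts (not our actual tooling or numbers), is enough to compare DR latency against production baselines.

```python
# Sketch: crude synthetic load against a DR endpoint to compare latency
# with production baselines. URL, concurrency, and request count are placeholders.
import statistics
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

DR_URL = "https://app.dr.example.com/healthz"  # hypothetical health endpoint
CONCURRENCY = 20
REQUESTS = 500

def timed_request(_):
    start = time.monotonic()
    try:
        with urlopen(DR_URL, timeout=10) as resp:
            resp.read()
    except OSError:
        return None  # count failures separately from latency samples
    return time.monotonic() - start

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
        samples = [t for t in pool.map(timed_request, range(REQUESTS)) if t is not None]
    if samples:
        p95 = statistics.quantiles(samples, n=20)[18]
        print(f"ok={len(samples)}/{REQUESTS} "
              f"median={statistics.median(samples):.3f}s p95={p95:.3f}s")
    else:
        print("all requests failed")
```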

This raises questions about what happens if the “temporary” disaster lasts longer than expected. Can we actually run the business on the DR site for days or weeks if necessary? Probably not at full capacity. But quantifying this and communicating it to the business is important.

The Fail-Back Problem

The plan focused entirely on failing over to DR. Failing back to the primary site was treated as “reverse the process.” Turns out it’s not that simple.

Once you’ve been running on DR for several hours, you have new data that only exists at the DR site. Failing back requires synchronizing that data back to the primary site, verifying data integrity, and coordinating the cutover without losing anything.

We spent longer failing back than we did failing over. Multiple times we thought we were ready to switch back, then discovered data that hadn’t synchronized properly. This is nerve-wracking in a test. It would be far worse in a real disaster when you’re under pressure to restore normal operations.

What We Changed

After the test, we made several improvements. Updated runbooks to match reality. Implemented better monitoring in DR. Documented third-party dependencies and their failover requirements. Changed database replication configuration to ensure data completeness, not just throughput.

We also created automated tests that verify key DR functionality weekly rather than annually. Can we connect to DR databases? Are DNS failover mechanisms working? Is replication actually current? These tests don’t validate everything, but they catch the most obvious problems.
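
In spirit, those weekly checks look something like the sketch below. The hostnames, connection strings, and thresholds are placeholders, and a real version would report into alerting rather than print to a console.

```python
# Sketch: lightweight automated DR checks. DSNs, hostnames, and thresholds
# below are placeholders, not real infrastructure.
import socket
import psycopg2

DR_DB_DSN = "host=dr-db dbname=app user=dr_check connect_timeout=5"
DR_HOSTNAME = "app.dr.example.com"
MAX_REPLICATION_DELAY_SECONDS = 900  # the stated 15-minute RPO

def check_dr_database_reachable():
    # Can we open a connection to the DR database and run a trivial query?
    with psycopg2.connect(DR_DB_DSN) as conn, conn.cursor() as cur:
        cur.execute("SELECT 1")
        return cur.fetchone() == (1,)

def check_replication_within_rpo():
    # On a PostgreSQL replica, compare the last replayed transaction time to now.
    with psycopg2.connect(DR_DB_DSN) as conn, conn.cursor() as cur:
        cur.execute(
            "SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))"
        )
        delay = cur.fetchone()[0]
        return delay is not None and delay < MAX_REPLICATION_DELAY_SECONDS

def check_dr_dns_resolves():
    # Does the DR hostname resolve at all?
    try:
        socket.getaddrinfo(DR_HOSTNAME, 443)
        return True
    except socket.gaierror:
        return False

if __name__ == "__main__":
    checks = {
        "dr_database_reachable": check_dr_database_reachable,
        "replication_within_rpo": check_replication_within_rpo,
        "dr_dns_resolves": check_dr_dns_resolves,
    }
    for name, fn in checks.items():
        try:
            ok = bool(fn())
        except Exception:
            ok = False
        print(f"{name}: {'OK' if ok else 'FAIL'}")
```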

And we scheduled more regular full DR tests. Not annually. Quarterly, with rotating scope. One quarter we test database failover. Next quarter we test application failover. Then network failover. This spreads the disruption while maintaining confidence that each component works.

The Cultural Problem

The real issue isn’t technical. It’s that disaster recovery is seen as insurance that you pay for but hope never to use. Nobody wants to spend time and effort maintaining systems that sit idle. It doesn’t feel productive. There’s always something more urgent.

This means DR infrastructure degrades over time. Configuration changes happen in production but not in DR. New services get deployed without DR planning. Documentation gets out of date. By the time you need DR, it’s not ready.

Making DR a priority requires leadership commitment. Budget needs to be allocated for maintaining it. Time needs to be set aside for testing it. Engineers need to include DR considerations in design decisions. This doesn’t happen automatically.

The Uncomfortable Truth

After our DR test, I’m confident that we could recover from a disaster. But it would take longer than planned, involve more improvisation than documented, and we’d probably lose more data than promised. Our DR plan is better than nothing, but it’s not as reliable as we thought.

I suspect this is true for many organizations. The DR plans look good. The documentation is detailed. But the gap between documented procedures and actual capability is larger than anyone wants to admit.

The only way to know is to test. Not component testing. Full end-to-end disaster recovery tests where you actually try to run production workloads on backup systems. It’s disruptive, expensive, and reveals uncomfortable truths. But it’s the only way to have confidence that your DR plan actually works.

We’re better prepared now than before the test. But we’re also more humble about what we don’t know and what could go wrong. That’s probably the most valuable outcome. False confidence is dangerous. Tested confidence, even if imperfect, is far better.