About the only thing an engineer enjoys more than building technology is competing with the technology.
Consider. The first American car race was held in 1895 in Chicago. In 1904, Henry Ford himself set the land speed record by racing across the ice of Lake St. Clair, near my home town. That’s right. Years before the Ford Model T hit the market in 1908, when just getting a car to move was an accomplishment, people were already racing their technology.
In the same vein, building a disaster recovery strategy is good fun. Racing the strategy, tuning and tweaking it, optimizing it, now that is even better. I therefore offer up the following metrics, DR speedometers if you will, for racing your technology.
Disaster Recovery Metrics
Recovery Time Objective (RTO). How soon after an event occurs can you recover operations? Typically measured in hours or days, RTO is the time it takes to resume systems, applications, and business operations. RTO covers the move from the production facility to the disaster recovery facility.
Recovery Point Objective (RPO). How close to when an event occurs can you recover data? Typically measured in hours or days, RPO is the time between the event and the last backup or copy of your business data.
Return to normal operations (RNO). How soon after an event clears can you resume in your production facilities? Typically measured in days or weeks, RNO is the time it takes to go from the disaster recovery facility back to the production facility.
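To put numbers on the three time-based metrics above, here is a minimal sketch that measures the achieved recovery point, recovery time, and return to normal operations from an incident timeline. Every timestamp and variable name is made up for illustration; the idea is simply to compare what you measure in a test against the objectives you set.

```python
from datetime import datetime

# Hypothetical incident timeline; all timestamps are invented for illustration.
last_backup = datetime(2024, 3, 1, 2, 0)         # last good backup or data copy
event = datetime(2024, 3, 1, 14, 0)              # the event occurs
dr_online = datetime(2024, 3, 2, 8, 0)           # operations resume at the DR facility
event_cleared = datetime(2024, 3, 4, 8, 0)       # the event clears
production_resumed = datetime(2024, 3, 9, 8, 0)  # operations back in production


def hours_between(start, end):
    """Elapsed time between two timestamps, in hours."""
    return (end - start).total_seconds() / 3600


rpo_hours = hours_between(last_backup, event)                     # data exposure window
rto_hours = hours_between(event, dr_online)                       # production to DR
rno_days = hours_between(event_cleared, production_resumed) / 24  # DR back to production

print(f"RPO achieved: {rpo_hours:.0f} hours")
print(f"RTO achieved: {rto_hours:.0f} hours")
print(f"RNO achieved: {rno_days:.0f} days")
```

With this made-up timeline, the lap times come out to 12 hours of data exposure, 18 hours to resume at the disaster recovery facility, and 5 days to return to production.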
Recovery time granularity (RTG). How many backup jobs or data copies are available within your RPO? Typically measured as a count. For example, assume a nightly backup and an RPO of one week. The RTG will be 7, as there are at most 7 backup jobs within the one-week RPO.
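The nightly-backup arithmetic can be sketched in a few lines. The function name and its parameters are my own invention, and the sketch assumes backups run at a fixed interval.

```python
from datetime import timedelta


def recovery_time_granularity(rpo, backup_interval):
    """Count the backup jobs that fit inside the RPO window.

    Assumes backups run at a fixed interval (an assumption of this sketch).
    """
    if backup_interval <= timedelta(0):
        raise ValueError("backup_interval must be positive")
    # Dividing one timedelta by another with // yields a whole number.
    return rpo // backup_interval


# Nightly backups with a one-week RPO, as in the example above.
rtg = recovery_time_granularity(timedelta(weeks=1), timedelta(days=1))
print(rtg)  # 7
```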
Recovery consistency characteristic (RCC). Does the data require consistency across multiple hard drives or logical volumes? Typically a yes or no metric that is applicable mainly to business databases and data warehouses.
Recovery object granularity (ROG). What level of recovery is needed to resume operations? Typically a list, such as: system level (multiple servers); server level (multiple hard drives); hard drive level (multiple folders); folder level; file level; item or brick level.
Recovery service scalability (RSS). How scalable is this particular method of recovery? Qualitative metric that identifies the bottleneck in the recovery method. For example, the number of tapes that can be restored at one time may limit a recovery strategy based on tapes.
Recovery service resiliency (RSR). How tolerant is the disaster recovery to subsequent disasters? Qualitative metric that identifies ways to continue the disaster recovery in the face of other failures and outages.
Recovery management cost (RMC). How cost-effective is the recovery strategy? Typically measured as a percentage. RMC is the disaster’s per-incident cost divided by the recovery equipment’s operating cost. RMC represents the efficiency of a given recovery strategy.
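As a rough sketch of the RMC ratio, the function below divides the per-incident cost by the operating cost and expresses the result as a percentage. The function name and the dollar figures are hypothetical, and the sketch assumes both costs are in the same currency and that the operating cost covers whatever period you choose to compare.

```python
def recovery_management_cost(per_incident_cost, operating_cost):
    """Per-incident cost divided by operating cost, as a percentage.

    Assumes both figures use the same currency (an assumption of this sketch).
    """
    if operating_cost <= 0:
        raise ValueError("operating_cost must be positive")
    return per_incident_cost / operating_cost * 100


# Made-up figures for illustration only.
rmc = recovery_management_cost(per_incident_cost=50_000, operating_cost=200_000)
print(f"RMC: {rmc:.0f}%")  # RMC: 25%
```

Running the same calculation for each candidate strategy gives you a single number to race them against one another.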