Archive for the ‘Business Continuity’ Category

Power Outages in Michigan

Many Michigan data centers are on generator power this morning. From the Associated Press, “Wind gusting more than 60 mph knocked out power to about 413,000 Michigan homes and businesses on Sunday as temperatures dipped back into the 20s and 30s.” AT&T Wireless was out statewide yesterday. Level 3 Communications, a backbone Internet provider, also had an outage yesterday morning in the Detroit NOC.

How are your disaster recovery strategies faring?

Baseline Article on Business Continuity Planning

Baseline has an article on Best Practices in Disaster Recovery, Business Continuity Planning. “… disaster recovery priorities depend on the nature of the system. ‘We take snapshots ranging from every hour to every 15 minutes, depending on our systems,’ says Wolfgang Goerlich, network operations and security manager for the Birmingham, Mich.-based investment banking firm. ‘Our top-tier systems, such as trading, can have an issue if we lose even 15 minutes. Lower-tier systems, such as research, just generate reports once a day, so if they lose data for [a few] hours, it isn’t as big of an issue. With our lowest-tier systems, our DR plan is to go out and buy boxes and bring them up in a couple of weeks.'”

“‘The key thing for us was a very short recovery-time objective,’ says Goerlich. The firm uses Compellent’s virtual storage arrays, with the DR baked in. He says it takes just one click to activate DR and boot up the systems on a new box.”

Virtualization for Disaster Recovery: Strategies

Using virtualization as a disaster recovery strategy can be done in one of two scenarios:

The first scenario is vm to vm. Put a hypervisor at the production site and another at the recovery site. Run the production server in a vm. Replicate the vm drives to the recovery site. During a disaster, boot the vm up on the recovery hypervisor.

The second scenario is bare metal to vm. Put a physical server running on bare metal at the production site. Stage the physical server with the necessary vm drivers (in Hyper-V, these are called the Integration Components). Put a hypervisor at the recovery site. Replicate the disks. During a disaster, boot the server up as a vm on the recovery hypervisor. The second scenario requires block-level replication and the ability for the hypervisor to read native disks. If either of these requirements cannot be met, an alternative solution exists: restore the production server into a vm using backup software that supports P2V recovery. Examples of this software include Acronis, Arcserve, and Backup Exec. The downside is that this option takes significantly longer.
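As a sketch, the decision between these two scenarios and the P2V fallback can be written out in code. The class and function names here are illustrative only, not any real hypervisor API:

```python
# Illustrative sketch of the two virtualization DR scenarios and the fallback.
# SiteCapabilities and pick_recovery_path are invented names for this example.
from dataclasses import dataclass

@dataclass
class SiteCapabilities:
    block_level_replication: bool        # SAN/array replication of raw disks
    hypervisor_reads_native_disks: bool  # recovery hypervisor can attach a replicated LUN

def pick_recovery_path(production_is_vm: bool, recovery: SiteCapabilities) -> str:
    if production_is_vm:
        # Scenario 1: vm to vm -- replicate the vm drives and boot the vm
        # on the recovery hypervisor.
        return "boot replicated vm on recovery hypervisor"
    if recovery.block_level_replication and recovery.hypervisor_reads_native_disks:
        # Scenario 2: bare metal to vm -- vm drivers (e.g. Integration
        # Components) staged in advance, replicated disks booted as a vm.
        return "boot replicated physical disks as a vm"
    # Fallback: restore into a vm with P2V-capable backup software
    # (e.g. Acronis, Arcserve, Backup Exec); significantly slower.
    return "restore backup into a vm (P2V)"
```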

Virtualization for Disaster Recovery: Metrics

Some quick thoughts on using server virtualization for disaster recovery. The key metrics in using VMs for DR are RTO and RPO. These are defined during the BIA process. One question that I wrestled with was how to get a near-time RPO (within minutes before the disaster) and a rapid RTO (within one hour after the disaster).

Traditional P2V techniques rely on a live system or a nightly backup, so RPO is up to 24 hours. Traditional P2V also relies upon writing the data back out into virtual disks, so the RTO for our average server was up to 7 hours. We addressed these challenges by keeping the storage on a backend SAN and pointing the disks into the VM in the event of a disaster. The RPO is then near time and the RTO is an hour or less.
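Roughly, the arithmetic behind these figures looks like the following sketch, using the standard definitions (RPO as maximum data loss, RTO as time to resume). The 15-minute replication interval is an assumed figure, not one stated in this post:

```python
# Back-of-the-envelope comparison of the two approaches.
# RPO = maximum acceptable data loss; RTO = time to resume service.

def worst_case_data_loss_hours(replication_interval_hours: float) -> float:
    # Data loss is bounded by the time since the last backup or replica.
    return replication_interval_hours

# Traditional P2V: nightly backup, then hours spent writing the data
# back out into virtual disks before the VM can boot.
traditional = {"rpo_h": worst_case_data_loss_hours(24), "rto_h": 7}

# SAN-backed approach: replicated LUNs are pointed directly into the VM
# at failover, so no data copy is needed. 0.25 h (15 min) replication
# interval is an assumption for illustration.
san_backed = {"rpo_h": worst_case_data_loss_hours(0.25), "rto_h": 1}
```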

The DR strategy requires native NTFS disk access and SAN support. Both VMware ESX and Hyper-V support this type of DR. Linux-based hypervisors such as Xen do not.

Selecting backup data centers for DR

Business continuity and disaster recovery have been on my mind a lot lately. The SNW conference is fast approaching and I am putting the final touches on my slide deck. One question is when and where a company should open a backup data center.

First, and I cannot stress this enough, do an impact analysis. Do you really need another data center? The textbook example is the company, in an earthquake zone, which determines that bolting server racks down and buying additional insurance provides the same level of protection at significantly less cost. Your organization does not operate in a textbook, of course, and you may very well need another data center.

Having made the business case and established the budget, the next question is where to locate the facility. The following should be researched and considered:

  • Access – road, rail, air, and telecommunications
  • Proximity to the current data center (under 30 miles makes real-time fail-over possible)
  • Local crime rates (history of protests, strikes, or riots)
  • Municipal services (police, fire, ambulance, power)
  • Wind patterns (is this downwind from nuclear power plants or military targets?)
  • Weather patterns (hurricanes, tornadoes, et cetera)
  • Geophysical conditions (fault lines and earthquakes)

Gather all of this information and begin looking at possible sites. Look for sites that are within budget and near high-speed Internet backbone links. Narrow these down to those with redundant power distribution points. Then consider such things as wind and weather. This should narrow the possible sites down quite a bit. Then consider how your organization will transport people to this location. Airlines are best, but flights may be grounded in a widespread disaster, so also look for wide, accessible highways.
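This narrowing process is essentially a series of filters. A hypothetical sketch, with invented site data and thresholds:

```python
# Hypothetical candidate sites; all names, costs, and attributes are
# invented for illustration.
sites = [
    {"name": "A", "annual_cost": 900_000, "near_backbone": True,
     "redundant_power": True,  "hurricane_zone": False, "highway_access": True},
    {"name": "B", "annual_cost": 400_000, "near_backbone": True,
     "redundant_power": False, "hurricane_zone": False, "highway_access": True},
    {"name": "C", "annual_cost": 600_000, "near_backbone": True,
     "redundant_power": True,  "hurricane_zone": True,  "highway_access": True},
    {"name": "D", "annual_cost": 700_000, "near_backbone": True,
     "redundant_power": True,  "hurricane_zone": False, "highway_access": True},
]
budget = 800_000  # assumed annual budget for the facility

# Filter in the order described above: budget and backbone proximity first,
# then redundant power, then weather, then transportation.
shortlist = [s for s in sites
             if s["annual_cost"] <= budget and s["near_backbone"]]
shortlist = [s for s in shortlist if s["redundant_power"]]
shortlist = [s for s in shortlist if not s["hurricane_zone"]]
shortlist = [s for s in shortlist if s["highway_access"]]

candidates = [s["name"] for s in shortlist]  # sites worth a visit
```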

In the end, come up with the punch list of three to five ideal sites. Go out for a site visit and confirm your assumptions. Some sites might not make the cut upon visiting, as your information may be out of date. At the end of this process, you will have done the homework and due diligence to make the recommendation to senior management.

The Machine Stops

The BlackBerry network went down today. The outage lasted about three hours. Roughly speaking, that’s about three hundred messages, blog posts, and feed updates. I got nothing. Nothing at all. Nothing to read, nothing to learn, nothing to think about. The silence was deafening.

Yet silence did give me time to think. In our cybercentric society, connectivity is our lifeblood. Being disconnected brings a weird bloodless feeling. It reminded me of some stories I had read about the dystopian future, where mankind becomes overly dependent upon technology. What would I do if the BlackBerry network stayed disconnected?

Just as I had this thought, the connectivity picked back up. Feeds poured into my device. A hundred voices asked: have you had any ideas lately?

Back to the machine.

Out and About: Storage Networking World

I will be out at the Storage Networking World Conference on April 7 through 10. On Tuesday, I am holding a session in the Business Continuity/Data Protection track. The topic is Simplifying Business Continuity Planning using OS and Storage Virtualization. Hope to see you there.

Abstract: This session presents the evolution of disaster recovery. An institution responsible for billions in assets, Munder Capital Management must keep its information systems always available. Munder has been through several BCP cycles as it went from tape to standby systems, from cold to hot sites. This session delves into the lessons learned from these DR strategies and presents their latest: using OS and storage virtualization to completely automate recovery.

Budgeting for disaster recovery

What is your budget for disaster recovery?

Are you spending too much?

Or too little?

Ideally, Disaster Recovery is a program that contains one or more strategies. Each strategy is a specific way to recover IT systems for one or more business processes. For example, you might have a hot site strategy with one-to-one duplication of the hardware and software used in production. This strategy is costly, so it is reserved for the critical business processes. For non-critical processes, you may have a cold site strategy: basically, you buy new hardware and restore from tape should an outage occur.

There is a means to calculate the budget for Disaster Recovery.

Step one is to determine the likelihood. Map the IT software and hardware to the business process. Determine the threats you are protecting against (fire, flood, earthquake). Do some digging to estimate how likely these threats are to occur (the Annualized Rate of Occurrence).

Step two is to determine the financial impact. If a threat occurs, everything goes offline, and the business process grinds to a halt, how hard will that hit the business? Quantify the impact in terms of dollars (the Single Loss Expectancy).

Step three is to multiply the two to determine the Annualized Loss Expectancy: ALE = SLE × ARO.

ALE represents the business’s asset exposure. It is the most that should be spent on a Disaster Recovery strategy that mitigates the risk. If the ALE is $50,000 and the recovery strategy costs $100,000 a year, then obviously you are spending too much. If you are spending $10,000 a year, you are either a hero or putting the business process at risk with an insufficient strategy.
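A minimal sketch of the three steps and the budget check, using the $50,000 example; the function names and the $500,000/once-per-decade inputs are mine, chosen to produce that ALE:

```python
# Step three: ALE = SLE * ARO.
def annualized_loss_expectancy(sle_dollars: float, aro_per_year: float) -> float:
    # SLE: dollar impact of a single loss event.
    # ARO: expected number of occurrences per year.
    return sle_dollars * aro_per_year

# Assumed example: a $500,000 outage impact expected once every ten years.
ale = annualized_loss_expectancy(500_000, 0.1)  # 50000.0

def budget_check(ale: float, strategy_cost_per_year: float) -> str:
    # The ALE is the most worth spending per year to mitigate this risk.
    if strategy_cost_per_year > ale:
        return "spending too much"
    return "within exposure (verify the strategy actually meets the RTO/RPO)"
```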

I find that organizations that are just starting out with Business Continuity and Disaster Recovery greatly benefit from this budgeting method. It demonstrates a clear link between assets and protection. This way, an IT team can cost-justify investments in Disaster Recovery systems.

Disaster Recovery Metrics

About the only thing an engineer enjoys more than building technology is competing with the technology.

Consider. The first car race was in 1895 in Chicago. In 1904, Henry Ford himself set the speed record by racing down Lake St. Clair near my home town. That’s right. Years before the Ford Model T hit the market in 1908, when just getting a car to move was an accomplishment, people were already racing their technology.

In the same vein, building a disaster recovery strategy is good fun. Racing the strategy, tuning and tweaking it, optimizing it, now that is even better. I therefore offer up the following metrics, DR speedometers if you will, for racing your technology.

Recovery Time Objective (RTO). How soon after an event occurs can you recover operations? Typically measured in hours or days, RTO is the time it takes to resume systems, applications, and business operations. RTO is going from the production facility to the disaster recovery facility.

Recovery Point Objective (RPO). How close to when an event occurs can you recover data? Typically measured in hours or days, RPO is the time between the event and the last backup or copy of your business data.

Return to normal operations (RNO). How soon after an event clears can you resume in your production facilities? Typically measured in days or weeks, RNO is the time it takes to go from the disaster recovery facility back to the production facility.

Recovery time granularity (RTG). How many backup jobs or data copies are available within your RPO? Typically measured as a count. For example, assume a nightly backup and an RPO of one week. The RTG is 7, as there are at most 7 backup jobs within the one-week RPO.
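As a quick sketch of the RTG arithmetic (the function name is mine):

```python
# RTG: number of backup jobs or data copies that fit inside the RPO window.
def recovery_time_granularity(rpo_hours: float, backup_interval_hours: float) -> int:
    return int(rpo_hours // backup_interval_hours)

# Nightly backup (every 24 hours) with a one-week RPO:
rtg = recovery_time_granularity(7 * 24, 24)  # 7
```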

Recovery consistency characteristic (RCC). Does the data require consistency across multiple hard drives or logical volumes? Typically a yes or no metric that is applicable mainly to business databases and data warehouses.

Recovery object granularity (ROG). What level of recovery is needed to resume operations? Typically a list, such as: system level (multiple servers); server level (multiple hard drives); hard drive level (multiple folders); folder level; file level; item or brick level.

Recovery service scalability (RSS). How scalable is this particular method of recovery? Qualitative metric that identifies the bottleneck in the recovery method. For example, how many tapes can be restored at one time may limit a recovery strategy based on tapes.

Recovery service resiliency (RSR). How tolerant is the disaster recovery to subsequent disasters? Qualitative metric that identifies ways to continue the disaster recovery in the face of other failures and outages.

Recovery management cost (RMC). How cost-effective is the recovery strategy? Typically measured as a percentage. RMC is the disaster’s per-incident cost divided by the recovery equipment’s operating cost. RMC represents the efficiency of a given recovery strategy.