Business Continuity | J Wolfgang Goerlich

Archive for the ‘Business Continuity’ Category

Relying on Third Parties for DR

Posted February 2, 2009 by Wolf Goerlich

Many of us rely upon vendors and third-parties for our disaster recovery efforts. For example, I personally rely upon a refueling company to keep my generator topped off and a maintenance company to keep it running. Other companies rely upon shared data centers, data backup/recovery companies, and DR planners like Agility Recovery.

A weakness in these plans occurs when a regional disaster impacts multiple companies. In these scenarios, the third-party may lack the capacity to handle all the requests and be overwhelmed. One thing that happened to my data center during the winter storm (which knocked power out for five days) was that the fuel trucks were delayed for 48 hours, and the maintenance crew delayed for 24. The times are well within tolerances, but well beyond normal service levels.

Power Outages in Michigan

Posted December 29, 2008 by Wolf Goerlich

Many Michigan data centers are on generator power this morning. From the Associated Press, “Wind gusting more than 60 mph knocked out power to about 413,000 Michigan homes and businesses on Sunday as temperatures dipped back into the 20s and 30s.” AT&T Wireless was out state wide yesterday. Level 3 Communications, a backbone Internet provider, also had an outage yesterday morning in the Detroit NOC.

How are your disaster recovery strategies faring?

Baseline Article on Business Continuity Planning

Posted September 2, 2008 by Wolf Goerlich

Baseline has an article on Best Practices in Disaster Recovery, Business Continuity Planning. “… disaster recovery priorities depend on the nature of the system. ‘We take snapshots ranging from every hour to every 15 minutes, depending on our systems,’ says Wolfgang Goerlich, network operations and security manager for the Birmingham, Mich.-based investment banking firm. ‘Our top-tier systems, such as trading, can have an issue if we lose even 15 minutes. Lower-tier systems, such as research, just generate reports once a day, so if they lose data for [a few] hours, it isn’t as big of an issue. With our lowest-tier systems, our DR plan is to go out and buy boxes and bring them up in a couple of weeks.'”

“The key thing for us was a very short recovery-time objective,’ says Goerlich. The firm uses Compellent’s virtual storage arrays, with the DR baked in. He says it takes just one click to activate DR and boot up the systems on a new box.”

How Microsoft Hyper-V Helped My IT Shop Revamp Disaster Recovery

Posted August 27, 2008 by Wolf Goerlich

The article in CIO magazine, How Microsoft Hyper-V Helped My IT Shop Revamp Disaster Recovery, details my efforts to reorganize network and recovery operations around storage and server virtualization. It also mentions the year-long virtualization study that I did in 2007/2008.

Virtualization for Disaster Recovery: Strategies

Posted April 6, 2008 by Wolf Goerlich

Using virtualization as a disaster recovery strategy can in one of two scenarios:

First scenario is vm to vm. Put a hypervisor at the production site and another at the recovery site. Run the production server in a vm. Replicate the vm drives to the recovery site. During a disaster, boot the vm up on the recovery hypervisor.

The second scenario is bare metal to vm. Put a physical server running on bare metal at the production site. Stage the physical server with the necessary vm drivers (in Hyper-V, this is called the Integration Components.) Put a hypervisor at the recovery site. Replicate the disks. During a disaster, boot the server up as a vm on the recovery hypervisor. The second scenario requires block level replication and the ability for the hypervisor to read native disks. If both of these requirements are not possible, an alternative solution exists. This is to restore the production server into a vm using software that supports VM P2V DR. Examples of this software include Acronis, Arcserve, and Backup Exec. The downside is that this option takes significantly longer.

Virtualization for Disaster Recovery: Metrics

Posted April 5, 2008 by Wolf Goerlich

Some quick thoughts on using server virtualization for disaster recovery. The key metrics in using VMs for DR is RTO and RPO. These are defined during the BIA process. One question that I wrestled with was how to get a near time RTO (within minutes before the disaster) and a rapid RPO (within 1hours after the disaster).

Traditional P2V techniques rely on a live system or a nightly backup, so RTO is up to 24 hours. Traditional P2V also relies upon writing the data back out into virtual disks, so the RPO for our average server was up to 7 hours. We addressed these challenges by keeping the storage on a backend SAN and pointing the disk into the VM in the event of a disaster. The RTO is then near time and the RPO is an hour or less.

The DR strategy requires native NTFS disk access and SAN support. Both VMware ESX and Hyper-V support this type of DR. Linux based hypervisors such as Xen do not.

Selecting backup data centers for DR

Posted March 18, 2008 by Wolf Goerlich

Business continuity and disaster recovery have been on my mind a lot lately. The SNW conference is fast approaching and I am putting the final touches on my slide deck. One question is when and where a company should open a backup data center.

First, and I cannot stress this enough, do an impact analysis. Do you really need another data center? The text book example is the company, in an earthquake zone, which determines that bolting server racks down and buying additional insurance provides the same level of protection at significantly less cost. Your organization does not operate in a textbook, of course, and you may very well need another data center.

Having made the business case and established the budget, the next question is where to locate the facility. The following should be researched and considered:

Access – road, rail, and air, telecommunications
Proximity to current data center (under 30 miles makes real-time fail-over possible)
Local crime rates (history of protests, strikes, or riots)
Municipal services (police, fire, ambulance, power)
Wind patterns (is this downwind from nuclear power plants or military targets?
Weather patterns (hurricanes, tornadoes, et cetera)
Geophysical conditions (fault lines and earth quakes)

Gather all of this information and begin looking possible sites. Look for sites that are within budget and near high speed Internet backbone links. Narrow these down to those with redundant power distribution points. Then consider such things as wind and weather. This should narrow the possible sites down quite a bit. Then begin considering how your organization will transport people to this location. Airlines are best, but flights may be grounded in a widespread disaster, so also look for wide accessible highways.

In the end, come up with the punch list of three to five ideal sites. Go out for a site visit and confirm your assumptions. Some sites might not make the cut upon visiting, as your information may be out of date. At the end of this process, you will have done the homework and due diligence to make the recommendation to senior management.

The Machine Stops

Posted February 28, 2008 by Wolf Goerlich

The BlackBerry network went down today. The outage lasted about three hours. Roughly speaking, that’s about three hundred messages, blog posts, and feed updates. I got nothing. Nothing at all. Nothing to read, nothing to learn, nothing to think about. The silence was deafening.

Yet silence did give me time to think. In our cybercentric society, connectivity is our lifeblood. Being disconnected brings a weird bloodless feeling. It reminded me of some stories I had read about the dystopian future, where mankind becomes overly dependent upon technology. What would I do if the BlackBerry network stayed disconnected?

Just as I had this thought, the connectivity picked back up. Feeds poured into my device. A hundred voices asked: have you had any ideas lately?

Back to the machine.

Out and About: Storage Networking World

Posted February 10, 2008 by Wolf Goerlich

I will be out at the Storage Networking World Conference onApril 7 thru 10. On Tuesday, I am holding a session in the Business Continuity/Data Protection track. The topic is Simplifying Business Continuity Planning using OS and Storage Virtualization. Hope to see you there.

Abstract: This session presents the evolution of disaster recovery. An institution responsible for billions in assets, Munder Capital Management’s information systems must be always available. Munder has been thru several BCP cycles as they went from tape to standby systems, from cold to hot sites. This session delves into the lessons learned from these DR strategies as well as presents their latest: use OS and storage virtualization to completely automate recovery.

Budgeting for disaster recovery

Posted April 18, 2005 by Wolf Goerlich

What is your budget for disaster recovery?

Are you spending too much?

Or too little?

Ideally, Disaster Recovery is a program that contains one or more strategies. Each strategy is a specific way to recover IT systems for one or more business processes. For example, you have a hot site strategy with one-to-one duplication of the hardware and software used in production. This strategy is costly and so it protects the critical business processes. For non-critical processes, you may have a cold site strategy. Basically, you’ll buy new hardware and restore from tape should an outage occur.

There is a means to calculate the budget for Disaster Recovery.

Step one is to determine the likelihood. Map the IT software and hardware to the business process. Determine the threats you are protecting against (fire, flood, earthquake). Do some digging to estimate the how likely these threats are to occur (Annualized Rate of Occurrence).

Step two is to determine the financial impact. If the impact occurs, everything is offline, the business process grounds to a halt, how hard will that hit the business? Quantify the impact in terms of dollars (Single Loss Expectancy).

Step three is to multiply the two to determine the Annualized Loss Expectancy. ALE = SLE * ARO

ALE is represents the business’s asset exposure. It is the most that should be spent on a Disaster Recovery strategy that mitigates the risk. If ALE is $50,000 and the recovery strategy costs $100,000 a year, then obviously you are spending too much. If you are spending $10,000 a year, you are either a hero or putting the business process at risk by using an insufficient strategy.

I find that organizations who are just starting out with Business Continuity and Disaster Recovery greatly benefit from this budgeting method. It demonstrates a clear link between assets and protection. This way, an IT team can cost justify investments in Disaster Recovery systems.