Technology outages are inevitable, so it’s important to prepare now for the unexpected.
Most technology leaders know that sinking feeling. The phone rings, and the voice at the other end says, “The mainframe just crashed.” Or, “We lost power at the data center and some of the uninterruptible power supply units (or the generator) didn’t work properly.” Just as scary: “Our vendor’s network is down. The incident is impacting thousands of customers.”
Computer and network outages — and the corresponding ramifications — come with the IT territory. Even when services are outsourced, the ultimate responsibility still rests with the public CIO. Despite mind-numbing thoughts of “what if,” our teams must implement recovery efforts just as a fire department responds to fires. And yes, seconds matter.
While the need to activate a full-scale disaster recovery plan may be rare, operations personnel deal with varying types of critical incidents regularly. But how effective is your team in these situations? What’s your recovery time objective when things go wrong? Simply stated: Are you ready for the next significant outage?
So what are some of the keys to a successful outage remediation?
1) Understand the outage scope, your options and timelines. Just as the military wants intelligence regarding enemy movements in a war, operations leaders must quickly grasp the extent of an operational emergency. Good monitoring tools, end-to-end system management capabilities and qualified operations staff are essential for achieving timely restoration of service.
Tip: Beyond asking what happened, ask whether anything changed. Can you roll back to the previous configuration? Use requests for change and change control boards to track activity. In Michigan, we activate our Emergency Contact Center during major incidents to ensure that the right priority is placed on the situation. All key resources gather (virtually or in person) to coordinate recovery options.
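To make the "did anything change?" question concrete, here is a minimal Python sketch that pulls the changes applied shortly before an outage began, most recent first, so the team can judge what might be rolled back. Everything in it is an assumption made for illustration: the ChangeRecord fields, the 24-hour lookback window, and the sample change IDs and dates do not come from any real change-management system.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass(frozen=True)
class ChangeRecord:
    """One approved change (hypothetical fields, for illustration only)."""
    change_id: str
    system: str
    applied_at: datetime
    summary: str


def recent_changes(
    records: List[ChangeRecord],
    outage_start: datetime,
    lookback: timedelta = timedelta(hours=24),  # assumed window, tune as needed
) -> List[ChangeRecord]:
    """Return changes applied within `lookback` before the outage, newest first."""
    window_start = outage_start - lookback
    candidates = [
        r for r in records if window_start <= r.applied_at <= outage_start
    ]
    return sorted(candidates, key=lambda r: r.applied_at, reverse=True)


# Hypothetical sample data.
CHANGE_LOG = [
    ChangeRecord("RFC-101", "network", datetime(2024, 1, 10, 9, 0), "Firewall rule update"),
    ChangeRecord("RFC-102", "database", datetime(2024, 1, 11, 22, 30), "Index rebuild"),
    ChangeRecord("RFC-103", "storage", datetime(2024, 1, 12, 1, 15), "Firmware upgrade"),
    ChangeRecord("RFC-104", "network", datetime(2024, 1, 12, 4, 0), "Router patch"),
]

if __name__ == "__main__":
    outage = datetime(2024, 1, 12, 2, 0)
    for change in recent_changes(CHANGE_LOG, outage):
        print(change.change_id, change.system, change.summary)
```

A list like this only narrows the search. A change that falls inside the window is a suspect, not a proven cause.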
2) Develop clear roles and responsibilities. Early decisions are often the key. Who’s in charge and what resources are available? Should we keep fixing the problem or activate the disaster recovery plan? What resources or vendor relationships can help?
Seasoned pros who have been through outages know that conflicting information and competing interests often emerge. Sometimes the technical staff will underestimate the issue or overestimate their ability to remediate what happened, making matters worse.
Tip: Developing “run books,” compilations of the procedures and operations that the system administrator or operator carries out, can help navigate outages. A good run book includes procedures for every anticipated scenario and generally uses step-by-step decision trees to determine the most effective course of action.
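A run book decision tree can be as simple as a chain of yes/no questions that ends in an action. The Python sketch below is one possible way to encode that idea, not a standard run book format. The questions echo the ones raised in this column, but the tree layout, the class names (Question, Action) and the walk function are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union


@dataclass(frozen=True)
class Action:
    """A leaf of the tree: the step the operator should take."""
    instruction: str


@dataclass(frozen=True)
class Question:
    """A yes/no decision point with one branch per answer."""
    prompt: str
    if_yes: "Node"
    if_no: "Node"


Node = Union[Action, Question]

# Hypothetical prompts, paraphrased from the questions in the column.
CHANGED = "Did anything change recently?"
ROLLBACK = "Can you roll back to the previous configuration?"
WITHIN_RTO = "Is restoration likely within the recovery time objective?"

FIX_OR_FAIL_OVER = Question(
    WITHIN_RTO,
    if_yes=Action("Keep fixing the problem."),
    if_no=Action("Activate the disaster recovery plan."),
)

RUN_BOOK = Question(
    CHANGED,
    if_yes=Question(
        ROLLBACK,
        if_yes=Action("Roll back to the previous configuration."),
        if_no=FIX_OR_FAIL_OVER,
    ),
    if_no=FIX_OR_FAIL_OVER,
)


def walk(node: Node, answers: Dict[str, bool]) -> Tuple[str, List[str]]:
    """Follow the tree using `answers`; return the action and the path taken."""
    path: List[str] = []
    while isinstance(node, Question):
        answer = answers[node.prompt]  # KeyError if a needed answer is missing
        path.append(f"{node.prompt} {'yes' if answer else 'no'}")
        node = node.if_yes if answer else node.if_no
    return node.instruction, path


if __name__ == "__main__":
    instruction, path = walk(RUN_BOOK, {CHANGED: True, ROLLBACK: False, WITHIN_RTO: False})
    for step in path:
        print(step)
    print("Action:", instruction)
```

Recording the path alongside the final action gives the team a ready-made log of why a decision was made, which helps when the root-cause analysis is written later.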
3) Promote excellent communication. When critical systems are down, everyone counts the minutes. Perception is reality, and while some loss-of-service situations will make the local news and others won’t, public perception can impact your actions. Remember that communication continues after systems are restored. A good root-cause analysis listing lessons learned — including people, process and technology activities — should be provided to clients after appropriate review.
Tip: Develop an emergency communication plan for dealing with internal and external stakeholders. Don’t let this become shelfware; practice different scenarios during tabletop exercises. Meeting customer expectations and building confidence in your statements are as important as restoring service. Don’t make promises you can’t keep.
In May, Michigan had two outages that made the news. Fortunately, our experienced public information officer handled all media inquiries with expert precision. He knew what questions would be asked, whom to contact internally to get the facts and what to say about restoration times.
In conclusion: Despite our best efforts, technology outages are inevitable. Cloud computing and more smartphones in the enterprise will further complicate end-to-end service restoration and escalate the need to partner with vendors. Prepare now for the unexpected.
Dan Lohrmann is Michigan’s CTO and previously served as the state’s first chief information security officer. He has 25 years of worldwide security experience, and has won numerous awards for his leadership in the information security field.