Most technology leaders know that sinking feeling. The phone rings, and the voice at the other end says, “The mainframe just crashed.” Or, “We lost power at the data center and some of the uninterruptible power supply units (or the generator) didn’t work properly.” Just as scary: “Our vendor’s network is down. The incident is impacting thousands of customers.”
Computer and network outages — and the corresponding ramifications — come with the IT territory. Even when services are outsourced, the ultimate responsibility still rests with the public CIO. Despite mind-numbing thoughts of “what if,” our teams must implement recovery efforts just as a fire department responds to fires. And yes, seconds matter.
While the need to activate a full-scale disaster recovery plan may be rare, operations personnel deal with varying types of critical incidents regularly. But how effective is your team in these situations? What’s your recovery time objective when things go wrong? Simply stated: Are you ready for the next significant outage?
So what are some of the keys to a successful outage remediation?
1) Understand the outage scope, your options and timelines. Just as the military wants intelligence regarding enemy movements in a war, operations leaders must quickly grasp the extent of an operational emergency. Good monitoring tools, end-to-end system management capabilities and qualified operations staff are essential for achieving timely restoration of service.
Tip: Beyond asking what happened, ask if anything changed. Can you roll back to the previous configuration? Use requests for change and change control boards to track activity. In Michigan, we activate our Emergency Contact Center during major incidents to ensure that the right priority is placed on the situation. All key resources gather (virtually or in person) to coordinate recovery options.
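The "did anything change?" question is easier to answer quickly when change records can be queried. Here is a minimal Python sketch, assuming a hypothetical in-memory list of change records; the `ChangeRecord` fields, the RFC identifiers, the dates and the 24-hour lookback window are all illustrative assumptions, not a description of any particular change management tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class ChangeRecord:
    change_id: str            # hypothetical request-for-change identifier
    system: str               # system the change was applied to
    applied_at: datetime      # when the change went live
    rollback_available: bool  # whether a previous configuration can be restored


def recent_changes(changes, outage_start, lookback=timedelta(hours=24)):
    """Return changes applied within `lookback` before the outage, newest first."""
    window_start = outage_start - lookback
    hits = [c for c in changes if window_start <= c.applied_at <= outage_start]
    return sorted(hits, key=lambda c: c.applied_at, reverse=True)


# Illustrative data only; none of these records are real.
changes = [
    ChangeRecord("RFC-101", "core-router", datetime(2024, 1, 9, 22, 0), True),
    ChangeRecord("RFC-102", "mainframe", datetime(2024, 1, 10, 8, 30), False),
    ChangeRecord("RFC-099", "email", datetime(2024, 1, 6, 9, 0), True),
]
outage_start = datetime(2024, 1, 10, 9, 15)

suspects = recent_changes(changes, outage_start)
for change in suspects:
    action = "roll back" if change.rollback_available else "investigate, no rollback"
    print(f"{change.change_id} on {change.system}: {action}")
```

The most recent change appears first because it is usually the first one worth examining, though the newest change is not always the cause.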
2) Develop clear roles and responsibilities. Early decisions are often the key. Who’s in charge and what resources are available? Should we keep fixing the problem or activate the disaster recovery plan? What resources or vendor relationships can help?
Seasoned pros who have been through outages know that conflicting information and competing interests often emerge. Sometimes the technical staff will underestimate the issue or overestimate their ability to remediate what happened, making matters worse.
Tip: Developing “run books,” compilations of the procedures and operations that the system administrator or operator carries out, can help navigate outages. A good run book includes procedures for every anticipated scenario and generally uses step-by-step decision trees to determine the most effective course of action.
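A run book decision tree can be captured as simple nested data so that it is easy to review and rehearse. The sketch below is one possible shape, written in Python; the questions, the actions and the `walk` helper are hypothetical examples built from the decisions discussed above, not an actual run book.

```python
# Each node is either a dict (a yes/no question with two branches)
# or a string (the action to take). All content here is illustrative.
RUNBOOK = {
    "question": "Did anything change in the last 24 hours?",
    "yes": {
        "question": "Can the change be rolled back?",
        "yes": "Roll back to the previous configuration.",
        "no": "Escalate to the incident lead and the vendor.",
    },
    "no": {
        "question": "Is service expected back within the recovery time objective?",
        "yes": "Keep fixing the problem and report status on a fixed schedule.",
        "no": "Activate the disaster recovery plan.",
    },
}


def walk(node, answers):
    """Follow yes/no answers (True/False) down the tree and return the action."""
    for answer in answers:
        if isinstance(node, str):
            break  # already reached an action; ignore extra answers
        node = node["yes" if answer else "no"]
    if not isinstance(node, str):
        raise ValueError("ran out of answers before reaching an action")
    return node


print(walk(RUNBOOK, [True, True]))
print(walk(RUNBOOK, [False, False]))
```

Keeping the tree as data rather than burying it in code means operations staff can update the steps without touching the logic that walks them.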
3) Promote excellent communication. When critical systems are down, everyone counts the minutes. Perception is reality, and while some loss-of-service situations will make the local news and others won’t, public perception can impact your actions. Remember that communication continues after systems are restored. A good root-cause analysis listing lessons learned — including people, process and technology activities — should be provided to clients after appropriate review.
Tip: Develop an emergency communication plan for dealing with internal and external stakeholders. Don’t let this become shelfware; practice different scenarios during tabletop exercises. Meeting customer expectations and building confidence in your statements are as important as restoring service. Don’t make promises you can’t keep.
In May, Michigan had two outages that made the news. Fortunately, our experienced public information officer handled all media inquiries with expert precision. He knew what questions would be asked, whom to contact internally to get the facts and what to say about restoration times.
In conclusion: Despite our best efforts, technology outages are inevitable. Cloud computing and more smartphones in the enterprise will further complicate end-to-end service restoration and escalate the need to partner with vendors. Prepare now for the unexpected.
Dan Lohrmann is Michigan’s CTO and previously served as the state’s first chief information security officer. He has 25 years of worldwide security experience, and has won numerous awards for his leadership in the information security field.
Daniel J. Lohrmann is an internationally recognized cybersecurity leader, technologist, keynote speaker and author.
During his distinguished career, he has served global organizations in the public and private sectors in a variety of executive leadership capacities, receiving numerous national awards including: CSO of the Year, Public Official of the Year and Computerworld Premier 100 IT Leader.
Lohrmann led Michigan government’s cybersecurity and technology infrastructure teams from May 2002 to August 2014, serving in enterprisewide Chief Security Officer (CSO), Chief Technology Officer (CTO) and Chief Information Security Officer (CISO) roles.
He currently serves as the Chief Security Officer (CSO) and Chief Strategist for Security Mentor Inc. He is leading the development and implementation of Security Mentor’s industry-leading cyber training, consulting and workshops for end users, managers and executives in the public and private sectors. He has advised senior leaders at the White House, National Governors Association (NGA), National Association of State CIOs (NASCIO), U.S. Department of Homeland Security (DHS), federal, state and local government agencies, Fortune 500 companies, small businesses and nonprofit institutions.
He has more than 30 years of experience in the computer industry, beginning his career with the National Security Agency. He worked for three years in England as a senior network engineer for Lockheed Martin (formerly Loral Aerospace) and for four years as a technical director for ManTech International in a US/UK military facility.
Lohrmann is the author of two books: Virtual Integrity: Faithfully Navigating the Brave New Web and BYOD for You: The Guide to Bring Your Own Device to Work. He has been a keynote speaker at global security and technology conferences from South Africa to Dubai and from Washington, D.C., to Moscow.
He holds a master's degree in computer science (CS) from Johns Hopkins University in Baltimore, and a bachelor's degree in CS from Valparaiso University in Indiana.
Follow Lohrmann on Twitter at: @govcso