Most technology leaders know that sinking feeling. The phone rings, and the voice at the other end says, “The mainframe just crashed.” Or, “We lost power at the data center and some of the uninterruptible power supply units (or the generator) didn’t work properly.” Just as scary: “Our vendor’s network is down. The incident is impacting thousands of customers.”

Computer and network outages — and the corresponding ramifications — come with the IT territory. Even when services are outsourced, the ultimate responsibility still rests with the public CIO. Despite mind-numbing thoughts of “what if,” our teams must implement recovery efforts just as a fire department responds to fires. And yes, seconds matter.

While the need to activate a full-scale disaster recovery plan may be rare, operations personnel deal with varying types of critical incidents regularly. But how effective is your team in these situations? What’s your recovery time objective when things go wrong? Simply stated: Are you ready for the next significant outage? 

Key Considerations

So what are some of the keys to a successful outage remediation?

1) Understand the outage scope, your options and timelines. Just as the military wants intelligence regarding enemy movements in a war, operations leaders must quickly grasp the extent of an operational emergency. Good monitoring tools, end-to-end system management capabilities and qualified operations staff are essential for achieving timely restoration of service. 

Tip: Beyond asking what happened, ask whether anything changed. Can you roll back to the previous configuration? Use requests for change and change control boards to track activity. In Michigan, we activate our Emergency Contact Center during major incidents to ensure that the situation gets the right priority. All key resources gather (virtually or in person) to coordinate recovery options.
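The "did anything change?" question is easier to answer when change records are kept in a form that can be queried. Here is a minimal sketch in Python. It assumes a hypothetical ChangeRecord structure, a 24-hour look-back window and made-up sample records. None of these names or values come from Michigan's actual tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class ChangeRecord:
    """Hypothetical change-request record; field names are illustrative."""
    change_id: str
    system: str
    applied_at: datetime
    rollback_available: bool


def recent_changes(records, outage_start, affected_systems, window_hours=24):
    """Return changes applied to the affected systems within the look-back
    window before the outage began, most recent first."""
    window_start = outage_start - timedelta(hours=window_hours)
    suspects = [
        r for r in records
        if r.system in affected_systems
        and window_start <= r.applied_at <= outage_start
    ]
    return sorted(suspects, key=lambda r: r.applied_at, reverse=True)


# Made-up sample data for illustration only.
records = [
    ChangeRecord("RFC-101", "mainframe", datetime(2024, 1, 9, 22, 0), True),
    ChangeRecord("RFC-102", "network", datetime(2024, 1, 10, 1, 30), False),
    ChangeRecord("RFC-099", "mainframe", datetime(2024, 1, 5, 9, 0), True),
]
outage_start = datetime(2024, 1, 10, 2, 0)

suspects = recent_changes(records, outage_start, {"mainframe", "network"})
for r in suspects:
    action = "roll back" if r.rollback_available else "investigate"
    print(f"{r.change_id} on {r.system}: {action}")
```

Sorting the most recent change first reflects a common rule of thumb, not a guarantee: the last change before an outage is often the first one worth examining.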

2) Develop clear roles and responsibilities. Early decisions are often the key. Who’s in charge and what resources are available? Should we keep fixing the problem or activate the disaster recovery plan? What resources or vendor relationships can help?

Seasoned pros who have been through outages know that conflicting information and competing interests often emerge. Sometimes the technical staff will underestimate the issue or overestimate their ability to remediate what happened, making matters worse.

Tip: Developing “run books,” compilations of the procedures and operations that the system administrator or operator carries out, can help navigate outages. A good run book includes procedures for every anticipated scenario and generally uses step-by-step decision trees to determine the most effective course of action.
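A run book decision tree can be as simple as a set of yes/no questions that end in an action. The sketch below is one possible way to encode such a tree in Python. The Decision and Action classes, the questions and the actions are all hypothetical, chosen to echo the questions raised in this column rather than taken from a real run book.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Action:
    """Leaf of the tree: what the operator should do."""
    instruction: str


@dataclass
class Decision:
    """Interior node: a yes/no question with a branch for each answer."""
    question: str
    if_yes: Union["Decision", Action]
    if_no: Union["Decision", Action]


def walk(node, answers):
    """Follow the tree using a dict of question -> bool.
    Returns the list of (question, answer) steps taken and the final instruction."""
    path = []
    while isinstance(node, Decision):
        answer = answers[node.question]
        path.append((node.question, answer))
        node = node.if_yes if answer else node.if_no
    return path, node.instruction


# Hypothetical questions and actions for illustration only.
Q_CHANGED = "Did anything change recently?"
Q_ROLLBACK = "Can the change be rolled back?"
Q_WITHIN_RTO = "Can service be restored within the recovery time objective?"

runbook = Decision(
    Q_CHANGED,
    if_yes=Decision(
        Q_ROLLBACK,
        if_yes=Action("Roll back to the previous configuration."),
        if_no=Action("Escalate to the incident lead and the vendor."),
    ),
    if_no=Decision(
        Q_WITHIN_RTO,
        if_yes=Action("Continue fixing the problem in place."),
        if_no=Action("Activate the disaster recovery plan."),
    ),
)

answers = {Q_CHANGED: False, Q_WITHIN_RTO: False}
path, action = walk(runbook, answers)
for question, answer in path:
    print(f"{question} {'yes' if answer else 'no'}")
print("Action:", action)
```

Recording the path as well as the final action gives the team a record of the decisions made, which can feed the root-cause analysis afterward.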

3) Promote excellent communication. When critical systems are down, everyone counts the minutes. Perception is reality, and while some loss-of-service situations will make the local news and others won’t, public perception can impact your actions. Remember that communication continues after systems are restored. A good root-cause analysis listing lessons learned — including people, process and technology activities — should be provided to clients after appropriate review. 

Tip: Develop an emergency communication plan for dealing with internal and external stakeholders. Don’t let this become shelfware; practice different scenarios during tabletop exercises. Meeting customer expectations and building confidence in your statements are as important as restoring service. Don’t make promises you can’t keep.

In May, Michigan had two outages that made the news. Fortunately our experienced public information officer handled all media inquiries with expert precision. He knew what questions would be asked, who to contact internally to get the facts and what to say about restoration times.

In conclusion: Despite our best efforts, technology outages are inevitable. Cloud computing and more smartphones in the enterprise will further complicate end-to-end service restoration and escalate the need to partner with vendors. Prepare now for the unexpected.

Dan Lohrmann is Michigan’s CTO and previously served as the state’s first chief information security officer. He has 25 years of worldwide security experience, and has won numerous awards for his leadership in the information security field.

Dan Lohrmann  |  Michigan's Chief Security Officer

Daniel J. Lohrmann became Michigan's first chief security officer (CSO) and deputy director for cybersecurity and infrastructure protection in October 2011. Lohrmann is leading Michigan's development and implementation of a comprehensive security strategy for all of the state’s resources and infrastructure. His organization is providing Michigan with a single entity charged with the oversight of risk management and security issues associated with Michigan assets, property, systems and networks.

Lohrmann is a globally recognized author and blogger on technology and security topics. His keynote speeches have been heard at worldwide events, such as GovTech in South Africa, IDC Security Roadshow in Moscow, and the RSA Conference in San Francisco. He has been honored with numerous cybersecurity and technology leadership awards, including “CSO of the Year” by SC Magazine and “Public Official of the Year” by Governing magazine.

His Michigan government security team’s mission is to:

  • establish Michigan as a global leader in cyberawareness, training and citizen safety;
  • provide state agencies and their employees with a single entity charged with the oversight of risk management and security issues associated with state of Michigan assets, property, systems and networks;
  • develop and implement a comprehensive security strategy (Michigan Cyber Initiative) for all Michigan resources and infrastructure;
  • improve efficiency within the state’s Department of Technology, Management and Budget; and
  • provide combined focus on emergency management efforts.

He currently represents the National Association of State Chief Information Officers (NASCIO) on the IT Government Coordinating Council that’s led by the U.S. Department of Homeland Security. He also serves as an adviser on TechAmerica's Cloud Commission and the Global Cyber Roundtable.

From January 2009 until October 2011, Lohrmann served as Michigan's chief technology officer and director of infrastructure services administration. He led more than 750 technology staff and contractors in administering functions, such as technical architecture, project management, data center operations, systems integration, customer service (call) center support, PC and server administration, office automation and field services support.

Under Lohrmann’s leadership, Michigan established the award-winning Mi-Cloud data storage and hosting service, and his infrastructure team was recognized by NASCIO and others for best practices and for leading state and local governments in effective technology service delivery.

Earlier in his career, Lohrmann served as the state of Michigan's first chief information security officer (CISO) from May 2002 until January 2009. He directed Michigan's award-winning Office of Enterprise Security for almost seven years.

Lohrmann's first book, Virtual Integrity: Faithfully Navigating the Brave New Web, was published in November 2008. Lohrmann was also the chairman of the board for 2008-2009 and past president (2006-2007) of the Michigan InfraGard Member's Alliance.

Prior to becoming Michigan's CISO, Lohrmann served as the senior technology executive for e-Michigan, where he published an award-winning academic paper titled The Story — Reinventing State Government Online. He also served as director of IT and CIO for the Michigan Department of Management and Budget in the late 1990s.

Lohrmann has more than 26 years of experience in the computer industry, beginning his career with the National Security Agency. He worked for three years in England as a senior network engineer for Lockheed Martin (formerly Loral Aerospace) and for four years as a technical director for ManTech International in a U.S./UK military facility.

Lohrmann is a distinguished guest lecturer for Norwich University in the field of information assurance. He also has been a keynote speaker at IT events around the world, including numerous SecureWorld and ITEC conferences in addition to online webinars and podcasts. He has been featured in numerous daily newspapers, radio programs and magazines. Lohrmann writes a bimonthly column for Public CIO magazine on cybersecurity. He's published articles on security, technology management, cross-boundary integration, building e-government applications, cloud computing, virtualization and securing portals.

He holds a master’s degree in computer science from Johns Hopkins University in Baltimore and a bachelor’s degree in computer science from Valparaiso University in Indiana.

NOTE: The postings on this blog are Dan Lohrmann's own views. The opinions expressed do not necessarily represent the state of Michigan's official positions.

Recent Awards:
2011 Technology Leadership Award: InfoWorld
Premier 100 IT Leader for 2010: Computerworld magazine
2009 Top Doers, Dreamers and Drivers: Government Technology magazine
Public Official of the Year: Governing magazine — November 2008
CSO of the Year: SC Magazine — April 2008
Top 25 in Security Industry: Security magazine — December 2007
Compass Award: CSO Magazine — March 2007
Information Security Executive of the Year: Central Award 2006