The Big Halt: 7 Lessons from Recent Computer Outages

On July 8, 2015, a string of major computer outages occurred at approximately the same time, grabbing global media attention. Significant operational disruptions occurred as a result of computer incidents at the New York Stock Exchange (NYSE), the Wall Street Journal (WSJ) and United Airlines. The nation briefly woke up to our reliance on technology and got a small taste of the fear that may come if a cyberattack cripples critical infrastructure. What lessons can we learn from these incidents? How can public and private-sector enterprises better prepare for the disruptions that will inevitably follow?

by / July 12, 2015


Credit: AP/Seth Wenig

The nation breathed a collective sigh of relief last Wednesday evening. Global news anchors used calming words as they explained that serious operational outages across three different sectors of the U.S. economy were just a series of rare, but unrelated, computer glitches. While the images being shown depicted stunned traders on the floor of the New York Stock Exchange (NYSE) and passengers looking befuddled in long lines at airports, the expert analysis emphasized that this was not a cyberattack.

As I listened to and read the key messages on CNN, Fox News, ABC, several major newspapers and PBS Newshour on Wednesday evening, the words from technology and security experts blended into one message. The major themes can be summarized as follows:

As unlikely as this may seem, there were three different major computer outages that occurred at the same time as a result of human error, configuration errors or router failures.

There is no need to be alarmed; these things do occur on a regular basis....

This was not the work of hackers or outside terrorist threats! The U.S. Department of Homeland Security (DHS) and the White House have both confirmed this – to the best of their knowledge. ...

Yes, it was a bit unnerving, but we should not be surprised. It only happens every few years....

The LA Times reported, “On most days, the Internet and the myriad systems it powers can be counted on to work well enough. But security experts say problems are inevitable, through hacking, human error, broken cables, buggy code or other unforeseen issues.”

In reality, we should be glad that these events are rare - even amazing that computer-related interruptions don’t happen more often....

However, even though hackers were not involved, the three incidents do show our nation’s precarious dependence on technology….

Almost Like a Movie Trailer

The series of incidents could have ended very differently. For a few hours, disparate events appeared very similar to hacker movie trailers describing the early hours of a major cyberattack against the nation. The feeling in the air was very tense on Wednesday afternoon. Even those who continually fight against conspiracy theories later admitted that they were worried. 

Across the Internet, blogs and tweets were asking: What happened? How can this be explained? When will systems be back up? Who’s in charge?

Note: The very fact that the White House and DHS made public statements about these incidents was fascinating, since these different operational issues cut across the traditional silos of the transportation sector, financial sector and news media sector.

Stock prices of cybersecurity companies surged when traders and others suspected a cyberattack.

News alerts and breaking headlines started asking root cause questions such as:

“Was this linked to an Anonymous message from last night?”

“Could this be linked to the recent China hacking of OPM databases?”

“Perhaps this had something to do with sheer volume of stock trades or with the Greek crisis with the EU?”

“Stock market fluctuations led some to wonder whether to get out of the market, if you can’t deal with these situations?”

Some people started connecting the dots in nefarious ways with fearful thoughts of China’s stock market troubles, Greece’s debt issues, airline problems and more.

Perhaps some major Internet cyberattack was going on. Could this be the beginning of a Cyber Pearl Harbor?

The LA Times proclaimed, “The outages came one after another: one of the nation's biggest airlines, its largest financial news publication and its main stock exchange.”

Putting These Outage Events into Proper Context – and Asking Tough Questions

Computer outages will always happen, albeit on an irregular basis. That was another frequent point that experts made last week.

For example, Delta Air Lines had a significant computer outage back in February 2015: “Passengers on Delta Air Lines Monday afternoon were unable to check in for their flights because of a malfunction with the airline's website, mobile app and airport kiosks.”

Southwest Airlines' website was down last month because a fall travel sale increased load beyond expectations.

And this is not the first time that the NYSE or other major stock markets were down. The history of stock market outages goes back years.

Nevertheless, many news stations were asking fundamental questions during the aftermath of these outages. A sampling of questions included:

Can we keep these major events to a minimum?

How can we respond more effectively when they do happen? Are organizations truly prepared?

Can we possibly secure the Internet, with billions more people and "things" expected to join the Net in the future?

How do the dreaded configuration errors, software upgrades, and router failures keep causing these major problems?

What if these issues had been the result of deliberate insider or external cyberattacks? What if we get a blended attack that includes both system outages and hacking?

What Lessons Can We Learn from This?

So can we turn these operational lemons into lemonade? What lessons can help us in the future as we face outages?

1) The No. 1 issue in any emergency situation (or serious incident) is: How do you communicate – both internally and externally? Customers, stockholders, 24/7 news channels and numerous others want to know what is going on. The difficulty is that the true cause of many outages is not known for several hours, or perhaps days, while the media wants an instant response.

Notice how CNN asked outside experts (who had general experience but were not working on these exact issues) what they thought the cause of the outage was. When the media is not getting answers, they find their own people who will talk.

What can you do? Tell the truth, but only what you truly know, and get in front of the media quickly. Don’t jump to conclusions, but ensure proper internal communication between the public information officers (PIOs) and the technical teams. Know who’s in charge. Clear roles and responsibilities are key.

Have a well-tested communication plan that covers different scenarios. Make sure everyone knows where to go and what is expected. Practice implementing your plans in regular exercises (or simulations). Practice. Practice. Practice.

Forbes wrote that the United Airlines outage was a huge communications failure, but I also thought that the federal government did a good job in communications on cybersecurity – given that this was a series of private-sector computer failures. 

2) Practice troubleshooting processes for critical system restorations before outages occur. Build team skills to quickly determine what went wrong and why. Know your system so well that you can quickly determine what process caused the outage.

What if it was a cyberattack? Be careful what you say to management and when – even though you will be pressed for quick answers.

This article by James Lewis in The Washington Post makes some valid points. He was correct in this instance about no cyberattack, but I disagree with his overall conclusions about potential future incidents.

Security and technology teams must be ready for cyberattacks against critical infrastructure, and recent studies and reports show that destructive cyberattacks are on the rise.

One online comment after this article read: “The motivation behind computer crime can be anything: financial gain, curiosity, revenge, boredom, "street cred," delusions of grandeur, etc. Except for the fact that they possess a certain skill, hackers aren't any different from other types of criminals....”

3) Assuming that you are dealing with system issue(s) and not a cyberattack, roll back to an earlier version that is trustworthy. Don’t try to "fix" configuration issues, system settings or new code during an unplanned outage. Bottom line: Get back to an earlier working version of your system ASAP.

Whenever there is a computer outage, technology teams should always ask: What changed? Who updated hardware, software, firmware or something else? Assuming good configuration control, roll back to the previous working version rather than attempting to fix things in an ad hoc or untested manner.

Quick, untested "fixes" can cause more problems – but technical staff will say, "I know how to fix this. ..."

Do not get caught in the vicious circle of fixing the problems on the fly while systems and customers are down.

TIP: If nothing was scheduled to change, ask again whether someone did something that wasn’t planned or authorized. Sadly, people hide actions at work more often than they admit.
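The "What changed?" question and the rollback decision in this lesson can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Change` record, the `verified_good` flag, the 24-hour look-back window and the sample entries are all hypothetical, and a real organization would pull this data from its own configuration management or change-control system.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)
class Change:
    """One entry in a hypothetical change log (not any real tool's format)."""
    applied_at: datetime
    component: str
    version: str
    verified_good: bool  # assumed flag: release passed its post-deploy checks


def changes_before(log: List[Change], outage_start: datetime,
                   window_hours: float = 24.0) -> List[Change]:
    """Answer 'What changed?': changes applied shortly before the outage, newest first."""
    window_seconds = window_hours * 3600
    recent = [
        c for c in log
        if 0 <= (outage_start - c.applied_at).total_seconds() <= window_seconds
    ]
    return sorted(recent, key=lambda c: c.applied_at, reverse=True)


def last_known_good(log: List[Change], component: str,
                    outage_start: datetime) -> Optional[Change]:
    """Pick the newest verified version of a component from before the outage."""
    candidates = [
        c for c in log
        if c.component == component and c.verified_good and c.applied_at < outage_start
    ]
    return max(candidates, key=lambda c: c.applied_at, default=None)


# Made-up sample data, for illustration only.
sample_log = [
    Change(datetime(2023, 12, 20, 3, 0), "booking-app", "3.1", True),
    Change(datetime(2024, 1, 3, 2, 0), "router-config", "v41", True),
    Change(datetime(2024, 1, 10, 7, 0), "router-config", "v42", False),
]
outage = datetime(2024, 1, 10, 9, 30)

for suspect in changes_before(sample_log, outage):
    target = last_known_good(sample_log, suspect.component, outage)
    print(f"Suspect change: {suspect.component} {suspect.version}")
    print(f"Roll back to:   {target.version if target else 'no verified version on record'}")
```

The sketch deliberately does not try to repair anything. It only narrows the list of suspects and names the version to return to, which mirrors the advice above: restore first, diagnose later.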

4) Work closely with internal and external partners before, during and after incidents.  This series of computer outages turned out to be a great example of how partners can coordinate, communicate and dispel myths (in this case a hack attack). These partners include business delivery partners, government organizations, clients and many more.

One cyber exercise that we conducted in Michigan a few years back brought together hundreds of people and dozens of internal and external partners. There were many takeaways that strengthened our emergency plans. 

Ask: What are partners seeing? How can you coordinate messages and actions? What comes next? Release joint press announcements where possible.

5) Plan, anticipate and coordinate incident actions using an incident management plan that is frequently tested.  Ask “what if” regarding outages and cyberattacks.  Hopefully, system changes have been fully tested before being placed into the operational network. But obviously, in these cases, something went wrong.

You should have written scenarios that everyone on the team needs to know and understand. Address 'what-if' scenarios. 

This means that someone must decide when to roll back to earlier versions. More importantly, who decides when to declare that an incident is an emergency and to inform the right people, using the emergency management plan or incident management plan as the guide?

Ask: How much time is needed to restart operations? Do we have the right staff in place? If not, get them to report immediately. Are the disaster recovery (DR) plans up to snuff?  How do you relate cybersecurity incident response plans to operational computer or system outage plans? Should this be done on a weekend? Do we have enough outage downtime planned?

Sometimes system updates and upgrades need to be spread out over longer periods of time. The end of this LA Times article explains: “The problem of computer glitches has become so common that American Airlines announced in May that it planned to combine its reservation system with that of US Airways over a 90-day period to reduce the chances of a system failure.”  

6) Review your network(s) and system(s) architecture(s) on a regular basis. What single points of failure exist for critical systems? Ask basic questions again: What systems truly have the redundancy required?

I usually laugh when the news media describe a major system outage that affects millions of people as a simple router going down, a computer hard drive failing or some other very simple explanation – almost as if someone unplugged a toaster. It is almost never this simple.

My children, who are familiar with someone unplugging our home Wi-Fi connection to reboot our Internet access, sometimes ask me: Can that really happen to an airline for the entire country or the NYSE? Does that make sense?

In a word – no. These systems have hundreds of millions of dollars of redundancy built into them. Something else, usually human error, causes most problems.

However, I have been in situations where “what-if” operational tests and exercises themselves caused an outage.

As your team changes, systems merge, upgrades occur or unanticipated elements are inserted or taken out of network and system architectures, bad things can happen. The key is to keep asking if all the complexity is needed. Strive to simplify. Use ITIL to manage changes.

TIP: Watch out for single points of failure with people as well. As baby-boomers retire and expertise walks out the door for new jobs, new staff may not know what they do not know. Poor training and backup of staff cause many computer outages.
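The single-point-of-failure question in this lesson can be explored with a simple model. The following is a minimal sketch, assuming the architecture can be drawn as a directed graph from an entry point to a critical service. The component names (`lb-a`, `router`, `db` and so on) are invented for illustration, and the brute-force approach (remove each component in turn and check whether the service is still reachable) is my own simplification, not a description of how any of the affected organizations review their systems.

```python
from collections import deque
from typing import Dict, List, Optional, Set

Graph = Dict[str, Set[str]]  # component -> components it forwards traffic to


def reachable(graph: Graph, start: str, goal: str,
              removed: Optional[str] = None) -> bool:
    """Breadth-first search from start to goal, pretending `removed` has failed."""
    if removed in (start, goal):
        return False
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for neighbor in graph.get(node, set()):
            if neighbor != removed and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return False


def single_points_of_failure(graph: Graph, start: str, goal: str) -> List[str]:
    """List every component whose failure alone cuts start off from goal."""
    nodes = set(graph) | {n for neighbors in graph.values() for n in neighbors}
    return sorted(
        n for n in nodes - {start, goal}
        if not reachable(graph, start, goal, removed=n)
    )


# Hypothetical architecture: two load balancers and two app servers,
# but everything funnels through one router.
architecture: Graph = {
    "users": {"lb-a", "lb-b"},
    "lb-a": {"router"},
    "lb-b": {"router"},
    "router": {"app-1", "app-2"},
    "app-1": {"db"},
    "app-2": {"db"},
}

print(single_points_of_failure(architecture, "users", "db"))  # ['router']
```

A model like this only catches structural gaps. It says nothing about shared configuration, shared software versions or the people who maintain the components, which is where the harder failures described above tend to originate.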

7) Stay calm and focused. Don’t panic, and we will get through this. The key word around the world today seems to be resiliency – for system outages and cyberattacks. Build and maintain systems that can quickly bounce back – even when things go wrong. Display confident leadership.

I was amazed at how well things ended up this week, after how bad things looked early in the day on Wednesday. My hat’s off to all who got things back to normal.

Closing Thoughts 

The more things change, the more they stay the same. The computer outage issues that we are discussing are not new. The past few decades offer examples surrounding disaster recovery, business continuity planning and emergency management gone right and wrong.

Yes, the stakes are higher, much higher, with these major systems growing larger and the Internet becoming more complex with more cyberattacks.

The nation woke up, for at least one day this past week, to the huge dependence that we have on technology. The Luddite commentaries re-emerged, for a moment in time. Now, things are getting “back to normal,” which means most people quickly forget.

Nevertheless, we must remember that, as technology and security professionals, we have a huge responsibility to improve. In these emergencies and serious incidents, we are also fighting for the hearts and minds of our customers and the nation. We must avoid panic – in airports or stock exchanges or government services.

The question remains: Are we truly prepared to do our duty when the alarms go off?

Dan Lohrmann, Chief Security Officer & Chief Strategist at Security Mentor Inc.

Daniel J. Lohrmann is an internationally recognized cybersecurity leader, technologist, keynote speaker and author.

During his distinguished career, he has served global organizations in the public and private sectors in a variety of executive leadership capacities, receiving numerous national awards including: CSO of the Year, Public Official of the Year and Computerworld Premier 100 IT Leader.
Lohrmann led Michigan government’s cybersecurity and technology infrastructure teams from May 2002 to August 2014, including enterprisewide Chief Security Officer (CSO), Chief Technology Officer (CTO) and Chief Information Security Officer (CISO) roles in Michigan.

He currently serves as the Chief Security Officer (CSO) and Chief Strategist for Security Mentor Inc. He is leading the development and implementation of Security Mentor’s industry-leading cyber training, consulting and workshops for end users, managers and executives in the public and private sectors. He has advised senior leaders at the White House, National Governors Association (NGA), National Association of State CIOs (NASCIO), U.S. Department of Homeland Security (DHS), federal, state and local government agencies, Fortune 500 companies, small businesses and nonprofit institutions.

He has more than 30 years of experience in the computer industry, beginning his career with the National Security Agency. He worked for three years in England as a senior network engineer for Lockheed Martin (formerly Loral Aerospace) and for four years as a technical director for ManTech International in a US/UK military facility.

Lohrmann is the author of two books: Virtual Integrity: Faithfully Navigating the Brave New Web and BYOD for You: The Guide to Bring Your Own Device to Work. He has been a keynote speaker at global security and technology conferences from South Africa to Dubai and from Washington, D.C., to Moscow.

He holds a master's degree in computer science (CS) from Johns Hopkins University in Baltimore, and a bachelor's degree in CS from Valparaiso University in Indiana.

Follow Lohrmann on Twitter at: @govcso