September 6, 2010 By Dan Lohrmann
There have been quite a few headlines lately about the current challenges facing Virginia's government technology infrastructure. From IEEE Spectrum to Computerworld in the USA to the United Kingdom's edition of Computerworld, the situation has been covered globally in the mainstream and technology press. Virginia Governor Bob McDonnell has even announced an independent review of the recent "unacceptable" computer outage.
For the past few weeks, many technology professionals around the country have quietly been watching and hoping for the best for our colleagues in Richmond, Virginia. Despite online criticism, technology leaders in other governments recognize the potential ramifications for all of us. Several of us believe that technology and security pros in government need to do some infrastructure-searching and ask: could a similar failure happen on my network? This is one of those "moments in time" when technology professionals need to take a step back and ponder those nebulous "what ifs."
Honest technology veterans do more than recognize that such outages can happen; we have lived through several mini-crises ourselves. Over the past two weeks, I've received calls and e-mails from respected colleagues around the country with comments such as: "We recently had a major outage as well... that almost caused a similar (widespread) impact. We were very fortunate that.... (some good thing happened)." Somehow, in each case, they pulled through and stayed below the public radar.
Or, as the Washington Post quoted Arizona technology analyst Robin Harris: "People in the industry are watching ... as this unfolds. There's a lot of 'there but for the grace of God go I' kind of thinking."
No, we don't have insider details regarding what happened in Virginia. In fact, as I write this blog post, I know little more than what's available from public reports. (Our team will be getting briefings from related technology vendors this week, but those discussions will be under a non-disclosure agreement.)
But before we get to potential action steps for the rest of us, let's put this situation into historical context. From Y2K to 9/11 to the Northeast blackout of 2003 to spreading viruses to malware attacks to lost or stolen laptops, technology leaders are constantly being asked to prepare for and react to unexpected emergencies. Other times, the technology doesn't work as expected. Email fails - even for Google. Mission-critical systems can't communicate, or networks go down in strange ways. Tech leaders worry about losing backup tapes containing sensitive information. Insider threats, such as the 2008 incident in San Francisco, can get out of control.
No doubt, government technology shops know these things. We have onsite and offsite backups, DR plans, real-time redundancy, alternative systems, business recovery plans and more. We've dealt with weather emergencies and the aftermath of 9/11. We prepare with exercises like Cyber Storm I, II and III. We test our processes and procedures to prove we can respond and recover.
We've all been audited, and we respond with new approaches that are foolproof - until the functions don't work as advertised in a crisis. Perhaps the scenario that was tested is not the one that occurs. Which leads us back to that tough question - what about my government's technology infrastructure? We think about vendors and products. Where are our biggest weaknesses? How can we mitigate those risks and/or prepare for the unknown?
Don't get me wrong. Following ITIL and building good DR plans are very important, and we can (and need to) continue to improve in these areas. And yet we still know that unexpected things do happen. How will your team respond? Who will they call? What is done in the first few minutes often shapes how the recovery effort proceeds over the following days or weeks.
So here are five things to ponder before technology fails:
1) Think people, process and technology. Are the DR plans workable? Has your staff been trained to execute quickly? We have found that people issues are the hardest to prepare for and resolve. In addition, emergencies generally go bad when two or three of these are involved in an incident - and not just a single failure of technology or a human error.
2) Communication is the key in a crisis. Answer this: Who will your team members call and when? What will they say? Just like the fire department: How fast can the team respond? Also, proper expectations need to be set regarding recovery, or trust between partners will disappear. Is the front line ready?
3) Look for the gray areas in DR and business continuity plans. In Michigan, we've found that technical staff are often uncomfortable making the call to go to backups or pull the trigger on major recovery efforts. Techies tend to try to fix the problem themselves and not tell anyone. If you get management involved to quickly escalate issues, additional resources with a wider view of the problem can often remediate the issue before it spreads. Looking back, gray areas in our plans have hurt us. After the fact, we play "Monday morning quarterback" and realize we should have brought in vendor expertise earlier or gone to "Plan B" faster.
4) You can never outsource the responsibility. Where does the buck stop? No matter how good our vendor partners are, the government will always answer to the public when business functions are not available. Build a joint team and practice together with contract partners, but remember who will own the end-to-end result. Know the boundaries of contracts and test plans across those boundaries. Be accountable.
5) Practice makes perfect - almost. Run drills, conduct tabletop exercises, talk about lessons learned from previous incidents, share stories, ask "what if" questions. Test scenarios. I like this quote from Vince Lombardi: "Practice does not make perfect. Only perfect practice makes perfect."
Despite our best efforts, bad things will continue to happen to our technology infrastructures. It is part of our job to help staff prepare for those situations. Like a respected football coach with a talented team and a good game plan that goes bad for any number of reasons, we need to be flexible enough to adjust and still win the ballgame. Or perhaps, after a tough loss, we need to bounce back and salvage the season.
Virginia's government technology team may have done everything properly and yet still be confronted with this difficult situation. We will know more details soon enough. What we do know is that they are regarded around the country as an excellent technology program with a reputation for leadership. This fact alone should cause each of us to pause and take notice.
Regardless of the outcome, they are also respected partners in government who have shared best practices with other states at National Association of State Chief Information Officers (NASCIO) conferences. I am sure Virginia will bounce back and grow stronger through this.
For the rest of us, as we get ready to come together for the annual NASCIO conference in Miami at the end of this month, many will be thinking about Virginia's experience. We have entered a new decade where hardware, software, security, centralized data centers, cloud computing, mobile devices and more must work together. The complexity will be a challenge for every state and local government as we strive for increased efficiency.
Therefore, we need to be looking internally and asking (one more time): If technology fails, now what?
I'd appreciate hearing your views on this situation or on similar challenges in your government technology program.