September 10, 2010 By Dan Lohrmann
Editor’s Note: Dan Lohrmann is a columnist and blogger for Public CIO, a sister publication to Government Technology. Read his blog here.
There have been quite a few headlines lately about the current challenges facing Virginia’s government IT infrastructure. From an IEEE Spectrum article to Computerworld in the U.S. to the United Kingdom’s version of Computerworld, the situation has been covered globally in the mainstream and technology press. Virginia Gov. Bob McDonnell has even announced an independent review of the recent “unacceptable” computer outage.
For the past few weeks, many technology professionals across the country have quietly been watching and hoping for the best for our colleagues in Richmond, Va., where the failure of a storage area network knocked out government services for up to a week. Despite the online criticism, technology leaders in other governments recognize the potential ramifications for all of us. Several of us believe that technology and security professionals in government need to do some “infrastructure searching” and ask: Could a similar failure happen on my network?
This is one of those moments in time when technology professionals need to take a step back and ponder those nebulous what ifs.
Technology veterans who are honest with themselves recognize not only that such outages can happen, but also that we have already lived through several mini-crises. Over the past two weeks, I’ve received calls and e-mails from respected colleagues across the country with comments such as: “We recently had a major outage as well ... that almost caused a similar [widespread] impact. We were very fortunate that .... [something good happened].” Somehow, in each case, they pulled through and stayed below the public radar.
In other words, as Arizona-based technology analyst Robin Harris told the Washington Post: “People in the industry are watching ... as this unfolds. There’s a lot of ‘there but for the grace of God go I’ kind of thinking.”
No, we don’t yet have insider details regarding what happened in Virginia. In fact, as I write this, I know little more than what’s available from public reports. Our team will be getting briefings from related technology vendors, but those discussions will be under a nondisclosure agreement.
But before we get to potential action steps for the rest of us, let’s put this situation into historical context. From Y2K to 9/11 to the Northeast blackout of 2003 to spreading viruses to malware attacks to lost or stolen laptops, technology leaders are constantly being asked to prepare for and react to unexpected emergencies. Other times, the technology just doesn’t work as expected. E-mail fails — even for Google. Mission-critical systems can’t communicate, or networks go down in strange ways. Tech leaders worry about losing backup tapes containing sensitive information. Insider threats, such as San Francisco’s rogue network administrator in 2008, can get out of control.
No doubt, government IT shops know these things. We have on-site and off-site backups, disaster recovery (DR) plans, real-time redundancy, alternative systems, business recovery plans and more. We’ve dealt with weather emergencies and the aftermath of 9/11. We prepare with exercises like Cyberstorm I, II and III. We test our processes and procedures to prove we can respond and recover.
We’ve all been audited, and we respond with new approaches that are foolproof — until the functions don’t work as advertised in a crisis. Perhaps the scenario that was tested is not the one that occurs. Which leads us back to a tough question: What about my government’s technology infrastructure? We think about vendors and products. Where are our biggest weaknesses? How can we mitigate those risks and/or prepare for the unknown?
Don’t get me wrong. Following ITIL and building good DR plans are very important, and we can (and need to) continue to improve in these areas. And yet we still know that unexpected things do happen. How will your team respond? Who will they call? What is done in the first few minutes often determines how the recovery effort will proceed over the following days or weeks.
So here are five things to ponder before technology fails:
Think people, process and technology.
Are the DR plans workable? Has your staff been trained to execute quickly? We have found that people issues are the hardest to prepare for and resolve. In addition, emergencies generally go bad when two or three of these factors (people, process and technology) are involved in an incident, not when there is just a single technology failure or human error.
Communication is the key in a crisis.
Answer this: Who will your team members call and when? What will they say? Think like the fire department: How fast can the team respond? Also, proper expectations need to be set regarding recovery, or the trust will disappear between partners. Is the front line ready?
Look for the gray areas in DR and business continuity plans.
In Michigan, we’ve found that technical staff are often uncomfortable making the call to go to backups or pull the trigger on major recovery efforts. Techies tend to try to fix the problem themselves and not tell anyone. If you get management involved to quickly escalate issues, additional resources with a wider view of the problem can often remediate the issue before it spreads. Looking back, gray areas in our plans have hurt us. After the fact, we play “Monday morning quarterback” and realize we should have brought in vendor expertise earlier or gone to plan B faster.
You can never outsource the responsibility.
Where does the buck stop? No matter how good our vendor partners are, the government will always answer to the public when business functions are not available. Build a joint team and practice together with contract partners, but remember who will own the end-to-end result. Know the boundaries of contracts and test plans across those boundaries. Be accountable.
Practice makes perfect — almost.
Run drills, conduct tabletop exercises, talk about lessons learned from previous incidents, share stories and ask “what if” questions. Test scenarios. I like this quote from hall of fame football coach Vince Lombardi: “Practice does not make perfect. Only perfect practice makes perfect.”
Despite government’s best efforts, bad things will continue to happen to our technology infrastructures. It’s part of our job to help staff prepare for those situations. Like a respected football coach with a talented team and a good game plan that goes bad for any number of reasons, we need to be flexible enough to adjust and still win the ballgame. Or perhaps, after a tough loss, we need to bounce back and salvage the season.
Virginia’s government technology team, made up of Northrop Grumman and the Virginia Information Technologies Agency (VITA), may have done everything properly and still been confronted with this difficult situation. We will know more details soon enough. Even so, the team is known across the country as an excellent technology program with a reputation for leadership. This fact alone should cause each of us to pause and take notice.
Regardless of the outcome, they have shared best practices with other states at National Association of State Chief Information Officers (NASCIO) conferences. I am sure Virginia will bounce back and grow stronger from this.
For the rest of us, as we get ready to come together for the annual NASCIO conference in Miami at the end of September, many will be thinking about Virginia’s experience. We have entered a new decade where hardware, software, security, centralized data centers, cloud computing, mobile devices and more must work together. The complexity will be a challenge for every state and local government as we strive for increased efficiency.
Therefore, we need to be looking internally and asking (one more time): If technology fails, now what?
I’d appreciate hearing your views on this situation or on similar challenges in your government technology program. Please comment on my blog.