Managing an IT crisis can be complex and difficult. But government CIOs and their operations teams can change the situation and improve the chances for success with better advance planning, coordination and training.
In times of crisis, first responder teams know that coordination, cooperation and communication are the keys to success. These elements provide police, EMTs and firefighters with detailed situational awareness — a common operating picture that allows them to quickly and efficiently identify the issue and immediately begin working on a solution.
IT operations administrators are not first responders. Yet frequently they find themselves fighting crime, putting out fires and rushing to save critical technology services. Unlike emergency responders, IT operations teams today do not have the necessary tools to form that common operating picture.
Whether the cause is constrained budgets, inadequate IT strategy, a lack of leadership and coordination, or something else, IT teams are too often unprepared for a tech emergency. That means when mission-critical systems go down (for example, when glitches interrupt government functions, a 911 dispatch center is breached, or a flight management system outage strands thousands), recovery is a scramble.
That lack of preparedness poses a problem because technology will play an increasingly important role in tomorrow’s government. For CIOs and IT operations teams, the remedy lies in adopting three lessons in preparedness that first responders already practice: gain visibility, establish context and train before problems arise.
Among the biggest challenges in managing IT operations today are the siloed nature and increasing complexity of IT systems. New applications now live in hybrid IT infrastructures, operating partially on-premises and partially in the cloud. Meanwhile, legacy applications remain too ingrained to upgrade or replace, growing older and less reliable each year.
Further compounding the problem, budgets are not increasing to accommodate needed modernization, and there is a growing gap between the longevity of an IT solution and the quick turnover of IT personnel. The result? IT teams are using management solutions they had no hand in choosing or deploying and are in charge of managing systems they don't fully understand. It seems basic, but the importance of visibility cannot be overstated. Being able to see across the IT environment is a critical first step toward better, faster IT crisis response.
Visibility, though, is only the first part of better problem management. Context also matters. When IT problems arise, organizations typically pull together a “tiger team,” a group of systems administrators each responsible for a specific part of the IT stack, to respond. In theory, these individuals work to discover where the problem lies and come up with a fix. In practice, however, they each approach the problem via their own set of siloed tools, working first to prove their team is not at fault and absolve themselves of responsibility. This “blamestorming” breeds confusion, inefficiency and divisiveness. It hinders root cause determination and delays the point at which remediation efforts can begin in earnest.
The reality is that without a holistic picture, everything might appear to be working just fine within each of the siloed teams even though the end-user experience is clearly demonstrating otherwise.
Modern IT stacks are highly interconnected, and IT teams need to understand not only their own systems but also how those systems interrelate with others. Too often, IT teams are resigned to looking for problems only in areas where they have visibility, while the root cause may well lie elsewhere.
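To make the point about interrelated systems concrete, here is a minimal sketch of how a shared dependency map could point a response team beyond its own silo. The service names and the `DEPENDS_ON` map are hypothetical, invented purely for illustration; a real environment would draw this information from its own inventory or monitoring tools.

```python
from collections import deque

# Hypothetical dependency map (an assumption for illustration):
# each service lists the services it relies on directly.
DEPENDS_ON = {
    "citizen_portal": ["web_tier", "identity_service"],
    "web_tier": ["app_server"],
    "app_server": ["database", "payment_gateway"],
    "identity_service": ["directory"],
    "database": [],
    "payment_gateway": [],
    "directory": [],
}


def upstream_dependencies(service, depends_on):
    """Return every service that `service` relies on, directly or
    indirectly, in breadth-first order."""
    seen = []
    queue = deque(depends_on.get(service, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.append(current)
        queue.extend(depends_on.get(current, []))
    return seen


def candidate_root_causes(service, depends_on, unhealthy):
    """List the upstream services of `service` that are currently
    flagged as unhealthy."""
    return [s for s in upstream_dependencies(service, depends_on)
            if s in unhealthy]


# The portal team sees slow pages, but the only unhealthy component
# sits three layers down, outside the portal team's own tools.
print(candidate_root_causes("citizen_portal", DEPENDS_ON, {"database"}))
```

In this sketch the team that owns the portal would find nothing wrong in its own layer; only the combined view shows that the trouble sits in a component another team manages.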
Only by “helicoptering up” to get a broad, contextualized view can you form the common operating picture needed to quickly address challenges or prevent an existing challenge from growing worse.
Performance is perfected with practice. Just as first responders train for critical situations, IT operations teams should also prepare for problems before they arise. One of the most powerful tools for delivering visibility and situational awareness is a strong application performance monitoring (APM) and end-user experience monitoring platform.
However, more often than not, APM solutions are only deployed reactively. The reality is that emergency preparedness can greatly decrease mean time to resolution when problems occur. IT operations teams and CIOs should think proactively about what infrastructure and applications they own, who is responsible for managing them, how their operations are structured, what visibility and collaboration challenges might arise and what tools they have (or more importantly do not have) to address them.
Leveraging an APM solution in advance of problems has been proven to reduce the time it takes to remediate critical issues by as much as 70 percent. It also can help to reduce the likelihood that problems will occur in the first place. Most important, through early warning alerts, APMs can proactively identify issues so disruptions to the end-user experience are either greatly diminished or never occur at all. Indeed, this is the approach the state of Texas took with its Texas.gov website and application, the central portal where 27 million Texas citizens go to connect with government services.
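As a rough illustration of the early-warning idea, and not a description of any particular APM product or of the Texas.gov deployment, the sketch below flags response-time samples that rise well above a rolling baseline. The function name, the window size, the threshold and the sample data are all assumptions chosen for the example.

```python
from statistics import mean, stdev


def early_warning(samples, window=5, threshold=3.0):
    """Return the indexes of samples that exceed the rolling baseline.

    The baseline for each sample is the mean of the preceding `window`
    samples; a sample triggers an alert when it is more than
    `threshold` standard deviations above that mean.
    """
    alerts = []
    for i in range(window, len(samples)):
        baseline = samples[i - window:i]
        limit = mean(baseline) + threshold * stdev(baseline)
        if samples[i] > limit:
            alerts.append(i)
    return alerts


# Hypothetical page response times in milliseconds.
response_ms = [200, 210, 190, 205, 195, 202, 900, 204]
print(early_warning(response_ms))  # prints [6]
```

A rule this simple would need tuning in practice, since a perfectly flat baseline makes any small increase look like an anomaly, but it shows how an alert can fire on the first slow sample rather than after users start calling the help desk.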
Technology is a double-edged sword. On one hand, it is indispensable for the efficiency and services it allows state and local governments to deliver to citizens. But on the other, reliance on it means that when it fails, there are serious consequences.
IT teams need to proactively think about how they can better manage the complexity and siloed nature of IT systems today. Taking a few lessons from first responders, who also need to ingest, understand and respond to disparate information in critical situations, can serve as solid rules of the road on how to approach IT crisis management.