May 21, 2011
We're Up, Tired, But Smiling Again
This has been a rough week for our technology operations. The various headlines about two different (and unrelated) Michigan government outages tell you why our team is a bit behind on our sleep. The good news is that our critical Secretary of State systems are up and offices are open and helping customers. I’m happy again on a beautiful spring Saturday morning in Lansing.
Here are a few of the background articles covering the outages this week:
I know, enquiring minds want to know the “nitty gritty” about what caused the outages in the first place and specific details regarding what happened and how we recovered. That will come soon enough, with a detailed “Root Cause Analysis” (RCA) being performed on each situation. We owe those formal details to our agency customers and the public that was impacted. Each RCA report will include the steps we are taking to reduce the risk of such incidents recurring.
I also hope to do a longer article on this topic later this summer, with some behind-the-scenes conversations and perspectives on how we responded so quickly to two back-to-back situations. But for now, I felt I owed my blog readers an acknowledgement that the incidents did happen in Michigan, and a few words about the Michigan outage articles. It was not a fun week. When it rains, it pours.
While I’m all too aware of the reality that bad things happen in every technology organization, the key is how our teams respond and come back when apps are down. As I mentioned in an article last September, all government operations must be prepared for outages given various scenarios. (Though, I must admit, I never expected to be in this situation eight months later.)
As I have written over the years, this isn’t the first time, and won’t be the last time, that unplanned outages happen. However, the difference between these two outages and situations like the blackout of 2003 is that public perception and expectations are not the same. When large parts of the Northeast USA lost power, the public understood why services were down. But when an outage results from internal people, process, or technology failures, all eyes are on your team to get back up quickly and effectively.
Most importantly, I want to thank our recovery teams, who did an outstanding job of responding from the moment these outages were reported. We have an excellent staff that “got going” when the “going got tough.” The communication between the technology and business staff was a good sign of successful teamwork. Several of them worked more than 24 hours straight, and I am proud of and thankful for their efforts.
More to come on this topic in the future. But for now, I’m hearing my kids laugh again. I’m enjoying the sunshine. I’m smiling again in Lansing. Now I get to mow my lawn.