A first-person account of losing a data center to an electrical fire, and the heroic actions that restored services within 12 hours.
In a world where many of us deal with the same issues and challenges -- but aren't as collaborative and open as we could be -- we offer this glimpse of "inside baseball." Here's what happened and how we responded when an electrical fire took down our primary data center last month.
Tuesday, Feb. 18 was an average day. The Iowa legislature was in session, we had multiple operations and initiatives in various stages of completion, and staffing was normal. Payroll processing for state employees was scheduled for that evening. Weather forecasts called for a severe blizzard and travel warnings within 48 hours, so we were considering coverage and contingencies to ensure technical resources were available during the next few days.
Shortly after 3 p.m., the building lost power and evacuation alarms sounded. We'd been through these drills before; this was probably another one. But as staff assembled in their designated areas, we started receiving personal accounts of fire, smoke and percussion noise. Matt Behrens, chief operations officer for the Information Technology Enterprise of Administrative Services (DAS-ITE), took attendance of the evacuees and heard firsthand reports of the fire. We coordinated and confirmed preliminary reports.
Within 30 minutes of confirming those reports -- and in conjunction with the DAS public information officer -- I briefed the governor’s chief of staff, director of management, and governor’s spokesperson. News reporters were already close by because of the legislative session, so they were briefed within a few minutes. The governor, lieutenant governor and staff were aware and extremely supportive, offering whatever help we needed.
Our attention focused next on assessment. We assembled our technology response team -- they were already evacuated -- in a nearby building to begin our command-and-control response. While Matt was getting the team in place, I was strongly encouraging the fire department and police to allow access to the data center, but this took a while.
Matt set up conference call schedules with all agencies' IT staff, and we completed the first round of updates by 4:45 p.m. By 5 p.m., Matt had teams broken up into work streams and had completed preliminary sequencing of restoration processes.
Shortly after 5 p.m., the fire department escorted me into the data center along with our top-notch general services staff. General services quickly identified the source of the fire -- a wall-mounted electrical suppression unit. The smell from the FM-200 fire suppression discharge was incredibly pungent. Since all power was off, the first issues were restoring power (and bypassing the failure point) and venting the data center. This took some engineering because the air conditioning chillers were on the same emergency power shutoff as all of the other equipment in the center.
General services bypassed the failed electrical unit and restored power, and they adjusted circuits to allow for exhaust and venting. We partnered with general services staff in reviewing initial damage, cleanup efforts, fire watch controls, and -- importantly -- physical access controls as we required the doors to be opened for venting and needed to position staff there to control physical access.
A primary decision was whether to restore to the state's secondary data center (per our emergency plan), or attempt to restore the primary center. I decided to restore to the primary center. The cost, time and risk of each option were substantial considerations. Our history revealed good data on the trials and tribulations associated with the secondary center and the subsequent time and workload associated with failing back over to the primary after a crisis. Although we have good experience and success going to the secondary center, I felt we could save time and money by giving the primary center a shot first.
Another factor in my decision was the desire to avoid idling state staff for a long period of time -- that cost far outweighed the risk of staying with the primary center. (Staying with the primary center did, however, risk damage to the expensive uninterruptible power supply (UPS) and the electronics that were now exposed without the electrical suppression unit.)
By 6 p.m., we had staff at the secondary data center, ready to go. At the same time, general services had completed all power and elevator work at the primary data center.
Meanwhile, pressure was intensifying. Human resources needed to know if it was safe to occupy the building that houses the data center (about 1,000 employees were affected) and whether staff would be able to use technology the next day. In addition, payroll processing hadn't started, and direct deposits needed to be processed.
Other agencies faced high risk, too. The Department of Transportation needed its cameras, especially with the impending weather. The Department of Revenue needed to process tax collections. The Justice Department needed to process claims and fee payments. Accounting needed to process $162 million in payments (including direct deposits), and state websites were not available to citizens and businesses.
In addition, rumors were flying about. But our team remained focused on restoring essential government functions for taxpayers.
The building and the data center were deemed safe at approximately 6:30 p.m. Tuesday. IT response teams were dispatched to the data center to begin monitoring progress. In partnership with Iowa's Homeland Security team, we leveraged voice alert notification and sent a status update to agency directors.
Communications via conference call continued through Wednesday, Feb. 19. We also leveraged Homeland Security’s voice notification system to update agency directors and key staff twice during the event. Leadership continued to be informed throughout the event.
Restoring systems in the data center was predicated on interdependencies as well as priorities. Here's a timeline of when systems/functions were restored (not all-inclusive):
9 p.m. Feb. 18 (six hours after the event)
Data center cleaned of residue. Storage area network, firewalls and network core restored.
11 p.m. Feb. 18
Service desk, DOT cameras, virtual machines, financial systems and additional storage systems restored.
2 a.m. Feb. 19
SQL, mainframe, email, tape library, justice systems, additional firewalls, remaining DNS, Web email, authorization and authentication, major websites and agency systems restored.
3 a.m. Feb. 19 (12 hours after the event)
Print services, federal systems interfaces, additional justice systems, multiple agency applications and remaining major agency applications restored.
7 a.m. Feb. 19
Final event conference call with agencies and interested others. Outages and service calls routed through normal systems and processes.
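For readers who want to see how sequencing by interdependencies and priorities can work in practice, here is a minimal Python sketch. It is an illustration only, not the process our team followed: the dependency map, the priority numbers and the function name `restoration_waves` are all hypothetical, and the system names are borrowed loosely from the timeline above. It groups systems into "waves" in which everything a system depends on has already been restored, and orders each wave by priority.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def restoration_waves(depends_on, priority):
    """Group systems into waves; each wave only needs earlier waves.

    depends_on: dict mapping a system to the set of systems it requires.
    priority:   dict mapping a system to a rank (lower restores first
                within a wave); unranked systems default to 99.
    """
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        # Everything returned here has all of its dependencies restored.
        ready = sorted(sorter.get_ready(),
                       key=lambda s: (priority.get(s, 99), s))
        waves.append(ready)
        sorter.done(*ready)
    return waves


# Hypothetical dependency map and priorities, for illustration only.
depends_on = {
    "network core": set(),
    "storage": set(),
    "firewalls": {"network core"},
    "virtual machines": {"storage", "network core"},
    "DOT cameras": {"firewalls"},
    "financial systems": {"virtual machines"},
    "email": {"virtual machines", "firewalls"},
}
priority = {"DOT cameras": 1, "financial systems": 1, "email": 2}

waves = restoration_waves(depends_on, priority)
for number, wave in enumerate(waves, start=1):
    print(f"Wave {number}: {', '.join(wave)}")
```

With these made-up inputs, the foundational layers come up first and the business-facing systems follow, with the higher-priority ones listed ahead of email in the final wave. A real plan would also need restore-time estimates and staffing limits, which this sketch leaves out.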
On Feb. 19, we partnered with general services to begin scheduling and planning the recharge of the fire suppression system and replacement of the failed electrical suppressor. We also began collecting logs and team minutes for after-action reviews, tallying event costs and ensuring personnel were migrated carefully back to normal shift operations.
As I write this several days later, we expect to be crawling through the detailed restoration sequence, documentation, and lessons learned for a few more days.
Some initial lessons:
If we were in private industry instead of government, I would have already provided bonuses to Matt and his core team, who saved this enterprise millions of dollars, restored confidence, and kept this CIO in service to the governor of the state of Iowa.