Why Data Lakes are Essential for Public Health Services

State of Minnesota’s first data lake integrates and reports on disparate data sets to fuel operational improvements in the state public health system.

Public health agencies worldwide faced the same challenge early in 2020: data flow on COVID-19 cases went from a trickle to a flood to a tsunami in the span of weeks.

The United States, for instance, saw confirmed COVID cases rocket from fewer than 100 on March 1, 2020, to more than 215,000 a month later.1 These numbers soon burst into the millions and beyond, straining the data management resources of agencies responding to the pandemic.

One of those agencies was the Minnesota Department of Health (MDH). Facing an influx of new COVID-19 data to monitor, manage, and analyze, MDH implemented a data lake to optimize its pandemic response. The agency’s journey to a data lake solution — grappling with complex challenges, choosing cloud technologies, and assessing the results — illustrates how the cloud gave many agencies the power to respond quickly to the pandemic. Moreover, the lessons MDH learned along the way set the stage for grappling with future public health emergencies.

A recent Government Technology webinar explored Minnesota’s experience in ramping up a large-scale public health response in record time. Drawing on the experience of experts from MDH, Minnesota IT (MNIT) Services, and Amazon Web Services (AWS), the webinar detailed the difficulties confronting public agencies in a health emergency and explained why a data lake offers an attractive solution. The results of MDH’s efforts underscore the value of data lakes and cloud tools to address public health emergencies.

The challenge: Managing vast volumes of unstructured and structured data

Stephanie Meyer, supervisor of the COVID Epidemiology and Data Unit at MDH, was in uncharted territory in the first half of 2020. “The number of records coming in just went through the roof,” said Meyer. The records, which identified every new confirmed case of COVID-19, emerged from a process called infectious disease surveillance, which uses information from medical providers to identify instances of infectious disease. Disease surveillance is one of MDH’s essential tools to protect the health of Minnesota’s 5.7 million residents. MDH received its first report of COVID-19 cases in March 2020 and cases swelled to 918 by month’s end — roughly a year’s worth of surveillance for all other reportable diseases in Minnesota.

By April 30, 2020, MDH had added more than 8,000 positive COVID reports, about eight years of data for all other reportable diseases. As the virus infected thousands more Minnesotans, COVID testing would have to increase to keep pace. The rising tide of new tests put even more pressure on MDH. Everyone needed information about the situation as soon as possible, creating a stronger incentive to acquire tools to manage and process enormous volumes of data.

COVID-19 inundated MDH with unstructured and structured data from multiple sources. Insights from this data had to be correlated with structured and semi-structured data. MDH needed robust, real-time reporting to answer the most pressing pandemic questions: Who was getting sick? What was the positivity rate? Who was being hospitalized? Did hospitals have enough intensive care capacity? Which treatments showed the most promise? Which demographics faced the greatest danger?

As the pandemic unfolded, several Minnesota teams tracked this data manually. “We literally had case data glued to the wall in a conference room,” Meyer said. But they soon ran out of available people and data management capacity. “We couldn’t keep up,” she added.

The switch to remote work complicated matters. “We definitely did not telework prior to this,” Meyer said. Moreover, MDH teams had to elevate their data management capabilities to mount a statewide pandemic response.

The agency’s existing database and technology stack was an obstacle course. Meyer’s people knew how to create simple databases, for instance, but combining disparate data sets and using reporting to draw real-time insights was another matter altogether. MDH had database talent, but it needed scale, automation, and speed.

“Epidemiologists are historically known to just spin up our own little data sets and manage them without draining a lot of resources away from our IT staff,” Meyer said. “But these larger investigations like a pandemic involve more of our larger disease surveillance tools that have IT infrastructure and backing.”

The agency's primary tools were two existing IT systems for disease surveillance and immunization records. “These are independent systems that were never really meant to talk to each other,” Meyer said. “You don’t look up a single record and find everything about a person in one place. That’s just not how they’re designed.”
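To make the problem concrete, here is a minimal sketch of what "finding everything about a person in one place" involves when two systems share no common identifier. It assumes, purely for illustration, one extract of surveillance cases and one of immunization records; every field name and record below is hypothetical, and nothing here reflects MDH's actual systems. Matching is done on an exact, normalized name and date of birth, which is far simpler than the probabilistic record linkage a production system would likely need.

```python
from collections import defaultdict

# Hypothetical extracts from two independent systems (made-up people and fields).
surveillance_cases = [
    {"case_id": "C-1", "first_name": "Ana", "last_name": "Lee",
     "dob": "1980-02-14", "test_date": "2021-03-01"},
    {"case_id": "C-2", "first_name": "Sam", "last_name": "Olson",
     "dob": "1975-07-30", "test_date": "2021-03-02"},
]
immunization_records = [
    {"imm_id": "I-9", "first_name": "ana", "last_name": "LEE ",
     "dob": "1980-02-14", "dose_date": "2021-01-15"},
]


def link_key(record):
    """Build a normalized matching key from name and date of birth."""
    return (
        record["first_name"].strip().lower(),
        record["last_name"].strip().lower(),
        record["dob"],
    )


def link_records(cases, immunizations):
    """Attach every immunization record with the same key to each case."""
    by_key = defaultdict(list)
    for imm in immunizations:
        by_key[link_key(imm)].append(imm)
    return [
        {**case, "immunizations": by_key.get(link_key(case), [])}
        for case in cases
    ]


linked = link_records(surveillance_cases, immunization_records)
for row in linked:
    print(row["case_id"], "matched immunization records:", len(row["immunizations"]))
```

Even in this toy version, differences in capitalization and stray whitespace have to be normalized before the two records for the same person line up, which hints at why doing this by hand across thousands of daily reports did not scale.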

The solution: Implementing a data lake in the cloud

To cope with the challenges of the COVID-19 pandemic, MDH leaders had to decide whether to build a data warehouse or a data lake. Data warehouses require data to fit a predefined schema before it is loaded, which suits structured data. Data lakes store data in its raw form, whether structured, semi-structured, or unstructured, and apply structure when the data is read. With so many sources to process, MDH decided to build a data lake.

Plus, data lakes have another advantage. “[They provide] a very large data set for things like machine learning or artificial intelligence to learn from,” said Betsy Baker, modernization lead for state and local government with Amazon Web Services (AWS), who also spoke on the webinar.

AWS provides a query service called Athena that lets analysts use standard SQL to explore the data stored in a data lake. Other enterprise-level reporting tools help manage and process the contents of a data lake. For instance, Amazon QuickSight offers real-time analytics visualizations and custom dashboards, helping IT leaders gain insights that drive better decision-making.
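As a rough illustration of the kind of SQL question an analyst might ask of such data, the sketch below computes a positivity rate per county. It runs against an in-memory SQLite database from Python's standard library as a stand-in, not against Athena itself, and the table name, column names, and rows are all invented for the example.

```python
import sqlite3

# In-memory SQLite stands in for a SQL engine; this is not Athena.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test_results (county TEXT, result TEXT)")

# Made-up sample rows: County A has 1 positive of 4 tests, County B has 1 of 2.
conn.executemany(
    "INSERT INTO test_results VALUES (?, ?)",
    [
        ("County A", "positive"),
        ("County A", "negative"),
        ("County A", "negative"),
        ("County A", "negative"),
        ("County B", "positive"),
        ("County B", "negative"),
    ],
)

# Positivity rate = positive tests / all tests, expressed as a percentage.
query = """
SELECT county,
       COUNT(*) AS tests,
       SUM(CASE WHEN result = 'positive' THEN 1 ELSE 0 END) AS positives,
       ROUND(100.0 * SUM(CASE WHEN result = 'positive' THEN 1 ELSE 0 END)
             / COUNT(*), 1) AS positivity_pct
FROM test_results
GROUP BY county
ORDER BY county
"""

rows = conn.execute(query).fetchall()
for county, tests, positives, positivity_pct in rows:
    print(f"{county}: {positives}/{tests} positive ({positivity_pct}%)")
```

The query uses only widely supported SQL (`GROUP BY`, `CASE`, aggregate functions), so the same shape of question should carry over to other SQL engines with little change, though dialect details vary.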

These tools make life easier for IT leaders. Tuning them for maximum advantage, however, poses difficult questions. “How can we harness this in a way that’s efficient, usable, accessible, and reproducible?” asked Steve Gorg, a data and cloud initiative supervisor with Minnesota IT (MNIT) Services who partnered with MDH and AWS to design and implement the data lake.

The pandemic created an ideal scenario for answering these questions with a data lake.

The results: A single point of truth for data and applications

MDH data sources are now centralized in a data lake hosted on AWS. This strengthens the agency’s ability to understand the changing nature of the pandemic.

“As the pandemic continued, there was more focus, both in the media and in public health in general, on mutations of the virus variants,” Meyer said. The agency’s many laboratory partners provided a stream of data informing how the virus was shifting over time.

The agency’s developers built an app to stream each partner’s data into a new home in the data lake, making it easier to retrieve the data and comb it for insights. With previous data technologies, the agency would have had considerable difficulty formatting data from multiple sources and then pulling it in. “But you can do that with our data lake technology and reporting out from the lake,” she said. “And we were also able to use it with both variant and vaccine breakthrough data.”
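The general idea of pulling differently formatted partner feeds into one common shape can be sketched as follows. This is not MDH's app: the two partner formats (one CSV, one JSON), the field names, the placeholder lineage labels, and the choice of newline-delimited JSON as the landing format are all assumptions made for illustration.

```python
import csv
import io
import json

# Hypothetical feeds from two lab partners, each in its own format (made-up rows).
lab_a_csv = "specimen,collected,lineage\nA-001,2021-04-01,LINEAGE-X\n"
lab_b_json = '[{"id": "B-77", "collection_date": "2021-04-02", "variant": "LINEAGE-Y"}]'


def from_lab_a(text):
    """Map lab A's CSV columns onto the common record layout."""
    for row in csv.DictReader(io.StringIO(text)):
        yield {
            "specimen_id": row["specimen"],
            "collected_on": row["collected"],
            "lineage": row["lineage"],
            "source": "lab_a",
        }


def from_lab_b(text):
    """Map lab B's JSON fields onto the same common record layout."""
    for item in json.loads(text):
        yield {
            "specimen_id": item["id"],
            "collected_on": item["collection_date"],
            "lineage": item["variant"],
            "source": "lab_b",
        }


# Combine both feeds, then serialize one JSON object per line for storage.
records = list(from_lab_a(lab_a_csv)) + list(from_lab_b(lab_b_json))
lake_lines = [json.dumps(record, sort_keys=True) for record in records]
for line in lake_lines:
    print(line)
```

Keeping one small adapter per partner means a new lab can be added without touching the others, and everything downstream reads a single layout regardless of where a record came from.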

Gorg summarized, “It has proven to be a phenomenal reporting platform as a single source of truth for our data, that we simply go to one location for our data, and we no longer have to go through all these different hoops to get access to different data sources.”

The data is available in one format, he added, and no special tools are needed to access it. Automating the flow of data into the data lake environment lets MDH make that data available faster and more efficiently. If more storage or computing resources are needed, they can be acquired quickly.

Baker put it this way: “The nice thing is that we could fail fast or succeed quick and be able to really prove out any use cases that were needed in order to meet the very, very quickly changing needs of the pandemic response.”