Big Data, Open Data and the Need for Data Transparency (Industry Perspective)

Open data is only as good as the data analytics platforms and true data transparency policies on which it relies.

by Eddie Garcia / October 31, 2016

Data is part of everything we do, especially given the current open data movement. From financial market performance to farmers' market locations, weather to health care, bridge and road safety to population information, significant amounts of data are generated and available for aggregation and analysis, and can be applied to improve public services. This is the philosophy behind the open data movement: if we make this data available to the public, or at least the high-value data, we can crowdsource public service issues and come up with the best possible solutions.

But open data is only as good as the data analytics platforms and true data transparency policies on which it relies. Bringing big data, open data and data transparency together empowers data to solve some of the world’s most challenging problems. Over the last few years alone, I have seen big data used to reduce sepsis, understand Parkinson’s disease, combat child sex trafficking and fight Ebola, among many other noble causes. 

In one such effort, the University of Texas at Austin (UT) and the Texas Advanced Computing Center collaborated on two Data for Good hackathons to develop solutions to prevent, detect, fight and reduce mosquito-transmitted diseases, including the Zika virus. More than 120 data scientists, engineers and UT Austin students brought together their diverse skills to apply data to a number of questions related to the virus.

In the first hackathon, solutions involved scraping outbreak data from the Centers for Disease Control and Prevention (CDC), the World Health Organization (WHO) and similar organizations around the world. These “hackers” paired the data with news reports and social media feeds to create visualizations of the quick progression of Zika cases across Central America in an effort to learn more about the Aedes mosquito and the spread of the virus. One participant presented a method to detect stagnant bodies of water and distinguish green or brown bodies of water from a clear pool or stream, in hopes of identifying prime mosquito breeding grounds. Others worked on a data collection mobile app that allows people to quickly and easily report potential Zika cases and symptoms from their mobile devices.

At the second hackathon, projects focused on clinical and epidemiological research into Zika. Projects ranged from identifying Zika in water samples using metagenomic data to exploring Zika proteins and molecular docking to identify potential drugs to fight infections. The identification of Zika in publicly available water sample data was a huge discovery, and proof that these projects have the potential to make a significant scientific impact. With luck, these projects will seed future discoveries and insights.

The goal of the events was not specifically to find a cure for Zika. Rather, it was to highlight and work through the challenges involved in finding, creating and socializing open data sets for research and social good. The hope is that this is not a one-time event, but a model that can be replicated for future research, data sharing and continual collaboration between industry, academia and data citizens. In fact, one challenge identified was the difficulty of accessing some data sets from a few global health organizations. This was a useful problem to uncover, because that data is already supposed to be open.

In 2013, the White House issued an Open Data Policy, directing agencies to treat data as an asset, namely by making it “open and machine readable” so the public can access and use it. Project Open Data, a collection of code, tools and case studies, was subsequently launched to help agencies adopt the Open Data Policy. Now there are nearly 200,000 data sets, tools and resources publicly available free of charge. While this all sounds great, there is a problem that needs to be addressed for this data to be more consumable. In some cases, the data sets are PDF documents, which are not inherently machine-readable. Data should be shared in formats that are easier to consume: at a minimum, comma-separated files, and preferably formats that support metadata and nesting like JSON, XML or, better yet, Apache Avro and Apache Parquet for easy ingestion into Hadoop.
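
To make the difference between a flat comma-separated file and a format that supports metadata and nesting concrete, here is a minimal Python sketch. It is an illustration only: the sample rows, the field names and the csv_to_json_document helper are all hypothetical and do not come from any real agency data set.

```python
import csv
import io
import json

# Hypothetical sample standing in for a published CSV data set.
SAMPLE_CSV = """market_name,city,state,open_days
Downtown Market,Austin,TX,Sat;Sun
Riverside Market,Houston,TX,Wed
"""


def csv_to_json_document(csv_text, title, source):
    """Wrap flat CSV rows in a JSON-ready dict with metadata and nesting."""
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        records.append({
            "market_name": row["market_name"],
            # Nesting: group the location columns into one object.
            "location": {"city": row["city"], "state": row["state"]},
            # Nesting: turn the semicolon-delimited cell into a list.
            "open_days": row["open_days"].split(";"),
        })
    return {
        # Metadata travels with the data instead of living in a separate file.
        "metadata": {
            "title": title,
            "source": source,
            "record_count": len(records),
        },
        "records": records,
    }


document = csv_to_json_document(
    SAMPLE_CSV,
    title="Farmers' markets (illustrative)",
    source="hypothetical-agency",
)
print(json.dumps(document, indent=2))
```

The same records could be written to Avro or Parquet with additional libraries; JSON is used here only because it needs nothing beyond the Python standard library.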

For researchers, public servants, data scientists and citizens to use data to spot problems and find solutions, they need big data tools to scrub the data sets and convert them from their original formats so they can perform investigations and operations with them. This is where the policy objectives of open data, government transparency and the technological power of big data come together. In addition, policies need to support the technology requirements. For example, while current open data and government transparency policies allow PDF documents to count as open data in some cases, not all data analytics platforms can read PDF files.
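
As a rough idea of what scrubbing a data set can involve, the following Python sketch normalizes a few inconsistent records and drops duplicates. The records, field names and accepted date formats are made up for illustration; a real pipeline would depend on the data set at hand and would typically run on a big data platform rather than in a single script.

```python
from datetime import datetime

# Made-up sample rows with typical inconsistencies (not real outbreak data).
RAW_RECORDS = [
    {"country": " Brazil ", "report_date": "2016-02-01", "cases": "1,204"},
    {"country": "brazil", "report_date": "02/01/2016", "cases": "1204"},
    {"country": "Colombia", "report_date": "2016-02-03", "cases": "N/A"},
]

# Assumed date formats; extend this tuple for other sources.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def parse_date(text):
    """Return an ISO 8601 date string, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def parse_count(text):
    """Return an integer count, or None for placeholders such as 'N/A'."""
    digits = text.replace(",", "").strip()
    return int(digits) if digits.isdigit() else None


def scrub(records):
    """Normalize each record and drop exact duplicates after normalization."""
    seen = set()
    cleaned = []
    for record in records:
        row = {
            "country": record["country"].strip().title(),
            "report_date": parse_date(record["report_date"]),
            "cases": parse_count(record["cases"]),
        }
        key = (row["country"], row["report_date"], row["cases"])
        if key not in seen:
            seen.add(key)
            cleaned.append(row)
    return cleaned


cleaned = scrub(RAW_RECORDS)
for row in cleaned:
    print(row)
```

Here the first two rows collapse into one once whitespace, capitalization, date format and thousands separators are normalized, and the unparseable count is kept as a missing value instead of being guessed.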

The government transparency and open data movements both have the honorable goal of making data available to anyone who wants it. This isn’t just in recognition of the need to better govern, but a real step toward creating a better world. Certainly open data will improve government, but it also empowers citizens and builds economic value, not just by monetizing a resource we already have, but also by all the opportunity created from the intelligence that it enables. While the policies have paved the way, only together with big data technology will open data realize its full potential. Big data and open data are a powerful combination that can make a positive change in this world.

Eddie Garcia is chief security architect for Cloudera.