August 1, 2012 By Jessica Meyer Maria
In late April, more than 200 data scientists from more than 10 cities around the world spent 24 hours in London designing solutions to help improve the U.S. Environmental Protection Agency’s (EPA) Air Quality Index.
People who suffer from asthma and other respiratory diseases use the index to avoid dangerous levels of attack-triggering outdoor air pollutants, and the hackathon’s goal was to help build local early warning systems to accurately predict dangerous levels of pollutants on an hourly basis.
“We wanted to use open data, but more than that, we wanted data that was meaningful in terms of social change or influence,” said Carlos Somohano, a data scientist for Data Science London (DSL), about the choice to use an environmental data set from Cook County, Ill.
The Data Science Hackathon was created and hosted by DSL and Data Science Global in collaboration with Kaggle, a platform for predictive modeling and analytics competitions. The activities were part of Big Data Week, a series of community-led events and hackathons involving big data.
David Chudzicki, data scientist for Kaggle, said Chicago’s thriving data science and machine learning community was involved in the event from early on. “Cook County is making a big drive toward open gov data,” he said, “so the collaboration with them providing the data set occurred quite naturally as we were searching for a good problem for the hackathon.”
If the hackathon can contribute to positive health-care outcomes, then the event will prove more than worthwhile, said Chris Roche, regional director for Greenplum, a division of EMC, which sponsored the event. “What I like about the hackathon and the data science community is the accelerated innovation that they create.”
On the whole, the competition led to some great insights into the problem and started people looking at this type of data, Chudzicki said. Cash prizes totaling 3,000 pounds (approximately $4,700) were split between a global winner and a London-based winner.
The winning solutions were submitted and ranked through Kaggle’s competition platform, which provided real-time leaderboards that let participants track their scores continuously.
Though the top winner, Ben Hamner, was ineligible for any prize money as a Kaggle-employed data scientist, his solution is notable in that he claims to have barely glanced at the domain before training the model, meaning he could devise a winning solution without knowing anything about the actual issues going on in Cook County. To him, the data might as well have been random numbers.
“I was surprised that domain insight wasn’t necessary to win the hackathon,” Hamner said. “Key insights have been crucial in many of our longer-running competitions.”
While it’s too early to know what his solution could mean for Cook County, the EPA and citizens who follow the Air Quality Index, the solution is now undergoing a period of thorough exploration and development.
Melbourne’s James Petterson won the global first prize and, like Hamner, spent little time looking at the data itself. He said he was surprised to achieve such a high-quality result without having spent time trying to understand the data set.
“If you’re a data scientist, let the data talk,” said DSL’s Somohano. “You don’t have to be a domain expert. The competition proves that a good data scientist doesn’t have to know the domain context to achieve results.”
The code for both winning models discussed above, as well as that of the local London winner, has been made publicly available by Kaggle and Data Science London, meaning it’s accessible to anyone who wants to explore it and continue working on it. Development may well continue outside the expected channels.
Currently, predictive models drafted at the hackathon are being reviewed to determine their relevance at the local, state, U.S. EPA and National Weather Service levels. “We’re looking at who is most appropriate to use this,” said Cook County CIO Greg Wass. “Once these solutions are refined, they may go up the chain. We’ll see how far we get with this thing.”
“One of our missions is to promote awareness of data science and the dissemination of data science knowledge,” Somohano said. “It’s a new thing here in the UK, but in the U.S., it’s already getting quite trendy.”
The best way to raise awareness and involve local and international data science communities was through a hackathon, determined Somohano and his DSL partner Stewart Townsend.
“The concept of a ‘hackathon’ has deep roots in Silicon Valley as an event that combines innovation and competition in a very short, intense period of time,” Chudzicki said. “While the term ‘hacker’ has negative connotations from being used to describe computer security crackers, the meaning in the community is someone who delights in solving problems and building new things.”
The EPA data set was chosen because air pollution affects people regardless of their location, even if the specific data used in the competition was sourced from one U.S. city.
“We worked in partnership with Big Data Chicago to make this happen and to share our environmental data sets,” said Cook County Deputy Director of New Media Sebastian James. “We were asked if we could get the specific data about air quality to the event organizers. They needed a big data set to work with, something that served a public need and was very topical.”
As Data Science Global organizes and promotes future events, DSL’s Somohano said that other subject matter and data of high relevance to government, such as health care, will be the focus.
Health-care provision is one of the major concerns of governments worldwide, said Greenplum’s Roche. “Serious respiratory disease affects over 700 million people globally and chronic disease accounts for over 80 percent of all primary care consultations.”