
Data Mining Techniques Must Adapt With Web 2.0

A panel discusses how governments need to prepare for a whole new way of processing and using the large volumes of unstructured data available to them.

by Brian Heaton / June 1, 2012
Tom Deutsch speaking at the GTC West conference on May 30, at the Sacramento Convention Center. Photo by Jessica Mulholland.

SACRAMENTO, Calif. — Is your government agency struggling to get a handle on data mining? If so, representatives from IBM and Splunk have a few tips to help make better sense of unstructured data and how to use it effectively.

When dealing with unstructured sources of data — social media feeds, blogs, chat logs, wikis, and other items that sometimes aren’t stored in a database — IT leaders need to accept that traditional methods of information extraction won’t always work, said Tom Deutsch, program director of Big Data for IBM. A variety of new tools have to be learned in order to siphon useful records, he said.

Structured vs. Unstructured

Structured data is organized so that it is easily identifiable, such as records in a Microsoft Access database or in another database queried with Structured Query Language (SQL).

In contrast, unstructured data is information that doesn’t have a predefined format and may not fit into existing database fields.

Speaking at the annual GTC West conference on Wednesday, May 30, Deutsch explained that in the past, data mining techniques mostly ignored the majority of information being searched. These data mining techniques typically sought specific statistics, such as word count or proximity of words. That extraction method generally worked well for longer items, such as complex manifestos or engineering documents.

But it doesn’t work as well for today’s communications, which are typically shorter and more compact.
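As a rough illustration of the statistics Deutsch describes, the sketch below counts words and measures how close two words appear to each other. The function names and the sample sentence are hypothetical and are not drawn from any IBM tool; this is a minimal sketch of the older style of extraction, not of NLP.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_counts(text):
    """Count how often each word appears."""
    return Counter(tokenize(text))


def min_distance(text, first, second):
    """Return the smallest gap, in words, between two terms, or None if either is absent."""
    tokens = tokenize(text)
    first_positions = [i for i, token in enumerate(tokens) if token == first]
    second_positions = [i for i, token in enumerate(tokens) if token == second]
    if not first_positions or not second_positions:
        return None
    return min(abs(i - j) for i in first_positions for j in second_positions)


# Hypothetical sample text, invented for illustration only.
sample = "The bridge budget was cut. The budget review blamed the bridge delay."

budget_mentions = word_counts(sample)["budget"]
bridge_budget_gap = min_distance(sample, "bridge", "budget")
```

Counts and distances like these say nothing about who cut the budget or why, and a short message gives them very little text to work with.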

Deutsch said agencies should turn their attention to natural language processing (NLP) technology — a combination of linguistics and artificial intelligence — to extract meaningful information from a source. By using NLP, an organization can understand not just what was said, but also who said it and who else is connected to the extracted information.

“The model today is we force people to learn the tools that then interpret the results,” Deutsch said. “One of the models going forward is that the tools should be able to understand the native expression of ‘what is the cost,’ ‘who did this,’ ‘what is likely to happen’ — those types of interrogating systems.”

In addition, Deutsch said he’s in favor of agencies embracing “fit for purpose” computing architectures. For the most part today, users try to figure out how to mold a data set into something that can be stored in a structured database. While that approach has been rational during the past 30 years, it’s now not the only option available.

So instead of relying on a vendor’s definition of what “big data” is and what computing notions are needed to examine it, Deutsch said agencies should make clear to the vendors they work with that their technology systems must remain open and flexible enough to support a “pluggable” approach to data mining.

Machine Data a ‘Gold Mine’

Joining the discussion was Tapan Bhatt, senior director of solutions marketing, Splunk Inc., a vendor that provides technology for making machine-generated data accessible and usable. Machine data — information created solely by a computing process — is entirely unstructured, except for the fact it has a time stamp on every bit of information.

For example, when someone logs into a system, browses a website, or clicks into a document or mobile application, that data is collected and stamped with a time. Bhatt called that information a “gold mine” for IT and business operations, but said it isn’t being leveraged as effectively as it could be.

NASA is using machine data from satellite information to increase its security posture and private companies such as Expedia are using machine data to improve services. But if all levels of government start embracing machine data, Bhatt said it could help improve the security of sensitive information and provide more knowledge to decision-makers.

Bhatt explained there are several reasons why machine-generated information isn’t being used in a more widespread fashion. The issues are centered around three distinct areas: volume, variety and velocity.

In many cases, machine-generated data sources are high-volume, and many government agencies may not yet be able to examine that much information and make sense of it. In addition, because machine data comes from such a variety of sources, there is no standard format or definition of the information, which makes it difficult to use. Finally, there is the sheer speed at which the data comes into an agency. For instance, when a natural disaster occurs, more people visit a website to find out what’s going on, which in itself rapidly creates more data.
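The variety problem, and the role of the time stamp as the one common element, can be sketched in a few lines. The two time stamp layouts and the sample log lines below are assumptions invented for illustration. They are not Splunk formats, and this is not how Splunk processes data.

```python
import re
from datetime import datetime

# Hypothetical time stamp layouts. Real sources vary far more widely.
PATTERNS = [
    (re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}"), "%Y-%m-%dT%H:%M:%S"),
    (re.compile(r"\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}"), "%m/%d/%Y %H:%M:%S"),
]


def extract_timestamp(line):
    """Return the first recognizable time stamp in a line, or None."""
    for pattern, layout in PATTERNS:
        match = pattern.search(line)
        if match:
            return datetime.strptime(match.group(0), layout)
    return None


def to_timeline(lines):
    """Pair each line with its time stamp, drop lines without one, and sort by time."""
    events = [(extract_timestamp(line), line) for line in lines]
    return sorted((event for event in events if event[0] is not None),
                  key=lambda event: event[0])


# Invented sample lines from two imaginary systems.
lines = [
    "2012-05-30T10:15:07 login user=alice status=ok",
    "web01 05/30/2012 10:15:02 GET /alerts 200",
    "heartbeat without a time stamp",
]

timeline = to_timeline(lines)
```

Once events from different systems share a single timeline, they can be compared even though the rest of each line follows no common format.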

“When you think about what people traditionally use — databases, business intelligence systems, data warehouses — all these systems are meant to address structured data,” Bhatt said. “None of these systems can address the complexity, the variety and the volume associated with big data sources. They just aren’t designed to do that. That’s why you’re seeing interest in big data technologies.”



Brian Heaton

Brian Heaton was a writer for Government Technology magazine from 2011 to mid-2015.
