Information Explosion Yields Data Nightmare

Let data mining and data warehousing answer the wake-up call.

May 31, 1998
Imagine, for a moment, the sum total of all information stored on every computer -- every desktop, mini and mainframe -- in the entire world at this exact point in time. Consider the trillions of gigabytes of information existing as electronic impulses stored in millions of hard drives planetwide. Now, imagine this universe of data doubled!

According to The Essential Client/Server Survival Guide, 2nd ed., the total quantity of data on computers worldwide doubles every five years. With the widespread use of client/server technologies, including the Internet, expectations are that this doubling may soon occur yearly. The sheer quantity of data now being stored digitally is almost unimaginable. The size and scope of a database containing complete information from a single state motor vehicle department, for example, is staggering. Giving users even basic access to such large repositories is challenging, to say the least. However, growing requirements for storing, evaluating and analyzing massive data stores have brought about a new technology field -- data mining and data warehousing.
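To put the two doubling rates side by side, the arithmetic can be sketched in a few lines of Python. The function name and the five-year comparison window are illustrative assumptions, not figures from the survival guide; only the two doubling periods come from the text above.

```python
def growth_factor(years, doubling_period_years):
    """Return how many times larger a data store becomes after `years`,
    assuming it doubles once every `doubling_period_years`.
    (Hypothetical helper for illustration only.)"""
    return 2 ** (years / doubling_period_years)


# Compare the two rates mentioned in the article over the same five years.
five_year_doubling = growth_factor(5, 5)   # doubles once
yearly_doubling = growth_factor(5, 1)      # doubles five times
print(five_year_doubling, yearly_doubling)
```

Under these assumptions, a store that doubles every five years is twice as large after five years, while one that doubles yearly is 32 times as large over the same period.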

Data Grows With Population

In some instances, increases in state agency databases have been triggered by nontechnical factors. For example, the state of Florida, in general, and Palm Beach County, in particular, have experienced explosive population growth in the last 20 years, according to Roger T. Presas, certified public accountant and business process consultant to the Clerk of the Circuit Court in Palm Beach County. Following this population uptrend, the volume of information being stored by state agencies has rapidly expanded along with the people served. "The need to serve the fast-increasing population coupled with the requirement to improve the cost-effectiveness of governmental services have caused us as public officials to search for new solutions," Presas said.

The search led to the examination of new ways to store and analyze digital data to improve the accuracy, availability and relevance of related information. Initially, the Palm Beach County Circuit Court stored and retrieved information using mainframe technology. As the volume of data and the demand for retrieving it increased, county officials decided that a more flexible solution was needed and began looking for better tools.

The county targeted the processing of child support information as a specific function requiring better data tools. The Child Support Enforcement (CSE) system, an Informix-based data-mart application, was created to replace a 15-year-old mainframe application developed in-house. The resulting new data-warehousing application solved significant problems, such as the need to shorten the turnaround time between receiving and disbursing child support payments.

A second, equally important, requirement was the need to store and retrieve a greater quantity of child support case information required by both the courts responsible for processing child support cases and state and federal agencies. "In addition to meeting this need," commented Presas, "the data-warehouse solution enabled us to achieve more strict compliance with ever-changing legal mandates, reduce costs and increase employee productivity."

Choosing A New Solution

Arriving at such a solution is never an easy task. In this case, the decision-making process was simplified when the clerk of the court initially recognized the need to improve the state's child support data-management operation. This preliminary decision was further supported when the state of Florida mandated development of a new child support enforcement application.

The new application was made available to all court clerks in the state. "Clerks, early on, had concerns regarding the use of a database-based application. Many clerks lacked the technical personnel to undertake a project of this significance and were not familiar with the possible benefits of the solution," Presas said. Consequently, initial acceptance of the new tool was slow.

This was not the case, however, with the clerk of Palm Beach County, where Presas works. The clerk decided to move ahead with the data-warehousing solution. When deployed in June 1997, Palm Beach County's approach proved to be the most successful implementation of the CSE application in the state.

Architecture, Deployment and Benefits

The CSE application maintains information required to process over 35,000 active child support cases in Palm Beach County. Data is stored and managed using Informix OnLine 7.23 running on an IBM RS/6000 SP 2 UNIX (AIX) server. All programs are written in the Informix "4GL" language. Hundreds of users access the application using a wide area network running TCP/IP. Users include the clerk's staff directly responsible for child support operations, judges hearing support cases in courtrooms, the public, and state and federal agencies. Interested parties can access payment information by calling an integrated voice response server running Edify software that connects directly to the database.

Data conversion and migration from the mainframe-based application to the CSE system required significant efforts. The structure of mainframe information was completely different from the design of the new data warehouse. Additionally, data elements -- like codes, indicators and similar items -- had to be converted to the new format. The critical nature of the information made these tasks more challenging. Data-conversion programs were developed and tested for extended periods until everyone involved was assured of the accuracy of the results. At the same time, the application was ported from the original Unisys 6000 platform to operate on an IBM server. This effort was undertaken to meet requirements for effective response to hundreds of concurrent user inquiries.

Once deployed, the increased efficiency of the CSE application allowed clerk management staff to reallocate human resources to other functions. This has resulted in estimated annual savings of over $200,000. Hardware and software maintenance costs have also been reduced but have not been specifically quantified. "Other benefits of the new system," stated Presas, "include increased payment turnaround, improved information flow to the courts, state and federal agencies and improved enforcement due to the availability of more timely, detailed information."

Mining For Fraud

For another daunting dataset, consider the information stored by major insurance companies. As insurance company information stores grow to unwieldy proportions, data-mining techniques are becoming an increasingly important weapon for companies trying to fight all sorts of fraud.

According to a report from the Newsbytes wire service, medical insurance fraud recently made headlines due to the use of data-mining technology, which helped insurance investigators ferret out a scheme in which fictitious companies used the names of real doctors and patients to bill for services that were never provided.

Joyce Hansen, vice president of Integrity Plus Services in Minneapolis, told Newsbytes that Integrity Plus, an insurance fraud detection company, has been using IBM's Fraud and Abuse Management System to catch many forms of fraud. Integrity Plus has caught bills for services supposedly provided on Sundays and holidays, bills from clinics claiming to serve patients who live far away, and so on, Hansen said. While it is difficult to say exactly how much the system saves, Hansen said that in the first year of its use, the claims savings from catching fraudulent billing increased 20 percent.
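A check for services billed on Sundays and holidays is simple enough to sketch. The Python below is only an illustration of that kind of rule, not IBM's system: the claim layout, the function name and the short holiday list are all assumptions made for the example.

```python
from datetime import date

# Hypothetical holiday calendar; a real one would come from the insurer.
HOLIDAYS = {date(1998, 7, 4), date(1998, 12, 25)}


def flag_suspicious_dates(claims, holidays=HOLIDAYS):
    """Return the IDs of claims whose service date is a Sunday or a holiday.

    `claims` is an iterable of (claim_id, service_date) pairs, an assumed
    layout chosen for this sketch.
    """
    flagged = []
    for claim_id, service_date in claims:
        is_sunday = service_date.weekday() == 6  # Monday is 0, Sunday is 6
        if is_sunday or service_date in holidays:
            flagged.append(claim_id)
    return flagged


sample_claims = [
    ("C-001", date(1998, 5, 31)),   # a Sunday
    ("C-002", date(1998, 6, 1)),    # a Monday
    ("C-003", date(1998, 12, 25)),  # a listed holiday (a Friday)
]
print(flag_suspicious_dates(sample_claims))
```

A flagged claim is not proof of fraud, since some providers legitimately work weekends; a rule like this only narrows the list that investigators review.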

According to an IBM technician, the system, designed in consultation with several customers in the insurance industry, looks at about 100 different claim characteristics to spot abnormal patterns that might suggest fraud. It might identify, for instance, the fact that a particular ambulance operator consistently claims longer runs than others in the same area. When the system spots a suspicious trend such as this, investigators can take a closer look.
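The ambulance example amounts to comparing each provider with its peers in the same area. The Python sketch below shows one simple way to do that; it is not the method used by IBM's system, and the claim layout, the 1.5 ratio threshold and the minimum of three peers are assumptions chosen for illustration.

```python
from collections import defaultdict
from statistics import mean, median


def flag_long_run_operators(claims, ratio_threshold=1.5, min_peers=3):
    """Flag operators whose average run is far above their area's norm.

    `claims` is an iterable of (operator, area, miles) tuples (assumed layout).
    An operator is flagged when its average miles exceeds `ratio_threshold`
    times the median of the per-operator averages in the same area.
    """
    runs = defaultdict(list)
    for operator, area, miles in claims:
        runs[(operator, area)].append(miles)

    # Average run length per operator, grouped by area.
    averages_by_area = defaultdict(dict)
    for (operator, area), miles_list in runs.items():
        averages_by_area[area][operator] = mean(miles_list)

    flagged = []
    for area, averages in averages_by_area.items():
        if len(averages) < min_peers:
            continue  # too few operators for a meaningful comparison
        peer_median = median(averages.values())
        for operator, average in averages.items():
            if average > ratio_threshold * peer_median:
                flagged.append((operator, area))
    return sorted(flagged)


SAMPLE_CLAIMS = [
    ("A", "north", 10), ("A", "north", 12),
    ("B", "north", 11), ("B", "north", 9),
    ("C", "north", 25), ("C", "north", 27),
    ("D", "south", 30), ("E", "south", 32), ("F", "south", 31),
]
print(flag_long_run_operators(SAMPLE_CLAIMS))
```

In the sample data, operator C averages 26 miles against a peer median of 11 in the north area and is flagged, while the south-area operators all report long but similar runs and are not. A production system weighing about 100 claim characteristics would be far more elaborate; this covers a single characteristic.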

Ben Barnes, general manager of global business intelligence solutions at IBM, admitted that the system usually cannot work fast enough to pre-screen claims, so when fraud is caught, the insurer may have to take legal action to recover money already paid. However, the service provider who has been caught once will be watched more closely in the future.

The fact that technology exists to analyze claims for fraud should deter some would-be fraudsters. Others will try to outsmart the system, and Hansen said that is already happening as the perpetrators of fraud change their behavior in attempts to avoid detection. "They're learning those controls, and so they can bypass them," she said. However, she added that detection technology will continue improving to stay ahead of the fraud attempts.

Planning For The Future

The only thing one can say for certain is that information stores will continue to grow. Impossibly huge databases and the demands of an increasingly sophisticated user base will continue to challenge data managers. Fortunately, experience to date suggests that the industry will follow that growth trend and endeavor to provide more robust tools to facilitate the management and examination of these digital mountains.

A new "sub-industry" of data warehousing and data mining has sprung up almost overnight to meet the demand. Government agencies will increasingly turn to these solutions to meet growing demand from both internal and public users. Technology advances in data storage, transfer (for data warehousing) and artificial intelligence (for data mining) will make the job easier moving forward.

A note of caution: Data managers would be wise to carefully examine all the elements of a proposed solution to ensure it will be compatible with existing infrastructures and those of related agencies and organizations. One's ability to extend and evolve the application down the road is also important.

As with any new technology arena, different vendors will promote different proprietary approaches to the problem. To the extent possible, try to implement a solution that will evolve with future demands. Make the investment in this new technology an investment for the future and not a "one-shot" solution likely to be rendered obsolete with the next wave of technological change.

Developing Definitions

While these areas are still developing, a working set of definitions is necessary to understand what data mining and data warehousing are and what they are trying to do. The following have been compiled from a variety of sources and seem to be currently agreed-upon descriptions. (Note: As with any new technology, these definitions are subject to change as things develop.)

Data Warehousing -- A collection of data designed to support management decision-making. Data warehouses contain a wide variety of data that presents a coherent picture of business conditions at a single point in time. Development of a data warehouse includes development of systems to extract data from operating systems plus installation of a warehouse database system that provides managers flexible access to the data. (Courtesy of ZDNet's Webop