Search Tools for Information Quests

A new generation of search and retrieval tools has improved the way text documents can be found in computers. Some software can even look for images.

July 31, 1995
Aug 95

Level of Govt: State, local

Function: Text retrieval; Document Management

Problem/situation: Government agencies need faster, more accurate ways to get at information stored in computer files.

Solution: Full text search and retrieval engines.

Jurisdiction: San Bernardino County, Calif., Municipal Water District

Vendors: Datapro Information Services Group, Compuserve, Prodigy, America Online, Excalibur Technologies Corp., Fulcrum Technologies, Information Dimensions Inc. (IDI), Personal Library Software Inc., Verity Inc., ZyLAB, Delphi Consulting Group, DEC, Apple, Microsoft

Contact: Karen Shegda, Datapro Information Services Group 609/764-0100; Robert Tincher, San Bernardino Valley Municipal Water District, 909/387-9244.

By Tod Newcombe

Contributing Editor

For users of imaging applications involving complex documents, indexing has always been a problem. Someone with knowledge about the document's subject matter had to analyze the scanned images and come up with a series of keyword identifiers. Despite the time and resources spent on indexing, users had no way of knowing whether they would always find everything they were looking for.

To both automate the laborious process of indexing and increase the accuracy of searches, imaging users have turned to full-text search and retrieval technology. Once a mainframe tool, search and retrieval has migrated to client/server and PC-based applications. Instead of someone manually entering keywords into an index database, document images are converted into text using optical character recognition (OCR) technology. Text retrieval software then converts the entire text file into an index, allowing users to find a document by searching for any word that appears in the text.
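The indexing step described above can be sketched as an inverted index, a structure that maps each word to the documents containing it. This is a minimal illustration, not any vendor's implementation; the document IDs, sample text and function names are assumptions made for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map every word to the set of document IDs whose text contains it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


# Hypothetical OCR output keyed by document ID.
documents = {
    "memo-1": "Water rights agreement for the valley basin",
    "memo-2": "Budget memo on pipeline maintenance",
    "memo-3": "Pipeline easement and water delivery schedule",
}

index = build_index(documents)
print(sorted(index["water"]))     # ['memo-1', 'memo-3']
print(sorted(index["pipeline"]))  # ['memo-2', 'memo-3']
```

Because every word is indexed, nobody has to decide in advance which terms are worth cataloging.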

Today's search and retrieval tools can even compensate for bad spellers. Using a technology called fuzzy searching, the software reduces a query term to its root form, making it possible to locate words spelled several different ways. The software will even rank the relevance of documents it finds by the number of times the root word appears. Taking the same approach one step further, search and retrieval is going beyond text searching and is being applied to other forms of information, including digital images, video and sound.
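Reducing a term to its root form and ranking by root-word count can be illustrated with a short sketch. The suffix list below is deliberately crude and is an assumption for this example; commercial engines use far more careful rules, and the sample documents are hypothetical.

```python
import re
from collections import Counter

# A deliberately crude suffix list, assumed for this sketch only.
SUFFIXES = ("ing", "ed", "es", "s")


def root(word):
    """Strip one common suffix, keeping at least three letters of the word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def root_counts(text):
    """Count how often each root form appears in the text."""
    return Counter(root(w) for w in re.findall(r"[a-z]+", text.lower()))


def rank(documents, query):
    """Return (doc_id, count) pairs, most occurrences of the query root first."""
    target = root(query.lower())
    hits = []
    for doc_id, text in documents.items():
        count = root_counts(text)[target]
        if count > 0:
            hits.append((doc_id, count))
    return sorted(hits, key=lambda pair: (-pair[1], pair[0]))


documents = {
    "a": "The district pumped water. Pumping continues while pumps are repaired.",
    "b": "One pump failed last week.",
    "c": "Budget hearing scheduled.",
}

print(rank(documents, "pumps"))  # [('a', 3), ('b', 1)]
```

A query for "pumps" matches "pumped", "pumping" and "pump" alike, and the document with the most occurrences of the root is listed first.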

"In the near future, we're going to see more information retrieval as opposed to just text," predicted Karen Shegda, associate managing editor for Datapro Information Services Group. As for search and retrieval today, Shegda said much has improved. "The products have gotten easier to use and the software is much more sophisticated." She mentioned that the leading products use pattern recognition, algorithmic and statistical systems that simplify queries while improving the precision of each search.

Search and retrieval technology has been around for at least 25 years. For most of that time, it has been used in mainframe and minicomputer applications. Today, the technology also runs on stand-alone PCs, workstations and client/server systems.

It is also entering new markets, thanks to the recent growth in electronic publishing, CD-ROM, document imaging, and information superhighway services, such as the Internet, Compuserve, Prodigy and America Online. With their huge databases and millions of customers, these services need search tools that are fast and easy to use.

Also aiding the growth of the search and retrieval market is the falling cost of mass storage for computers. As a result, corporate America and government are storing more information electronically than ever before.

Where once just a few vendors served the market, today there are many. Some of the leading names include: Excalibur Technologies Corp., Fulcrum Technologies, Information Dimensions Inc. (IDI), Personal Library Software Inc., Verity Inc., and ZyLAB. Typically, these vendors sell their search "engines" to other vendors, such as an imaging software developer, who integrates the engine into an imaging system.

According to Shegda, a full-fledged, sophisticated text retrieval system can cost under $500 per user in a network. Most packages range from $395 to $995 for a single user. The majority of the retrieval packages run under the Windows interface; a smaller number are available on the Macintosh operating system.

Text retrieval tools offer users several different ways to find the file or document they are looking for. They can search by keyword, which automatically retrieves exact matches. This is the simplest form of indexing and retrieving files. However, keyword searches require someone to define the relevant terms and assign them to the file. Incorrect keywords or the occasional keyword with dual meanings can reduce the accuracy of these kinds of searches.

The Boolean system, which was designed to overcome the shortcomings of keyword searches, relies on indexes that identify every word in a document. Users have the advantage of running queries on a large group of documents using the Boolean system. However, users often find that Boolean searches retrieve either too many or too few documents.
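Boolean queries over a full-word index reduce to set operations, which also shows why results can swing between too few and too many. The sketch below is illustrative only; the documents and helper names are assumptions.

```python
import re


def build_index(documents):
    """Map each word to the set of document IDs that contain it."""
    index = {}
    for doc_id, text in documents.items():
        for word in set(re.findall(r"[a-z0-9]+", text.lower())):
            index.setdefault(word, set()).add(doc_id)
    return index


def postings(index, word):
    """Return the documents containing the word, or an empty set."""
    return index.get(word.lower(), set())


# Hypothetical document collection.
documents = {
    1: "water rights and pipeline easement",
    2: "pipeline budget",
    3: "water budget hearing",
}
index = build_index(documents)

both = postings(index, "water") & postings(index, "pipeline")    # water AND pipeline
either = postings(index, "water") | postings(index, "pipeline")  # water OR pipeline
without = postings(index, "water") - postings(index, "budget")   # water AND NOT budget

print(sorted(both))     # [1]
print(sorted(either))   # [1, 2, 3]
print(sorted(without))  # [1]
```

In this tiny collection, AND narrows the result to a single document while OR returns everything, with no ranking to tell the user which hit matters most.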

Another search technique involves statistical systems, which rely on algorithms to determine a document's relevance according to the frequency with which a keyword appears in the document. Taking statistical searches one step further, concept-based searching - sometimes referred to as natural-language searching - allows users to create hierarchies for search terms. For example, the concept term "computer" might retrieve all documents that refer to PCs, mainframes and workstations. When used in combination with statistics, concept searches can be extremely useful.
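Combining a concept hierarchy with frequency-based ranking might look like the following sketch. The hierarchy, the documents and the scoring rule (a simple sum of term counts) are assumptions chosen for illustration, not a description of any particular product.

```python
import re
from collections import Counter

# Hypothetical concept hierarchy: a concept term and the narrower terms under it.
CONCEPTS = {
    "computer": ["pc", "pcs", "mainframe", "mainframes", "workstation", "workstations"],
}


def expand(term):
    """Return the term itself plus any narrower terms defined for it."""
    term = term.lower()
    return {term, *CONCEPTS.get(term, [])}


def score(text, terms):
    """Count how many times any of the terms appears in the text."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    return sum(counts[t] for t in terms)


def concept_search(documents, term):
    """Rank documents by total occurrences of the concept and its narrower terms."""
    terms = expand(term)
    scored = [(doc_id, score(text, terms)) for doc_id, text in documents.items()]
    hits = [(doc_id, s) for doc_id, s in scored if s > 0]
    return sorted(hits, key=lambda pair: (-pair[1], pair[0]))


documents = {
    "x": "The agency replaced its mainframe with PCs and one workstation.",
    "y": "Each computer lab received new PCs.",
    "z": "Water deliveries rose in July.",
}

print(concept_search(documents, "computer"))  # [('x', 3), ('y', 2)]
```

Document "x" never uses the word "computer" yet ranks first, because the concept term pulls in the narrower terms it covers.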

For scanned documents stored in imaging systems, full-text retrieval engines with "fuzzy searching" capabilities work best at overcoming problems created by OCR when it converts document images into text. Despite advancements in the technology, OCR is still finicky and can misspell words. According to Excalibur Technologies - a developer of text retrieval software - fuzzy searching can speed up searches by enabling people to find information even if documents are misfiled or if words in an index are misspelled.

Fuzzy searching is based on adaptive pattern technology, a form of intelligent software that allows users to index and retrieve documents based on repeating patterns in data. In essence, the technology allows people to ask a computer, "Have you seen anything that resembles this?" The object of a fuzzy search doesn't have to be words. It can also search for pictures, a video clip, a fingerprint or any other type of digital data, according to Excalibur.
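The effect of a misspelling-tolerant lookup can be approximated with ordinary string similarity. The sketch below uses Python's standard difflib module as a stand-in; it is not Excalibur's adaptive pattern recognition, and the index contents and the 0.8 similarity cutoff are assumptions made for the example.

```python
import difflib

# Hypothetical index built from OCR output; two entries contain OCR misreads.
index = {
    "reservoir": {"doc-1"},
    "resevoir": {"doc-2"},   # a letter dropped during recognition
    "pipeline": {"doc-3"},
    "plpeline": {"doc-4"},   # "i" misread as "l"
    "budget": {"doc-5"},
}


def fuzzy_search(query, index, cutoff=0.8):
    """Return documents indexed under any word that closely resembles the query."""
    matches = difflib.get_close_matches(query.lower(), list(index), n=10, cutoff=cutoff)
    found = set()
    for word in matches:
        found |= index[word]
    return sorted(found)


print(fuzzy_search("reservoir", index))  # ['doc-1', 'doc-2']
print(fuzzy_search("pipeline", index))   # ['doc-3', 'doc-4']
```

An exact-match lookup would miss "doc-2" and "doc-4" entirely. Lowering the cutoff finds more damaged words at the price of more false hits.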


Market revenues for the text retrieval industry are expected to hit $552 million in 1995, according to Delphi Consulting Group, a document management firm. Government represents almost one third of that market, but most of that share belongs to the federal sector. State and local government has only a five percent segment of the market, according to Delphi, but if the education and library market is included, then the non-federal government share rises to 11 percent.

According to Shegda, common text retrieval applications range from litigation support in the legal field to customer service, such as correspondence tracking, to technical document management. In the government sector, text retrieval has helped agencies deal with the administrative burdens of regulatory compliance and corporate filings, such as the Uniform Commercial Code. The technology has also been useful in the field of legislative support.


The San Bernardino Valley Municipal Water District is typical of government agencies that need and use text retrieval. It also represents where state and local governments are headed with the leading-edge technology.

The District is a water wholesaler, providing supplemental water to what Robert Tincher calls its "retail market": the cities of San Bernardino Valley, Calif. Tincher, who is water resource manager for the District, said the agency serves more than 600,000 people and operates on a $20 million annual budget. Just 12 staff people run the entire operation.

Like any other bureaucracy, the agency is awash in paper. "We receive and produce lots of paper and have no efficient way to get at the information stored on the documents," commented Tincher. And like a growing number of government agencies, the Water District has reduced its support staff and increasingly relies on computers to do its office work.

A key piece of technology is the document imaging system the District uses for storing, filing, indexing and retrieving all project files, correspondence, legal documents and financial records. The system runs on a hardware platform from Digital Equipment, with Excalibur software for document imaging, indexing and retrieval.

According to Tincher, the system provides the staff with the means to retrieve documents by keyword, Boolean or fuzzy searches. "The fuzzy search capability takes spelling out of the loop," he explained. When a search is entered into the computer, the system provides the user with a list of hits and ranks them according to which document had the most keyword occurrences. When a user selects an item from the retrieval list, the software brings the document image - not the text file - to the screen.

"The Excalibur system is excellent for our small office, where we can't afford a file clerk," said Tincher. He added that the software finds and retrieves documents very quickly and is easy to operate for casual users.


With more government information being stored electronically, finding files and information becomes more difficult. That makes text retrieval more important than ever. But choosing and integrating retrieval software into new or existing applications requires careful consideration. Shegda cited three key issues that must be addressed: standards, storage requirements and databases.

"You need to make sure that the search engine you choose can read the native file formats you use, such as Microsoft Word or Word Perfect," she said. "Some only provide limited support. If the search engine doesn't support one of your standard formats, then the file has to be translated into ASCII text before it can be indexed and searched," Shegda pointed out.

Storage requirements must be analyzed carefully, because indexes for full-text retrieval require large amounts of storage space. When text retrieval is integrated with an imaging system, storage requirements can be considerable. These integrated systems typically have two databases: one holding the full text and another containing keywords and pointers to the document image files. Search and retrieval times can slow down when the databases are large.


Where's text retrieval headed? Well, it probably won't be called "text" retrieval in about five years, according to Shegda. By then, retrieval technology - combined with more mature adaptive pattern recognition capabilities - will be more active in the multimedia environment. Also, retrieval tools will be more of a commodity item, embedded in word processing software or even operating systems, making them accessible to virtually all computer users. That's good news for government information gatherers.