Complete Recognition: Computers Read Agency Documents

Recognition systems can read documents and forms. Be prepared to pay for accuracy, however.

August 12, 2010

Would you trust a computer to read your most precious documents? Surprisingly, some risk-averse government agencies are relying on computers to tell them what's written on their documents and forms. Tax returns, recreational license applications, police reports, traffic citations and court dockets are just a few examples of the many forms and documents computers are reading these days.

Both the number of agencies using recognition systems and the types of documents computers are reading have steadily grown in recent years. States and localities are taking advantage of improvements in hardware and software that, over the years, have generated significant benefits. For example, a number of state revenue agencies now use recognition software to read tax returns. As a result, they have dramatically reduced processing costs.

How dramatic can it get? A recognition system used in high-volume applications can reduce data-entry costs by as much as 70 percent, according to Arthur Gingrande Jr., a partner with Imerge Consulting. Properly applied, a recognition system can shrink an organization's data-entry labor force by as much as 60 percent.

Recognition technology also delivers indirect benefits to workers, such as fewer cases of carpal tunnel syndrome, repetitive stress injuries and eyestrain. But these benefits don't come easily. The harder it becomes to read the characters on a form or document, the more likely an error will occur. More errors mean more time spent manually correcting what the computer misread. With high error rates, benefits can quickly evaporate. Difficult-to-read documents and forms also require more expensive solutions. If your agency wants computers to read handwriting, expect to pay a bundle to get it done.

ACCEPT NO SUBSTITUTE

Recognition software, referred to as OCR (optical character recognition) and ICR (intelligent character recognition), has steadily grown in importance as a subset of document imaging technology. Both technologies convert visually readable characters into ASCII text, which a computer can store, edit and process. OCR, which was developed first, recognizes type fonts by pattern matching, character assessment and a crude learning process. ICR reads hand printing and, to a lesser extent, handwriting.

OCR, which requires less computing power than ICR to recognize type fonts, can be installed in low-end imaging systems for a few hundred dollars. Customers can purchase special scanners with OCR built in, so recognition takes place as the typed documents are scanned. ICR, on the other hand, requires lots of computing horsepower to recognize and read hand-printed letters and numbers. While some low-end versions of ICR exist, their results can be quite dismal. Apple Computer installed a simple version of ICR on its handheld computer, the Newton, that ended up making recognition software look like a bad joke.

What made some of the Newton's attempts at reading hand-printed characters so funny was something called contextual editing -- one of the many tools employed in recognition systems to bolster the accuracy of OCR and ICR. Accuracy is the holy grail of recognition technology -- impossible to reach, but always sought after.

OCR and ICR software measures accuracy based on the mistakes it knows it made. When the software can't decipher a character, it will flag the character and, at the end of the job, present the user with the percentage of errors it made. The biggest problem with recognition software lies with substitution errors. These occur when the OCR or ICR software (called the engine) is convinced it has read a character correctly when, in fact, it's wrong. An OCR engine claiming 98 percent accuracy may actually have a true accuracy rate of 93 percent when substitution errors are factored in.
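The gap between reported and true accuracy comes down to simple arithmetic. The sketch below is illustrative only: the function name and the character counts are assumptions chosen to reproduce the 98 and 93 percent figures above, not data from any vendor or engine.

```python
def accuracy_rates(total_chars, flagged, substitutions):
    """Return (reported, true) accuracy as percentages.

    flagged: characters the engine knows it could not read.
    substitutions: characters the engine misread without noticing.
    """
    reported = 100.0 * (total_chars - flagged) / total_chars
    true = 100.0 * (total_chars - flagged - substitutions) / total_chars
    return reported, true


# Hypothetical job: 10,000 characters, 200 flagged, 500 silent misreads.
reported, true = accuracy_rates(10_000, 200, 500)
print(f"reported: {reported:.1f}%  true: {true:.1f}%")
```

The engine can only report the first number, because it has no way of counting mistakes it does not know it made. The second number shows up only when a person checks the output against the original page.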

There are three stages in the recognition process that affect accuracy. The prerecognition stage covers everything from the type of paper that will be scanned and the design of the form to the actual scanning process. Colored paper and forms with narrow constraints for hand printing can reduce accuracy. Once the document is scanned, time must be spent removing the paper noise -- those speckles, blotches and indecipherable marks that slow down OCR processing and reduce accuracy. Scanned documents that are skewed are also hard to recognize.
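As a rough illustration of noise removal, the sketch below clears isolated black pixels from a tiny black-and-white bitmap. It is a deliberately simple assumption about how speckle removal might work. Commercial cleanup software is considerably more sophisticated, and the function name and sample page are hypothetical.

```python
def despeckle(bitmap):
    """Clear black pixels (1) that have no black neighbours.

    bitmap: list of rows, each a list of 0 (white) or 1 (black).
    Returns a new bitmap; the input is not modified.
    """
    height = len(bitmap)
    width = len(bitmap[0]) if bitmap else 0
    cleaned = [row[:] for row in bitmap]
    for y in range(height):
        for x in range(width):
            if bitmap[y][x] != 1:
                continue
            # Count black pixels among the (up to) eight surrounding cells.
            neighbours = sum(
                bitmap[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            )
            if neighbours == 0:
                cleaned[y][x] = 0
    return cleaned


# Hypothetical scan: one stray speck in the corner, one real stroke.
page = [
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
]
cleaned_page = despeckle(page)
```

The lone pixel in the corner is cleared, while the three connected pixels that form part of a character are left alone.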

Fortunately, hardware and software have improved significantly to handle most of the problems related to scanning. However, quality control, still a human endeavor, is necessary to correct the worst problems before recognition begins.

The second stage involves the actual recognition process. In the early days of OCR and ICR, systems used one recognition engine. Today, you will find at least three of these engines in a high-volume recognition system. Each engine attempts to recognize a character, and the engines then "vote" on what that character is likely to be. While expensive, voting engines can improve recognition accuracy by up to 50 percent.
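Voting can be pictured as a simple majority count across engines. The sketch below is a minimal illustration under that assumption. Real voting schemes may weight engines by confidence, and the function names and sample readings here are hypothetical.

```python
from collections import Counter


def vote(readings):
    """Pick the character most engines agree on.

    readings: one guess per engine for the same character position.
    Returns (char, agreed); agreed is False when no strict majority
    exists, so the character can be flagged for human review.
    """
    counts = Counter(readings)
    char, votes = counts.most_common(1)[0]
    agreed = votes > len(readings) / 2
    return char, agreed


def vote_field(engine_outputs):
    """Vote position by position across equal-length engine outputs."""
    return [vote(column) for column in zip(*engine_outputs)]


# Three hypothetical engines read the same ZIP code field.
results = vote_field(["90210", "9O210", "90Z10"])
text = "".join(char for char, _ in results)
print(text)
```

Each engine here makes a different mistake, so the other two outvote it at every position. When all engines disagree, no majority exists and the character is flagged rather than guessed.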

Post-recognition is probably the most elaborate stage of the entire process. It's here the system tries to clean up and correct the errors it made during recognition. One process involves contextual editing, which lets the computer check recognized data for correctness. Lookup tables, dictionaries and automated spell checkers are some tools used to trap errors before they get passed to a database.

For example, in forms processing, lookup tables can check addresses, ZIP codes and Social Security numbers. These validation tools know a ZIP code must have five digits and no letters and can flag those that are wrong. However, a dictionary or spell checker would be useless in validating forms, since most names and addresses are unique.
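A ZIP code check of this kind might look like the following sketch. The two-entry lookup table and the function name are assumptions for illustration. A production system would load a complete postal directory.

```python
import re

# Hypothetical lookup table; a real system would load a full ZIP directory.
KNOWN_ZIPS = {"95814": "Sacramento", "12207": "Albany"}

# Exactly five ASCII digits, no letters.
ZIP_PATTERN = re.compile(r"[0-9]{5}")


def check_zip(value):
    """Return None if the value passes, otherwise a reason to flag it."""
    if not ZIP_PATTERN.fullmatch(value):
        return "not five digits"
    if value not in KNOWN_ZIPS:
        return "not in lookup table"
    return None


for field in ["95814", "9581A", "00000"]:
    print(field, check_zip(field) or "ok")
```

The format rule catches misread characters such as a letter in place of a digit, while the lookup table catches values that are well formed but do not exist.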

Once the errors have been flagged, the recognition system can pass the form or document to an operator for manual repair. To speed up this human process, some recognition systems employ special features that display images of the questionable characters in one row and data from the computer's recognition process in another row. The operator simply validates what the computer did and overrides any real errors.

VOTE OF CONFIDENCE

There are dozens of recognition systems on the market today, ranging from simple desktop solutions that sell for a few hundred dollars to sophisticated, high-volume systems that cost tens of thousands of dollars. At the low end, Caere and Xerox dominate the PC market. High-end vendors include TopImage Systems, FormWare, Cardiff, Prime Recognition, NCS and ReadSoft.

What's important is to understand your needs first and then find the best match. Recognition systems for forms processing differ greatly in scope and cost from recognition systems that read documents to create indexes.

Are these systems worth it? Vendors say steady improvements have turned recognition technology into a beneficial tool. State and local governments appear to believe in the technology. Proof can be found in the growing number of state revenue agencies that trust recognition systems to read tax returns, probably the single most sensitive and important document the public turns over to the government. For years, California's Franchise Tax Board severely limited the use of recognition technology for tax processing. This year, plans are under way to use ICR to read millions of hand-printed tax returns.

Tod Newcombe

With more than 20 years of experience covering state and local government, Tod previously was the editor of Public CIO, e.Republic’s award-winning publication for information technology executives in the public sector. He is now a senior editor for Government Technology and a columnist at Governing magazine.
