No-Translation Internet Publishing

Washington State's Department of Information Services put data on the Internet using existing data formats to spare time and expense.

April 30, 1996
PROBLEM/SITUATION: How to put government data on the Internet without translating it into some consistent format.
SOLUTION: Server with Web browser interface and text retrieval engine.
JURISDICTION: Washington State, Washington Office of the Administrator of the Courts, Washington Department of Social and Health Services.
VENDORS: Apple Computer, Microsoft, Aldus, Netscape, Adobe, WebStar, WordPerfect, FrameMaker, Ragtime, Nisus, PICT.
CONTACT: Larry Hewitt.

"Publish once" is one of the new buzzwords on the Internet -- placing your internal or public documents on the Internet in a usable format, and transferring the reproduction burden and cost to the end user. Distributed documentation -- and it sounds so easy.

But what do you do with those thousands of existing documents in your system? How do you publish legacy archives of data? Do you spend the money to create yet another form of archive and retrieval for the Internet? Worse yet, do you decide to translate all those thousands of documents into HTML, plain text or Adobe Acrobat Portable Document Format? What is the cost of converting and maintaining duplicate sets of documents?

These questions were recently put to the Washington State Department of Information Services Strategic Initiatives Group by the Department of General Administration (GA). Here is the problem in a nutshell:

GA handles hundreds of procurement documents defining approved products for purchase by state agencies, from cameras to toilet paper. GA wanted to place these documents in a searchable format available through a network to facilitate reaching the widest possible audience. The solutions they initially researched would require them to convert existing documents to compatible formats. Ongoing maintenance of two sets of data, the original source documents and the proprietary searchable ones, would involve considerable additional expense.

GA asked Strategic Initiatives to determine if there were any suitable technologies that met the following objectives:

* Documents would be searchable using readily available network tools.

* Documents would be retrievable as text and thereby available for use by customer workstations.

* Documents would be processed by the search engine without translating into additional formats.

It is possible that a variety of original source documents, from multiple kinds of programs and platforms, could be included in the document inventory.

This request for a strategic technology analysis happened to coincide with a seminar put on by Apple Computer on the latest Internet workgroup servers running on the PowerPC platform. At this seminar, the Strategic Initiatives Group learned about AppleSearch, Apple Computer's text retrieval engine, and the Web interface that is bundled with the Internet Server package. I secured an Apple 9500 workgroup server on loan from Apple Computer and placed a test server within the DIS firewall to put AppleSearch to the test.

To bring the Apple PowerPC server up on the Internet was very simple. The software was virtually preconfigured, and with the installation of a Token Ring card and connection to the internal network, the WebStar server software was operational in just a few minutes. By day's end, after reading the AppleSearch documentation and experimenting with the software, I had a working directory with 10 different file types, from three different computer platforms, available for searching and retrieval with the Web interface.

I proceeded to push the software to its limits, and beyond, to determine what limitations were to be found in a real working environment. I threw some very strange files at AppleSearch, including Microsoft PowerPoint, Aldus Persuasion, proprietary help files from internal documentation, and every kind of text-derivative file I could find in the suite of software tools at my disposal.

Since the interface was Web browser-based, there was no difference between using the search engine from a Windows PC, a Macintosh or a UNIX workstation.

The documents were placed in a directory on the server and indexed. Running the AppleSearch.ACGI (asynchronous common gateway interface) program from a Netscape 1.1 browser, I found that text retrieval was, for the most part, fast and accurate. Retrieved documents were converted into HTML format and displayed on screen by the application. Complex searches could be built using a respectable array of search commands.
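The index-then-search workflow described above can be illustrated with a minimal sketch. This is not AppleSearch's implementation, which the article does not describe; it assumes the text has already been extracted from each source document, and the function names (`build_index`, `search`, `as_html`) and sample documents are hypothetical.

```python
import html
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each word to the set of document names that contain it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in tokenize(text):
            index[word].add(name)
    return index


def search(index, query):
    """Return the names of documents that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return []
    matches = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(matches)


def as_html(name, text):
    """Wrap extracted text in a minimal HTML page, escaping markup characters."""
    return "<html><head><title>{}</title></head><body><pre>{}</pre></body></html>".format(
        html.escape(name), html.escape(text)
    )


# Hypothetical sample data standing in for extracted document text.
documents = {
    "cameras.txt": "Approved 35mm cameras & lenses for state agencies",
    "paper.txt": "Approved paper products for state agencies",
}
index = build_index(documents)
hits = search(index, "approved cameras")
page = as_html(hits[0], documents[hits[0]])
```

The sketch only covers an implicit AND of all query words; a real engine would add phrase, proximity and relevance ranking on top of the same kind of index.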

The search engine can also be accessed using the AppleSearch Client software, which runs under Windows as well as the MacOS. It is bundled with the server and freely distributable if you own the server software. The client software adds functionality, including the ability to save search criteria, called "reporters," for later use; to schedule searches for off-hours; and to add WAIS-type servers to the available searchable resources.

The AppleSearch Client also has the capability of retrieving the original source document, regardless of its format, and saving this to a local hard drive. I was successful in searching an indexed text version of a PageMaker document for analysis, and then was able to retrieve the original document, complete with graphics, which opened as expected with the PageMaker program.

Some improvements could be made to the translators for certain kinds of documents. The AppleSearch.ACGI had difficulty distinguishing between soft and hard returns in some documents, resulting in some improperly wrapped documents when viewed as HTML. The program does have an auto-wrap option which can be applied to these text documents, and formats paragraphs very well. The drawback is the loss of formatting in tables or columns when this feature is applied. However, the document, when saved to a local disk for retrieval by a word processor, did retain the original tabs and hard returns which did not translate well into HTML.
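The trade-off can be seen in a small sketch of one plausible auto-wrap rule. This is an assumption for illustration, not the ACGI's actual algorithm: single line breaks are treated as soft returns and joined, blank lines are kept as paragraph breaks, and runs of whitespace are collapsed. The function name `auto_wrap` and the sample text are hypothetical.

```python
import re


def auto_wrap(text):
    """Join single line breaks within a paragraph; keep blank lines as paragraph breaks."""
    paragraphs = re.split(r"\n\s*\n", text.strip())
    # Collapse every run of whitespace (tabs and single newlines included) to one space.
    return "\n\n".join(" ".join(p.split()) for p in paragraphs)


prose = "Documents are placed in a\ndirectory and indexed.\n\nSearches run from a browser."
table = "Item\tPrice\nCamera\t250\nPaper\t12"

wrapped_prose = auto_wrap(prose)   # paragraphs reflow cleanly
wrapped_table = auto_wrap(table)   # rows and tab stops run together
```

Under this rule the prose comes back as two tidy paragraphs, while the tab-delimited table collapses into the single line `Item Price Camera 250 Paper 12`, which is the kind of formatting loss described above.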

The set of translators I worked with did not include Microsoft Word 6.0 or PageMaker 5.0. This is to be remedied in the near future, according to Apple.

The ability to retrieve the original source document is not yet a part of the Web interface which I tested, but including it should not present a large technical problem.

The initial ACGI we used did not support WAIS servers, which the client software did. However, I downloaded a newer version from Apple's Web site which included the WAIS server capability. With this new ACGI installed, I could define both internal servers and a selection of WAIS servers of my choosing as document sources. It also had the ability to highlight the matched search words.
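Highlighting matched words in an HTML result can be sketched as follows. This is an illustrative assumption about how such a feature might work, not the ACGI's code; the function name `highlight`, the use of `<b>` tags, and the whole-word matching rule are all hypothetical choices.

```python
import html
import re


def highlight(text, query_words):
    """Escape text for HTML and wrap whole-word, case-insensitive matches in <b> tags."""
    words = [re.escape(w) for w in query_words if w]
    if not words:
        return html.escape(text)
    pattern = re.compile(r"\b(?:" + "|".join(words) + r")\b", re.IGNORECASE)
    parts = []
    last = 0
    for match in pattern.finditer(text):
        # Escape the unmatched text and the match separately so tags are never escaped.
        parts.append(html.escape(text[last:match.start()]))
        parts.append("<b>" + html.escape(match.group(0)) + "</b>")
        last = match.end()
    parts.append(html.escape(text[last:]))
    return "".join(parts)


result = highlight("Approved cameras & Camera bags", ["camera", "approved"])
```

Matching is done on the raw text before escaping, so a query word can never accidentally match inside an HTML entity such as `&amp;`.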

The server's ability to index and retrieve PageMaker documents was a major bonus. The AppleSearch translators first read threaded text, keeping the stories intact regardless of the placement of columns, and then jumped back to the first column, compiling text in successive columns regardless of the existence of independent text blocks and headings. In a well-conceived publication, the prospect of returning a sensible text version, in spite of columns and graphics, was very good.

Finally, I discovered that there were many document types not listed in the AppleSearch online manual that also were searchable and which returned a respectable text document.

After throwing the kitchen sink at AppleSearch to determine where it would break, I contacted GA and requested some original source documents. These arrived in Microsoft Word format, and I placed them in the document folder for indexing. The procurement staff at GA was then invited to review how the software functioned.

The procurement documents relied heavily on the use of tables and tabs in the formats, so there were some inconsistencies in auto-wrapped documents. This problem was considered very minor, however, for two reasons: 1) Many of the document requests were for pieces of information from the original source, not the entire document. A customer can log on to the system, search for the appropriate documents, copy and paste the desired information, and quit without having to reproduce large documents. 2) The retrieved text version can be saved to disk and easily reworked by the client. As stated earlier, the original tabs and hard returns are retrieved, and the document can be easily reformatted using templates.

In addition, the original source document can be retrieved using the bundled client software, which can in turn be freely distributed through the host site to most customer workstations operating on the same network.

The array of tools available to search, retrieve and manipulate the documents was compelling enough to impress the procurement staff. GA estimated that not having to convert the 700 or so documents it needs to index from this one group alone represents a savings of over 120 hours of labor on the initial load. Additional savings from maintaining only a single set of documents would also be anticipated.
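As a rough check on what GA's estimate implies per document, the figures can be divided out. Both inputs are approximate in the source ("700 or so" documents, "over 120 hours"), so the result is a ballpark lower bound rather than a measured conversion time.

```python
documents = 700      # "700 or so" documents, per GA's estimate
hours_saved = 120    # "over 120 hours", so treated here as a lower bound

# Implied conversion effort avoided per document, in minutes.
minutes_per_document = hours_saved * 60 / documents
```

That works out to a little over 10 minutes of conversion labor avoided per document.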

I created a Web front-end to highlight the demonstration, provide examples of complex searches, and list supported file types to make a more professional demonstration. But the ACGI, AppleSearch, and WebStar software were stock -- out of the box. The anticipated hardware cost for this system is under $9,000. The same package, running in a less powerful box but using the very same software, can be purchased for under $3,000.

Strategic Initiatives has identified many additional potential uses of this technology. At the recent Information Processing Manager's Association Fall Forum, many state and local government customers had the opportunity to see the demonstration in person, and were impressed by the ease of use and power of this publishing solution.

Since the conclusion of the pilot, I have demonstrated this capability to Tacoma City Light -- a large power utility serving the Pacific Northwest. Tacoma City Light has purchased the MacOS-based server, and is planning to implement AppleSearch for various projects including indexing technical documentation for hardware, software and computer-related systems.

In addition, the Office of the Administrator of the Courts for Washington State is considering implementing this solution for indexing court documents. The Department of Social and Health Services Internet group will also be prototyping a document search application in the coming months using AppleSearch.


The following file types are supported by the AppleSearch software, according to the documentation:

* Microsoft Word for Windows, versions 1.0 and 2.0

* Microsoft Word for Macintosh, versions 3.0, 4.0, 5.0 and 5.1

* Microsoft Excel, tab delimited text

* WordPerfect for Windows, version 5.1

* WordPerfect for Macintosh, versions 1.0, 2.0, and 2.1

* FrameMaker MIF, versions 2.0 and 3.0

* PageMaker, version 4.0

* Ragtime, version 3.1

* AppleWorks

* WriteNow, versions 2.0 and 3.0

* MacWrite, versions 4.5 and 5.0

* Nisus, version 3.0

* PICT, text extraction only

* Microsoft Mail, generic text documents, including HTML and many other mail formats

Additional translators for other document formats can be ordered from various vendors.