Open Data File Formats

Below is a selection of file formats taken from the California Open Data Handbook. Each entry includes a brief description for quick reference.

JSON

JSON is a simple file format that is very easy for any programming language to read. Its simplicity means that it is generally easier for computers to process than other formats.
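As a rough illustration of how little code reading JSON takes, here is a minimal Python sketch using the standard library's json module. The record and its field names are invented for the example and are not taken from the handbook.

```python
import json

# Hypothetical record; the field names are made up for illustration.
raw = '{"agency": "Parks", "year": 2014, "sites": ["North", "South"]}'

record = json.loads(raw)      # parse JSON text into a Python dict
print(record["agency"])       # Parks
print(len(record["sites"]))   # 2

# Writing the same structure back out is just as direct.
round_tripped = json.dumps(record, sort_keys=True)
```

Most mainstream languages ship a comparable parser, which is a large part of the format's appeal.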

XML

XML is a widely used format for data exchange because it lets users maintain data structure, and allows developers to include documentation without interfering with reading of the data.
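The sketch below shows one way this can look in practice, using Python's standard xml.etree.ElementTree module. The element names and values are hypothetical. The comment in the XML acts as documentation for a human reader, while the default parser simply skips it.

```python
import xml.etree.ElementTree as ET

# Hypothetical data set; the element names are assumptions for this example.
raw = """<permits>
  <!-- amount is in US dollars -->
  <permit id="1"><type>building</type><amount>250</amount></permit>
  <permit id="2"><type>event</type><amount>75</amount></permit>
</permits>"""

root = ET.fromstring(raw)  # the default parser discards comments

# The nesting of the elements preserves the structure of each record.
permits = [
    {
        "id": p.get("id"),
        "type": p.findtext("type"),
        "amount": int(p.findtext("amount")),
    }
    for p in root.findall("permit")
]
```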

RDF

RDF represents data in a form that makes it easier to combine data from multiple sources. RDF data can be stored in XML and JSON, among other serializations. RDF encourages the use of URLs as identifiers, which provides a convenient way to directly interconnect existing open data initiatives on the Web. Use of RDF is not widespread, but it has been a trend among open government initiatives.
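To give a feel for why shared URL identifiers make combining easier, here is a simplified Python sketch. It is not real RDF tooling: plain tuples stand in for RDF statements, and every URL is hypothetical, on the reserved example.org domain. A real project would more likely use a dedicated RDF library.

```python
# Each statement is a (subject, predicate, object) triple.
# All URLs are hypothetical and exist only for this illustration.
PARK = "http://example.org/id/park/42"

# Two independent publishers describe the same park.
source_a = {
    (PARK, "http://example.org/vocab/name", "Riverside Park"),
}
source_b = {
    (PARK, "http://example.org/vocab/acres", 12),
}

# Because both sources use the same URL for the park,
# combining them is a simple set union.
merged = source_a | source_b


def describe(graph, subject):
    """Collect every predicate/value pair known about one subject."""
    return {pred: obj for subj, pred, obj in graph if subj == subject}


park = describe(merged, PARK)
```

After the merge, `park` holds both the name from the first source and the acreage from the second, with no extra matching step.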

Spreadsheets

Many agencies have information in spreadsheet formats like Microsoft Excel. This data can often be used immediately, provided it comes with a correct description of what each column means. Spreadsheets may also contain macros and formulas, however, which can be more cumbersome to handle.

Comma-Separated Files (CSV)

CSV files are compact and thus well suited to transferring large sets of data with the same structure. However, the format is so spartan that data are often useless without documentation, since it can be almost impossible to guess the significance of the different columns. Furthermore, it is essential that the structure of the file is respected: every row needs the same fields in the same order, because a missing or extra field shifts the values that follow it into the wrong columns.
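A small Python sketch, using the standard csv module, shows both points: the header row is the only hint about what the columns mean, and a row that breaks the structure has to be caught explicitly. The column names and values are invented for the example.

```python
import csv
import io

# Hypothetical file contents; the third line is missing its "year" field.
raw = "site_id,visitors,year\nN1,1200,2013\nS2,950\n"

reader = csv.reader(io.StringIO(raw))
header = next(reader)  # the header is the only built-in documentation

good, bad = [], []
for line_no, row in enumerate(reader, start=2):
    if len(row) == len(header):
        good.append(dict(zip(header, row)))
    else:
        bad.append(line_no)  # remember which line broke the structure
```

Here `good` ends up with the one complete row and `bad` records line 3 as malformed. Note that every value arrives as a string, since CSV carries no type information.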

Text Document

Classic documents in formats like Word, ODF, OOXML, or PDF may be sufficient to show certain kinds of data -- for example, relatively stable mailing lists or equivalent. This approach may be inexpensive because most data are born in these formats. But the formats give no support for keeping the structure consistent, which often makes it difficult to process the data by automated means. As a general rule, data should not be published in a word processing format if it already exists in a different format.

Plain Text

Plain text documents (.txt) are easy for computers to read. However, they generally carry no structural metadata, meaning that developers will need to write a parser that can interpret each document as it appears. Problems can also arise when moving plain text files between operating systems, for example because different systems mark line endings differently.
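The sketch below illustrates both issues in Python. The "Key: value" layout is an assumption made for this example; a real text file could use any layout, which is exactly why a custom parser is needed. The two sample strings differ only in their line endings (Windows-style versus Unix-style).

```python
# The same hypothetical document saved on two different systems.
windows_text = "Name: Riverside Park\r\nAcres: 12\r\n"
unix_text = "Name: Riverside Park\nAcres: 12\n"


def parse(text):
    """Parse 'Key: value' lines; this layout is assumed, not standard."""
    fields = {}
    for line in text.splitlines():  # splitlines() handles \n and \r\n alike
        if ":" in line:
            key, value = line.split(":", 1)
            fields[key.strip()] = value.strip()
    return fields


# A naive split on "\n" leaves a stray carriage return on Windows files.
naive_first_line = windows_text.split("\n")[0]
```

With `splitlines()`, both versions parse to the same dictionary, while `naive_first_line` still ends in a `\r` character.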

Scanned Image

This is probably the least suitable form for most data, but both TIFF and JPEG-2000 formats can at least be marked with documentation of what is in the picture. Scanned image formats may be relevant for displaying information that wasn't born electronically -- for example, old church records and other archival material where a picture is better than nothing.

HTML

Today, much data is available in HTML format on various sites. This may well be sufficient if the data is very stable and limited in scope. In some cases, it could be preferable to have data in a form that's easier to download and manipulate. But it's cheap and easy to refer to a page on a website, so HTML can be a good starting point in the display of data.
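To show what manipulating HTML data can involve, here is a minimal Python sketch that pulls the rows out of a table using the standard html.parser module. The page content is hypothetical, and the parser assumes a single, well-formed table with every cell inside a row. Real pages are often messier.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the cell text of each table row (assumes well-formed markup)."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])          # start a new row
        elif tag in ("td", "th"):
            self._in_cell = True
            self.rows[-1].append("")      # start a new, empty cell

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


# Hypothetical page fragment.
page = (
    "<table><tr><th>Site</th><th>Visitors</th></tr>"
    "<tr><td>North</td><td>1200</td></tr></table>"
)

parser = TableRows()
parser.feed(page)
parser.close()
```

Even for this tiny table the extraction takes noticeably more code than reading a structured data format, which is why a downloadable format is often preferable alongside the page.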

Jason Shueh  |  Staff Writer, Government Technology

Jason Shueh is a staff writer for Government Technology magazine. His articles and writing have covered numerous subjects, from minute happenings to massive trends. A San Francisco Bay Area native, Shueh grew up in the East Bay and Napa Valley, where his family is based. His writing has been published previously in the Tahoe Daily Tribune, Amazon Publishing, Bike Magazine, Diablo Magazine, The Sierra Sun, Nevada Appeal, The Union and the North Lake Tahoe Bonanza.