An explanation of open data, why it's important and how you can do it yourself.
Below is a selection of file formats taken from the California Open Data Handbook. Each has a brief description for a quick reference.
JSON is a simple file format that is very easy for any programming language to read. Its simplicity means that it is generally easier for computers to process than other formats.
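As a rough illustration of how little code it takes to read JSON, here is a minimal Python sketch using the standard library. The record and its field names are invented for the example and do not come from any real dataset.

```python
import json

# Hypothetical record; the field names are assumptions for illustration only.
raw = '{"agency": "Parks", "year": 2014, "sites": ["North", "South"]}'

# One call turns the text into ordinary Python dicts, lists, strings and numbers.
record = json.loads(raw)

print(record["agency"])      # Parks
print(len(record["sites"]))  # 2

# Writing the data back out as JSON text is equally short.
text_again = json.dumps(record, indent=2)
print(text_again)
```

Most other languages offer an equivalent one-call parser, which is a large part of why the format is convenient for developers.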
XML is a widely used format for data exchange because it lets users maintain data structure, and allows developers to include documentation without interfering with reading of the data.
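The point about documentation can be sketched in a few lines of Python. In this hypothetical example (the element names and values are made up), a comment explains the scoring scale, and the standard library parser skips it while reading the data.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; element names and values are assumptions.
raw = """<inspections>
  <!-- score: 0 to 100, higher is better -->
  <inspection id="1">
    <site>North</site>
    <score>92</score>
  </inspection>
  <inspection id="2">
    <site>South</site>
    <score>78</score>
  </inspection>
</inspections>"""

# The default parser discards comments, so only data elements are in the tree.
root = ET.fromstring(raw)

rows = [
    (item.get("id"), item.findtext("site"), int(item.findtext("score")))
    for item in root.findall("inspection")
]
print(rows)
```

The nesting of `inspection`, `site` and `score` is the preserved data structure the description refers to.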
RDF makes it possible to represent data in a form that makes it easier to combine data from multiple sources. RDF data can be stored in XML and JSON, among other serializations. RDF encourages the use of URLs as identifiers, which provides a convenient way to directly interconnect existing open data initiatives on the Web. Use of RDF is not widespread, but it has been a trend among open government initiatives.
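To show why URL identifiers make combining sources easier, here is a deliberately simplified Python sketch that models RDF statements as plain (subject, predicate, object) tuples. It does not use a real RDF library or serialization, and the example.org URLs and vocabulary terms are placeholders invented for the illustration.

```python
# Each statement is a (subject, predicate, object) triple.
# All URLs below are hypothetical placeholders.
PARK = "http://example.org/park/1"
NAME = "http://example.org/vocab/name"
ACRES = "http://example.org/vocab/acres"

# Two publishers describe the same park independently.
city_data = {(PARK, NAME, "North Park")}
state_data = {(PARK, ACRES, "42")}

# Because both use the same URL for the park, merging is a set union.
merged = city_data | state_data

# Collect everything now known about the park.
about_park = {pred: obj for subj, pred, obj in merged if subj == PARK}
print(about_park)
```

No column matching or key reconciliation is needed here because the shared URL already identifies the park in both sources.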
Many agencies have information in spreadsheet formats like Microsoft Excel. This data often can be used immediately with the correct descriptions of what the different columns mean. But spreadsheets may also contain macros and formulas, which can be more cumbersome to handle.
Comma Separated Files
CSV files are compact and thus well suited to transferring large sets of data that share the same structure. However, the format is so spartan that the data are often useless without documentation, since it can be almost impossible to guess what the different columns mean. It is also essential that the structure of the file is respected: a single missing field can shift every value that follows it in that row.
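A short Python sketch of reading a CSV file with a header row and checking that each row keeps the expected structure. The column names and values are invented, and the check shown is only one possible approach.

```python
import csv
import io

# Hypothetical data; the second data row is missing its last field.
raw = "site,year,visitors\nNorth,2014,1200\nSouth,2014\n"

reader = csv.DictReader(io.StringIO(raw))

good, bad = [], []
for row in reader:
    # DictReader fills missing fields with None and stores surplus
    # fields under the key None, so either case marks a malformed row.
    if None in row or None in row.values():
        bad.append(row)
    else:
        good.append(row)

print(len(good), "well-formed row(s)")
print(len(bad), "malformed row(s)")
```

The header row is what gives the columns meaning here. Without it, the reader would have to rely on external documentation to know which position holds which value.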
Classic documents in formats like Word, ODF, OOXML, or PDF may be sufficient to show certain kinds of data -- for example, relatively stable mailing lists or equivalent. This approach may be inexpensive because much data is born in these formats. But the formats give no support for keeping the structure consistent, which often makes it difficult to process the data by automated means. Generally it is recommended not to publish data in a word processing format if the data exists in a different format.
Plain text documents (.txt) are easy for computers to read. However, they generally carry no structural metadata inside the document, meaning that developers will need to create a parser that can interpret each document as it appears. Moving plain text files between operating systems can also cause problems, such as differing line-ending conventions.
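As an illustration of the kind of custom parser this implies, here is a minimal Python sketch. It assumes a made-up layout of "Key: value" lines with a blank line between records. A real document would need its own rules. The sample deliberately mixes Windows and Unix line endings.

```python
# Hypothetical text layout: "Key: value" lines, blank line between records.
# The first record uses \r\n line endings, the second uses \n.
raw = "Site: North\r\nVisitors: 1200\r\n\r\nSite: South\nVisitors: 950\n"


def parse_records(text):
    """Turn blank-line-separated 'Key: value' blocks into a list of dicts."""
    records, current = [], {}
    # splitlines() accepts \n, \r\n and \r, which smooths over OS differences.
    for line in text.splitlines():
        if not line.strip():
            if current:
                records.append(current)
                current = {}
            continue
        key, _, value = line.partition(":")
        current[key.strip().lower()] = value.strip()
    if current:
        records.append(current)
    return records


records = parse_records(raw)
print(records)
```

Every rule in this parser is an assumption about one particular layout, which is exactly the per-document effort the description warns about.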
Scanned images are probably the least suitable form for most data, but both TIFF and JPEG-2000 formats can at least be marked with documentation of what is in the picture. Scanned image formats may be relevant for displaying information that wasn't born electronically -- for example, old church records and other archival material where a picture is better than nothing.
Today, much data is available in HTML format on various sites. This may well be sufficient if the data is very stable and limited in scope. In some cases, it could be preferable to have data in a form that's easier to download and manipulate. But it's cheap and easy to refer to a page on a website, so HTML can be a good starting point in the display of data.
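To give a sense of what manipulating HTML data involves, here is a hedged Python sketch that pulls the rows out of a simple HTML table with the standard library. The `TableExtractor` class and the sample page are invented for this example, and real pages are usually messier (nested tables, merged cells, markup inside cells).

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the text of each table row as a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None       # cells of the row being read, if any
        self._in_cell = False  # True while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()


# Hypothetical page fragment.
page = (
    "<table>"
    "<tr><th>Site</th><th>Visitors</th></tr>"
    "<tr><td>North</td><td>1200</td></tr>"
    "</table>"
)

extractor = TableExtractor()
extractor.feed(page)
extractor.close()
print(extractor.rows)
```

Compared with the one-call parsers available for JSON or CSV, this extra work is the reason a downloadable format can be preferable when the data is meant to be reused.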