An explanation of open data, why it's important and how you can do it yourself.
Though the debate about open data in government is an evolving one, it is indisputably here to stay -- it can be heard in both houses of Congress, in state legislatures, and in city halls around the nation.
Already, 39 states and 46 localities provide data sets to data.gov, the federal government's online open data repository. And 30 jurisdictions, including the federal government, have taken the additional step of institutionalizing their practices in formal open data policies.
Though the term "open data" is spoken of frequently — and has been since President Obama took office in 2009 — what it is and why it's important isn't always clear. That's understandable, perhaps, given that open data lacks a unified definition.
“People tend to conflate it with big data," said Emily Shaw, the national policy manager at the Sunlight Foundation, "and I think it’s useful to think about how it’s different from big data in the sense that open data is the idea that public information should be accessible to the public online."
Shaw said the foundation, a Washington, D.C., non-profit advocacy group promoting open and transparent government, believes the term open data can be applied to a variety of information created or collected by public entities. Among the benefits of open data are improved measurement of policies, better government efficiency, deeper analytical insights, greater citizen participation, and a boost to local companies by way of products and services that use government data (think civic apps and software programs).
“The way I personally think of open data," Shaw said, "is that it is a manifestation of the idea of open government."
For governments hoping to adopt open data in policy and in practice, simply making data available to the public isn’t enough to make that data useful. Open data, though straightforward in principle, requires a specific approach based on the agency or organization releasing it, the kind of data being released and, perhaps most importantly, its targeted audience.
According to the foundation’s California Open Data Handbook, published in collaboration with Stewards of Change Institute, a national group supporting innovation in human services, data must first be both “technically open” and “legally open.” The guide defines the terms in this way:
Technically open: [data] available in a machine-readable standard format, which means it can be retrieved and meaningfully processed by a computer application.
Legally open: [data] explicitly licensed in a way that permits commercial and non-commercial use and re-use without restrictions.
Technically open means that data is easily accessible to its intended audience. If the intended users are developers and programmers, Shaw said, the data should be presented within an application programming interface (API); if it’s intended for researchers in academia, data might be structured in a bulk download; and if it’s aimed at the average citizen, data should be available without requiring software purchases.
“Owning Microsoft Office shouldn’t be a requirement for accessing data,” Shaw said, referring to Microsoft Excel, the Office spreadsheet program whose files are a common format for data. When possible, open data should come packaged in a variety of file formats to reach as many potential users as possible.
Legally open means open data must be free for all users, or as the handbook puts it, should allow for “universal participation.” It can’t be isolated only to educational use, for example, or bar companies from putting it in products or be under a license that prevents one person from sharing it with another.
For those unnerved by unrestricted use, Shaw advised staying calm. The common liability fears are almost always unwarranted and rarely realized. “We don’t see it as a huge problem," she said. "I think it’s mostly about a fear of the ‘new.’ Governments are sometimes very risk averse.”
The ultimate advantage of unrestricted use? Interoperability, according to the foundation. “Interoperability denotes the ability of diverse systems and organizations to work together (inter-operate). In this case, it is the ability to interoperate — or intermix — different data sets,” the handbook says.
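As a rough illustration of what intermixing data sets can look like in practice, the Python sketch below joins two small tables on a shared identifier. The data sets, column names and values are invented for the example; they do not come from the handbook or from any real portal.

```python
import csv
import io

# Hypothetical data set 1: restaurant inspection scores (invented values).
INSPECTIONS_CSV = """business_id,inspection_score
1001,92
1002,78
1003,85
"""

# Hypothetical data set 2: business licenses published by a different agency.
LICENSES_CSV = """business_id,business_name,zip_code
1001,Corner Cafe,94103
1002,Main Street Deli,94110
"""


def read_rows(text):
    """Parse CSV text into a list of dictionaries, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def join_on(left, right, key):
    """Combine rows from two data sets that share the same key value."""
    right_by_key = {row[key]: row for row in right}
    joined = []
    for row in left:
        match = right_by_key.get(row[key])
        if match is not None:
            # Merge the columns of both rows into one record.
            joined.append({**row, **match})
    return joined


combined = join_on(read_rows(INSPECTIONS_CSV), read_rows(LICENSES_CSV), "business_id")
for row in combined:
    print(row["business_name"], row["zip_code"], row["inspection_score"])
```

The join only works because both tables use the same identifier column and neither license forbids combining them, which is the practical meaning of interoperability here.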
Though the term open data has been around since at least 2009, the concept is still new. The rules are moving, firming up, gestating. But being young doesn’t make it temporary -- the endorsement for open data is there.
In Washington, D.C., the call for open data is taking center stage in a bill that’s passed the U.S. House of Representatives and is on track for a Senate decision. The Digital Accountability and Transparency Act, or DATA Act, if approved, would publish all federal agency expenditures and would require that data be standardized and reviewed to prevent abuse. The House approved it 388 to 1, with 41 members not voting.
Open Datasets in States and Localities
The Sunlight Foundation has listed 30 states and localities with their own open data policies (with numerous others pending) -- see our interactive open data map for more details.
A key lobbying tactic for the Data Transparency Coalition (DTC) is to sell the value of open data and downplay the terminology. Hollister says it's usually easier to build support for open data among policy-makers and average citizens by telling them what it does. Jobs, transparency, open government, citizen engagement, data-driven decisions, an informed public -- these terms are other ways to express open data without saying it directly.
“We’re trying to persuade policy makers to replace disconnected documents with open data and for that purpose we just over-simplify it. We over-simplify it in two steps: number one, standardize [data]; and number two, publish it,” Hollister said. “Even small changes are going to make a big difference.”
The Sunlight Foundation's Shaw echoed Hollister’s stance.
“I’m not sure it’s necessary that the term itself becomes a huge rallying point," she said, "but I think what it enables does have broad public resonance."
Creating open data isn’t without its complexities. Several things need to happen before an open data project begins. A full endorsement from leadership is paramount. Fitting the project into existing workflows is another prerequisite. And allaying fears and misunderstandings is to be expected, as with any government project.
Need Some Open Data Guidance?
Visit our list of open data resources to determine how you can open up your data.
Not sure which format is best for the data you want to make public and available? Check out some of the common file formats used to share data.
Once those basics are in place, the handbook prescribes four steps: choosing a data set, attaching an open license, making the data available in a suitable format and ensuring it is discoverable.
1. Choose a Data Set
Choosing a data set can appear daunting, but it doesn’t have to be. Shaw said ample resources are available from the foundation and others on how to get started with this — see our list of open data resources for more information. In the case of selecting a data set, or sets, she referred to the foundation’s recently updated guidelines that urge identifying data sets based on goals and the demand from citizen feedback.
2. Attach an Open License
Open licenses dispel ambiguity and encourage use. However, they need to be proactive, and this means users should not be forced to request the information in order to use it — a common symptom of data accessed through the Freedom of Information Act. Tips for reference can be found at Opendefinition.org, a site that has a list of examples and links to open licenses that meet the definition of open use.
3. Format the Data to Your Audience
As previously stated, Shaw recommends tailoring the format of data to the audience, with the ideal being that data is packaged in formats that can be digested by all users: developers, civic hackers, department staff, researchers and citizens. This could mean putting it into APIs, spreadsheet documents, text and zip files, FTP servers and torrent networks (a way to download files from multiple sources at once). The file type and the download system both depend on the audience.
“Part of learning about what formats government should offer data in is to engage with the prospective users," Shaw said.
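To make the idea of serving several audiences from one source concrete, here is a minimal Python sketch that writes the same records out as CSV (for spreadsheet users and bulk downloads) and as JSON (the sort of payload an API might return). The records and field names are hypothetical, and a real agency would pull them from its own systems.

```python
import csv
import io
import json

# Hypothetical records used only for illustration.
RECORDS = [
    {"park_name": "Riverside Park", "acres": 12.5, "has_playground": True},
    {"park_name": "Hilltop Green", "acres": 3.0, "has_playground": False},
]


def to_csv(records):
    """Serialize records as CSV text for spreadsheet users and bulk downloads."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(records[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def to_json(records):
    """Serialize records as JSON text for developers and applications."""
    return json.dumps(records, indent=2)


csv_text = to_csv(RECORDS)
json_text = to_json(RECORDS)
print(csv_text)
print(json_text)
```

Both outputs come from the same list of records, so adding another format later means adding one more small function rather than maintaining a separate copy of the data.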
4. Make it Discoverable
If open data is strewn across multiple download links and wedged into various nooks and crannies of a website, it probably won't be found. Shaw recommends a centralized hub that acts as a one-stop shop for all open data downloads. In many jurisdictions, these Web pages and websites have been called “portals”; they are the online repositories for a jurisdiction’s open data publishing.
“It is important for thinking about how people can become aware of what their governments hold. If the government doesn’t make it easy for people to know what kinds of data is publicly available on the website, it doesn’t matter what format it’s in,” Shaw said. She pointed to public participation, a recurring theme in open data development, as something to build into the process to improve accessibility.
Examples of portals can be found in numerous cities across the U.S., such as San Francisco, New York, Los Angeles, Chicago and Sacramento, Calif.
California Open Data Handbook: This is a guide published by the Stewards of Change Institute that explains what open data is, why it's important and the technical nuances behind opening it up.
Sunlight Foundation, Open Data Guidelines: The Sunlight Foundation is a well-known open data advocate. These guidelines offer advice and best practices for governments that want to start an open data project.
Open Data Institute: In Europe and across the globe the ODI is making waves by linking open data with businesses and organizations. The organization offers tools, tips and classes on open data use in addition to certification of open data types.
The Data Transparency Coalition: A transparency lobbying group that has been working with legislators in Washington, D.C., and has a website explaining and monitoring the issues around open data.
Open Data Definition: Want examples of "open" licenses that can be added to your data? This site has a collection of licenses for reference and use.
Below is a selection of file formats taken from the California Open Data Handbook. Each has a brief description for a quick reference.
JSON is a simple file format that is very easy for any programming language to read. Its simplicity means that it is generally easier for computers to process than other formats.
XML is a widely used format for data exchange because it lets users maintain data structure, and allows developers to include documentation without interfering with reading of the data.
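For a sense of how that plays out, the short Python sketch below parses a small XML document that keeps its nested structure and carries a documentation comment. The element names, attributes and figures are invented for illustration. Python's standard parser skips the comment by default, so the note does not get in the way of reading the values.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document; element names and amounts are invented.
XML_TEXT = """<budget fiscal_year="2014">
  <!-- Amounts are in U.S. dollars. -->
  <department name="Parks">
    <line_item category="Maintenance">125000</line_item>
    <line_item category="Programs">40000</line_item>
  </department>
  <department name="Libraries">
    <line_item category="Books">60000</line_item>
  </department>
</budget>
"""

root = ET.fromstring(XML_TEXT)

# Walk the structure: each department contains its own line items.
totals = {}
for department in root.findall("department"):
    amounts = [int(item.text) for item in department.findall("line_item")]
    totals[department.get("name")] = sum(amounts)

print(root.get("fiscal_year"), totals)
```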
RDF makes it possible to represent data in a form that makes it easier to combine data from multiple sources. RDF data can be stored in XML and JSON, among other serializations. RDF encourages the use of URLs as identifiers, which provides a convenient way to directly interconnect existing open data initiatives on the Web. Use of RDF is not widespread, but it has been a trend among open government initiatives.
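The sketch below imitates the RDF data model in plain Python, without a real RDF library, to show why URL identifiers make combining sources easier. Each statement is a (subject, predicate, object) triple, and every URL and value is a made-up example under example.org, not a real vocabulary.

```python
# All URLs below are hypothetical identifiers, used only for illustration.
CITY = "http://example.org/id/city/springfield"
POPULATION = "http://example.org/def/population"
PARK_COUNT = "http://example.org/def/parkCount"

# Two sources publish statements about the same city, identified by the same URL.
census_triples = {(CITY, POPULATION, 116000)}
parks_triples = {(CITY, PARK_COUNT, 42)}

# Because both sources share the identifier, merging is a simple set union.
merged = census_triples | parks_triples


def describe(triples, subject):
    """Collect every predicate and value recorded for one subject."""
    return {predicate: obj for s, predicate, obj in triples if s == subject}


print(describe(merged, CITY))
```

No column matching or renaming is needed here: the shared URL is what ties the two sources together.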
Many agencies have information in spreadsheet formats like Microsoft Excel. This data often can be used immediately with the correct descriptions of what the different columns mean. But spreadsheets may also contain macros and formulas, which can be more cumbersome to handle.
Comma Separated Files
CSV files are compact and thus suitable for transferring large sets of data with the same structure. However, the format is so spartan that data are often useless without documentation, since it can be almost impossible to guess the significance of the different columns. Furthermore, it is essential that the structure of the file is respected.
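One way to address both points is to publish a small data dictionary alongside the file and check each row against it. The Python sketch below assumes a hypothetical air-quality file. The column names, descriptions and readings are all invented, and one row is deliberately missing a field.

```python
import csv
import io

# Hypothetical data dictionary documenting what each column means.
DATA_DICTIONARY = {
    "stn_id": "Station identifier",
    "obs_dt": "Observation date (YYYY-MM-DD)",
    "pm25": "Fine particulate matter, micrograms per cubic meter",
}

# Invented sample file; the second data row is missing its last field.
CSV_TEXT = """stn_id,obs_dt,pm25
A1,2014-03-01,12.4
A2,2014-03-01
A3,2014-03-01,9.8
"""


def load_checked(text, dictionary):
    """Return (valid_rows, problems), checking structure against the dictionary."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    if header != list(dictionary):
        raise ValueError(f"Unexpected columns: {header}")
    valid_rows, problems = [], []
    # Data rows start on line 2 because line 1 is the header.
    for line_number, fields in enumerate(reader, start=2):
        if len(fields) != len(header):
            problems.append(
                f"line {line_number}: expected {len(header)} fields, got {len(fields)}"
            )
            continue
        valid_rows.append(dict(zip(header, fields)))
    return valid_rows, problems


rows, problems = load_checked(CSV_TEXT, DATA_DICTIONARY)
for column, meaning in DATA_DICTIONARY.items():
    print(f"{column}: {meaning}")
print(rows)
print(problems)
```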
Classic documents in formats like Word, ODF, OOXML, or PDF may be sufficient to show certain kinds of data -- for example, relatively stable mailing lists or equivalent. This approach may be inexpensive because most data are born in this format. But the format offers no support for keeping the structure consistent, which often makes it difficult to enter data by automated means. Generally, publishing data in a word processing format is not recommended if the data exists in a different format.
Plain text documents (.txt) are easy for computers to read. However, they generally lack structural metadata inside the document, meaning that developers will need to create a parser that can interpret each document as it appears. Some problems can also arise when plain text files are moved between operating systems.
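As a minimal sketch of what such a parser might look like, the Python example below reads a made-up plain text report laid out as "Key: value" lines separated by blank lines. The layout and contents are assumptions for the example; every real text file needs a parser written for its own conventions. The sample uses Windows-style line endings to show one common cross-platform snag.

```python
# Hypothetical plain text report with Windows-style (CRLF) line endings.
RAW_TEXT = (
    "Name: Riverside Park\r\nAcres: 12.5\r\n"
    "\r\n"
    "Name: Hilltop Green\r\nAcres: 3.0\r\n"
)


def parse_records(text):
    """Parse blank-line-separated 'Key: value' blocks into dictionaries."""
    records, current = [], {}
    # splitlines() accepts \n, \r\n and \r, so the same code works on files
    # created under different operating systems.
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current record.
            if current:
                records.append(current)
                current = {}
            continue
        key, _, value = line.partition(":")
        current[key.strip()] = value.strip()
    if current:
        records.append(current)
    return records


records = parse_records(RAW_TEXT)
print(records)
```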
Scanned images are probably the least suitable form for most data, but both TIFF and JPEG-2000 formats can at least be marked with documentation of what is in the picture. Scanned image formats may be relevant for displaying information that wasn't born electronically -- for example, old church records and other archival material where a picture is better than nothing.
Today, much data is available in HTML format on various sites. This may well be sufficient if the data is very stable and limited in scope. In some cases, it could be preferable to have data in a form that's easier to download and manipulate. But it's cheap and easy to refer to a page on a website, so HTML can be a good starting point in the display of data.