How to Open Data While Protecting Privacy

San Francisco's Open Data Release Toolkit offers detailed guidance on how departments can evaluate whether, and how, sensitive data sets should be made public.

October 05, 2017

Blake Valenta, Data-Smart City Solutions

This story was originally published by Data-Smart City Solutions.

The promise of open data is alluring: make civic data widely available and governments and the communities they serve can benefit from transparency, new perspectives and approaches. Cities have published thousands of datasets to make good on this promise. Go to any open data portal, and you will find data on buildings, businesses, and budgets. What you won’t find as much is data about people, and for good reason: real privacy concerns limit its release.

Realizing the full power of open data requires wrestling with privacy issues. While data on buildings, businesses, and budgets is important, knowing how government affects people is equally or more important. Who is getting served or not getting served? What part of the population are they? What are their shared characteristics? Many questions people want answered are personal.

City departments are aware of this demand. They field daily requests for information, often generating time-consuming custom reports that aggregate the underlying data so that private information is obscured. As a result, departments frequently express a desire to make the underlying non-private data generally available to allow anyone to answer their own questions. Departments turn to law and regulation for guidance, but the vast majority of the privacy laws say what to do (de-identify personal data), not how to do it. A recent privacy report notes privacy laws can be "based on outdated PII [personally identifiable information] concepts which may give a false sense of security." This confluence of factors can result in data with privacy implications being siloed in individual departments and excluded from open data programs.

Joy Bonaguro, the Chief Data Officer of San Francisco, knew this was a key barrier to fulfilling the promise of open data in the city. She looked at the resources available, which focused on theory to the exclusion of practicality, and found them poorly suited to departmental needs. The toolkits available pointed out potential areas of focus but offered little guidance on what to do after you 'focused.' To tackle this issue and develop a solution for San Francisco’s needs, Bonaguro turned to Erica Finkle, a program manager working on data policy and strategy in the city’s data office, DataSF.

To start the process of developing a new toolkit, Finkle conducted extensive research and held interviews with key department personnel and privacy experts. She also participated in a forum on open data and privacy hosted by Susan Crawford of Harvard’s Berkman Klein Center for Internet and Society. This step of the research surfaced two key points:

A legal-based privacy policy provides a false sense of security. The current state of academic privacy discussion has moved away from a legal-based privacy framework to a risk mitigation one. The law by nature lags behind technological advancement, so following the letter of the law could itself create privacy risk. Risk mitigation, a more nuanced approach, treats the law as one of many factors to consider, alongside technological and statistical advancements that demand continual monitoring.
It’s difficult to convene department stakeholders around theory. Department contacts Finkle interviewed expressed skepticism about being able to meaningfully convene the 'right people' in the room if the discussion was purely based on theory. Departments had little appetite for meetings without a business process deliverable; they wanted to know how their process should change in addition to the theory behind why they should change.

This posed a challenge for Finkle. There was a clear need to update the citywide understanding of privacy issues, which would involve theoretical discussions. At the same time, departments wanted deliverables. With this tension in mind, Finkle began drafting the Open Data Release Toolkit.

She designed and structured the toolkit to:

Convey the best practices framework of risk mitigation.
Operationalize this framework via DataSF’s open data publishing process.
Ensure continued engagement with current best practices by including triggers for periodic review.

The result was version 1.0 of the Open Data Release Toolkit, a document containing a detailed process for how a department can evaluate whether, and how, a sensitive dataset should be released on the city’s open data portal. Despite its name, the toolkit’s process may ultimately indicate that a particular dataset should not be released in any form. In these cases, the dataset might still be made available to researchers, academics, or other departments upon request with a set of privacy and/or security controls in place.

The Open Data Release Toolkit has proved popular with departments. Two real-world use cases speak to its ability to accomplish its goal, showing how departments are using the toolkit to release new data and, along the way, gaining a deeper understanding of a risk-mitigation approach to privacy.

Case Study #1: SF Public Library - Privacy Pros Find Value

If there is one department attuned to the benefits of open data and importance of user privacy, it is the library. As Kate Wingerson, Digital Strategist at the San Francisco Public Library (SFPL), noted, "read our mission statement — the library is about free and equal access." Libraries were the first open data portals; from this historical orientation towards accessibility and openness comes a keen understanding of the importance of privacy. Matthew Davis, Digitization and Collections Manager, noted that the American Library Association's code of ethics and SFPL's own privacy policy put first and foremost the need to balance open access to information with library patrons’ expectations of privacy as to what they access.

One of the primary metrics that often gets requested of the library is usage data (who is using the library and how much), particularly by age categories and neighborhood or supervisor district. If the data were published, interested parties could quickly look at library usage by a particular demographic segment. Wingerson also highlighted benefits even more important than alleviating repetitive reporting work. First, "people use data in ways [the library] cannot anticipate," she explained. There is great potential for new insights when people with different perspectives ask different questions of the data. The second reason, a bit of a passion project for Wingerson and Davis, is that open access to such data encourages the use of library usage data as a social indicator. They want researchers to look at how library usage can speak to the social health of a city or neighborhood.

The toolkit resonated with the library’s privacy orientation. Even before using the toolkit, they de-identified their data by converting birthdates to census-like age ranges and removing addresses and names. However, walking through the toolkit still surfaced new identification issues. For example, releasing both Supervisorial District and neighborhood could create small enough slices of geographies to increase the risk of re-identification. Using the structured process of the toolkit coupled with consultation with Finkle, they opted to only include one geographic boundary. They have since been pleased with the final dataset and resulting engagement.
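The library's approach can be illustrated with a minimal Python sketch. It is not the toolkit's own procedure or SFPL's actual code: the records, field names, age bands, and the minimum group size of 2 are all hypothetical, chosen only to show how birthdates can be generalized into ranges, how names and addresses can be dropped, and how adding a second geographic field can produce very small groups.

```python
from collections import Counter
from datetime import date

# Hypothetical patron records; every name, field, and value is invented.
PATRONS = [
    {"name": "Patron A", "address": "1 Example St", "birthdate": date(1950, 3, 14),
     "district": 1, "neighborhood": "Inner Richmond"},
    {"name": "Patron B", "address": "2 Example St", "birthdate": date(1951, 7, 2),
     "district": 1, "neighborhood": "Inner Richmond"},
    {"name": "Patron C", "address": "3 Example St", "birthdate": date(1949, 11, 30),
     "district": 1, "neighborhood": "Outer Richmond"},
    {"name": "Patron D", "address": "4 Example St", "birthdate": date(1990, 1, 20),
     "district": 2, "neighborhood": "Marina"},
    {"name": "Patron E", "address": "5 Example St", "birthdate": date(1992, 5, 5),
     "district": 2, "neighborhood": "Marina"},
]

# Assumed age bands (inclusive); anything above the last band becomes "75+".
AGE_BANDS = [(0, 17), (18, 24), (25, 34), (35, 44), (45, 54), (55, 64), (65, 74)]

AS_OF = date(2017, 10, 5)  # reference date used to compute ages


def age_on(birthdate, as_of):
    """Return age in whole years on the reference date."""
    had_birthday = (as_of.month, as_of.day) >= (birthdate.month, birthdate.day)
    return as_of.year - birthdate.year - (0 if had_birthday else 1)


def age_band(age):
    """Map an exact age to a coarse range label."""
    for low, high in AGE_BANDS:
        if low <= age <= high:
            return f"{low}-{high}"
    return "75+"


def deidentify(record, as_of):
    """Drop name and address, and replace the birthdate with an age range."""
    return {
        "age_range": age_band(age_on(record["birthdate"], as_of)),
        "district": record["district"],
        "neighborhood": record["neighborhood"],
    }


def small_cells(rows, keys, threshold):
    """Return the combinations of `keys` shared by fewer than `threshold` rows."""
    counts = Counter(tuple(row[k] for k in keys) for row in rows)
    return {cell: n for cell, n in counts.items() if n < threshold}


rows = [deidentify(p, AS_OF) for p in PATRONS]

# One geographic field: count rows per (age range, district).
one_geo = small_cells(rows, ("age_range", "district"), threshold=2)

# Two geographic fields: the same rows are split into finer slices.
two_geo = small_cells(rows, ("age_range", "district", "neighborhood"), threshold=2)

print("Small groups with district only:", one_geo)
print("Small groups with district and neighborhood:", two_geo)
```

In this toy data, every group has at least two people when only the district is published, but adding the neighborhood isolates a single person. That is the kind of small slice the library avoided by keeping one geographic boundary. A real assessment would pick its threshold and fields based on the dataset and the guidance in use.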

Case Study #2: Mayor’s Office of Housing and Community Development - Toolkit as Learning Tool

San Francisco Mayor's Office of Housing and Community Development (MOHCD) is responsible for providing financing for the development, rehabilitation, and purchase of affordable housing as well as partnering with the community to strengthen the social, physical and economic infrastructure of San Francisco's low-income neighborhoods and communities in need. Housing and community development programs and services are among the most important and sensitive issues that face San Francisco. MOHCD receives countless requests for information on the housing projects underway and on the services it funds.

Principally responsible for responding to these requests is Charles MacNulty, Program Development and Data Specialist, who functions as a department analyst, data coordinator, and shaper of internal data standards. His core function is helping to answer internal and external questions about the department's work. MacNulty was one of the early adopters of open data; he saw the value in harnessing the open data platform as a way to streamline standardization and alleviate reporting effort. MOHCD's dataset on its housing projects is among the most popular datasets on the portal. For datasets about building projects, there are few or no privacy concerns.

However, with affordable housing and community development, many people are interested in who is being served by these programs. In San Francisco, these requests brought up a host of privacy concerns and led to an initial default 'do not publish' position for such data.

MacNulty paved the path away from this default position by using the Open Data Release Toolkit. MacNulty appreciated that the toolkit provided "a methodology for understanding the role of risk" but "also made a set of recommendations to reduce risk." This made the toolkit actionable and goal-driven with a final deliverable of a newly published dataset.

This focus on deliverables helped MacNulty to convene a meeting of all the major project areas of MOHCD: program heads, homeownership and below market rate program staff, multifamily housing development and asset management staff, and key members from the community development team. For the meeting, MacNulty had all the representatives work through the toolkit to assess the risk in publishing MOHCD's Community Development datasets, which contain information on individual community development projects as well as the clients served by them. The conversation was challenging but productive. MacNulty felt the toolkit did a great service by explaining through practice why a department cannot simply have a single policy for all datasets. As they walked through the toolkit, the team realized that the risk assessment they had done for the Community Development datasets would have to be repeated for each additional departmental dataset.

While MacNulty feared they might find the process burdensome, "it actually facilitated deep, meaningful discussion for each item and enriched the department's understanding of the privacy issue," he said. The ordered nature of the toolkit helped keep the conversation from meandering. By the end of the meeting, they had agreed to publish the Community Development datasets using de-identification. MacNulty felt the participants left with a better understanding of "what it means to publish data publicly and why we publish data."

Just the Beginning

The Open Data Release Toolkit is now up to version 1.2 and will continue to be updated as the privacy landscape evolves. It’s also been expanded to include a Security Edition. Finkle’s efforts with the Open Data Release Toolkit have been noted in the wider privacy world, and other cities including Seattle and Durham have adopted the toolkit for their own work. Many of the challenges present in sharing data with the public via open data also occur in interdepartmental data sharing. Finkle translated numerous lessons from researching the Open Data Release Toolkit to the ShareSF program, an initiative to facilitate interdepartmental data sharing. To date, Finkle is pleased with the citywide response to the toolkit and its ability to foster conversations that are important not only for the publishing of open data but also for the future of efficient government.