In this Q&A, Chicago Chief Data Officer Brett Goldstein and his staff go in-depth about intercity collaboration, advanced data mining, standardized definitions and data cataloging.
Chicago’s election of Mayor Rahm Emanuel brought in new city staff dedicated to making city government more open and transparent. Brett Goldstein was brought on 10 months ago as the city’s chief data officer and, as part of Emanuel’s platform, is responsible for overseeing the city’s open data initiatives.
Chicago considers itself a leader in open data projects. Since Emanuel took office, the city has worked to make Chicago data more accessible to the public, for example through websites like Wasmycartowed.com and ChicagoBudget.org. More recently, the city, together with Cook County and the state of Illinois, developed a “convergence cloud” so public data can be accessed across the three levels of government.
Goldstein and Danielle DuMerer, a project manager for the Chicago Department of Innovation and Technology, discuss Chicago’s open data and what it takes to federate data across multiple jurisdictions.
Brett Goldstein: It falls into three worlds: One is the open data piece. Open data — transparency — that’s part of my portfolio and that’s what Danielle and I collaborate on quite a bit. Mayor [Rahm] Emanuel, in his campaign, spoke quite a bit about transparency coming into the new administration. He was very clear we were going to deliver on that. And during the first 10 months, that’s been a focus.
And we have CityofChicago.org along with the federated MetroChicago site, where we are piping that data. It’s telling everyone out there — from the media, to residents, researchers and engineers — these [data sets] you, historically, might have “FOIA-ed” [in an open records request under the Freedom of Information Act] and now they’re there. They’re automatically uploaded and they’re accessible. So there’s the whole open data piece.
The other two areas are a little more subtle. The second piece is bringing data in a quantitative and empirical approach to government. And my position sits in the policy team in the Mayor’s Office. As we look at various policy issues, my job is to make sure we’re incorporating an empirical methodology as to how we’re looking at that issue. What is the quantitative rationale? The mayor is very data driven, and he wants to have the supporting evidence. And that also propagates through the departments and through the different agencies. How do we make sure we’re raising the bar on how we’re using data to make the right decisions? And that’s within a department, but it’s also interdisciplinary.
I think we all know that many of the issues facing government are cross-departmental, and historically, we might not have been very good at doing that. Things would have stayed within siloes rather than a collaboration — that’s one of the pieces we’re working through.
The third area, and this is where I have quite a bit of passion, is in the world of data mining, data modeling and predictive analytics. It’s having governments think about, “How do we prevent rather than react?” In the private sector, we’ve seen people use predictive analytics every day. I was with Open Table for some time. (Editor’s Note: Goldstein formerly was the IT director for the company, whose popular online platform takes restaurant reservations.) And how do we in the Web sector use data to make better decisions? Well, we should be doing that in government. And part of my role now is to say, “How can we use those techniques to do government better and do it smarter?”
BG: We have more data than I thought we did. I was with the police department before I came over to the mayor’s office. So I knew we had quite a bit of data, but I didn’t really get at the depth of it. As described before, it’s quite siloed. So there’s really an opportunity to get at the mayor’s vision of an interdisciplinary, interdepartmental way of doing business, but you need to deal with the siloes and find a way to make a holistic data platform.
BG: I think integration is important — creating a holistic platform with government is also important. But to make things like this sustainable, you need to ensure you’re not a one-hit wonder. And something I’m really happy about: a couple weeks ago, Chicago City Council passed an ordinance accepting a grant from the MacArthur Foundation to work with Chapin Hall [a research and policy center at the University of Chicago] and document all of our data. That sounds remarkably mundane to 99.99 percent of the population. However, documentation of data throughout the enterprise: where it is, what it does, what is its meaning – the comprehensive metadata, and then applying governance to it is what makes it sustainable. That’s a heavy lift. Consider the amount of data you have through municipal government. Getting at it and doing it right through that sort of sustainable technique is where we can really add a lot of value here.
BG: There are quite a few. The immediate one is preventing duplication of effort. If you have two departments doing almost the same thing, there’s an opportunity to reduce that workflow — fewer people doing it, saving money, saving people’s time. Two, sometimes finding the data is like [searching for] a needle in a haystack. Schemas can be cryptic. If I’m looking for a certain piece of information and the field name is absolutely ridiculous, I am never going to find it.
Let’s say I’m a researcher and I’m working on a public safety problem. Public safety is not just about the police. It’s about education, it’s about economics, it’s about the community. It’s about public health. Say I need a piece of public health data — what is the governance on the data? And does it even exist? A fast, searchable, documented way of doing that gets at it. But how does it improve the public? And this ties to the data portal. As we go through this data and document it, we constantly ask ourselves a question: “Is this something that should be public?” If it is, then Danielle and I have the opportunity to send it directly to the data portal.
And you get the sustainability piece. We use ETLs [extract, transform, load] on our data so that it fires out on its own. It’s not about her, it’s not about me, it’s not about anyone else on the team. It’s just self-populating. When you have a front-facing document that talks about what is the metadata, it allows the public to search and find what they want. Or the media can search and find what they want, absent of FOIA.
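As an illustration of the searchable, documented catalog Goldstein describes, here is a minimal Python sketch. It is not the city's system: the dataset titles, department names, cryptic field names and the `search_catalog` helper are all hypothetical, chosen only to show how plain-language descriptions make an obscure column findable.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class FieldDoc:
    name: str         # raw column name, which may be cryptic
    description: str  # plain-language meaning of the column


@dataclass
class DatasetDoc:
    title: str
    department: str
    public: bool      # answer to "should this be public?"
    fields: List[FieldDoc] = field(default_factory=list)


def search_catalog(catalog: List[DatasetDoc], term: str) -> List[Tuple[str, str]]:
    """Return (dataset title, field name) pairs whose name or description mentions term."""
    term = term.lower()
    hits = []
    for dataset in catalog:
        for doc in dataset.fields:
            if term in doc.name.lower() or term in doc.description.lower():
                hits.append((dataset.title, doc.name))
    return hits


# Hypothetical catalog entries, for illustration only.
CATALOG = [
    DatasetDoc("Towed Vehicles", "Streets and Sanitation", True,
               [FieldDoc("TOW_DT", "Date the vehicle was towed"),
                FieldDoc("PLT_NO", "License plate number")]),
    DatasetDoc("Clinic Visits", "Public Health", False,
               [FieldDoc("VST_DT", "Date of the clinic visit")]),
]

# A search on the plain-language phrase finds the cryptic column.
matches = search_catalog(CATALOG, "license plate")
```

The `public` flag stands in for the question the team asks of each documented data set; a real catalog would also need the governance details mentioned above.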
BG: Well, certainly projects cost money. We’re using a turnkey solution in this case — Socrata. There is a cost to that. Danielle’s time has a soft cost, I have a soft cost. But as in any given system, you always make priorities. And those priorities in some cases are dynamic. For Mayor Emanuel, transparency’s a top priority. This is something that we started working on right at the beginning of the new administration. The mayor is clear on the priorities, so our job is to execute that.
Danielle DuMerer: It was a very positive, collaborative effort with the state and Cook County. One of the things we had to do to bring together the data was that each of our sites has different categories that the data is associated with. So we worked together with Socrata to develop categories that made sense across all of the agencies, and then we mapped our catalogs to those categories. And it actually was a much easier process than you might have expected. But that was really the most challenging piece. It was a relatively straightforward task to bring it together.
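The category mapping DuMerer describes can be sketched roughly as follows. The category names, source labels and function names are assumptions made for illustration; the interview does not say what the actual shared categories were or how Socrata stores them.

```python
# Hypothetical mapping from each site's local categories to shared ones.
SHARED_CATEGORIES = {
    "city": {"Public Safety": "Safety", "Health & Human Services": "Health"},
    "county": {"Courts": "Safety", "Healthcare": "Health"},
    "state": {"Public Health": "Health"},
}


def to_shared_category(source, local_category):
    """Translate a site's local category, falling back to 'Uncategorized'."""
    return SHARED_CATEGORIES.get(source, {}).get(local_category, "Uncategorized")


def federate(catalogs):
    """Group (source, title) pairs from every catalog under the shared categories."""
    merged = {}
    for source, datasets in catalogs.items():
        for title, local_category in datasets:
            shared = to_shared_category(source, local_category)
            merged.setdefault(shared, []).append((source, title))
    return merged


# Hypothetical catalogs: each entry is (dataset title, local category).
catalogs = {
    "city": [("Crimes", "Public Safety")],
    "county": [("Hospitals", "Healthcare")],
    "state": [("Clinics", "Public Health"), ("Roads", "Transportation")],
}
merged = federate(catalogs)
```

Anything without an agreed mapping lands in "Uncategorized", which makes the remaining mapping work visible instead of silently dropping data sets.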
BG: Any sort of data federation can be built on whatever. This was easier to get out the door because everyone’s on the same platform. But in government right now, we should be requiring APIs [application programming interfaces]. And they should be compliant APIs. If you were using one platform and I’m using Socrata, we could build something on top of that if the APIs are available. It’s really a question of spending the time and effort to bring it together.
This specific case [in Chicago and Illinois] was easier because everyone was on the same platform. I’m much more worried — as Danielle was talking about — about categories for data or the schema naming convention. I think that’s a challenge as we look to broader, federated data initiatives. You have different states using different vernacular. If you look at public safety data, an “assault” in one place is a “battery” in another. So there’s certainly a challenge as to how do we create this common dictionary, which probably needs to be dynamic.
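One way to picture the "common dictionary" Goldstein mentions is a lookup that translates each jurisdiction's local term into a shared one and can be updated as agencies agree on new mappings. The sketch below is only that: the jurisdiction labels and the shared term `PHYSICAL_ATTACK` are invented, and real legal definitions differ in ways a simple lookup does not capture.

```python
class DataDictionary:
    """A small, updatable mapping from (jurisdiction, local term) to a shared term."""

    def __init__(self):
        self._mappings = {}

    def register(self, jurisdiction, local_term, shared_term):
        # Adding or replacing a mapping at runtime is what makes it "dynamic".
        self._mappings[(jurisdiction, local_term.lower())] = shared_term

    def translate(self, jurisdiction, local_term):
        # Returns None when no mapping has been agreed yet.
        return self._mappings.get((jurisdiction, local_term.lower()))


dictionary = DataDictionary()
# Hypothetical entries: two places using different words for the same event.
dictionary.register("jurisdiction_a", "assault", "PHYSICAL_ATTACK")
dictionary.register("jurisdiction_b", "battery", "PHYSICAL_ATTACK")

same_event = (
    dictionary.translate("jurisdiction_a", "Assault")
    == dictionary.translate("jurisdiction_b", "Battery")
)
```

Keying on the jurisdiction as well as the term matters because the same word can mean different things in different places.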
BG: I think it’s a reasonable goal. I think it’s exceptionally hard, and it’s not necessarily a technical challenge if you build it intelligently from the bottom-up. If you have sustainable documentation and ETLs that are well thought out, then agencies should start to collaborate and then you will again come to this idea of the data dictionary. And that’s where the heavy lifting will occur. If you have that multi-agency effort, and a commitment to having that language and understanding the mappings and the conversions, it’s doable. But it is a heavy lift.
BG: One of the things I found fascinating since coming into the administration is the amount of collaboration between different cities. I had always heard, “Oh yeah, people don’t get along with the different cities. Everyone is doing their own thing.” I talk to New York City probably every other day at this point. Last week, I was on the phone with the state of Maryland — Bryan Sivak is over there. (Editor’s Note: Sivak is Maryland’s chief innovation officer.) There’s a group of us now that are talking all the time about how we do this. I think there are some deep, technological issues that need to be thought out. How do you maintain a dynamic mapping system like that? How do you have governance? But these are all doable problems, and if the cities and states start collaborating, it’s something we can make happen. But it is a bunch of work.
BG: We have millions and millions of rows of data out there. Starting by saying we’re going to map everything to everyone is too ambitious. I would suggest a straw case — picking a couple of topical areas. It’s kind of like in MetroChicagoData, where we looked at health-care facilities. Pick a couple of areas, have a bunch of the big cities get together. And then do those as proof-of-concept exercises — more of an agile approach to it. We would go through it, learn the hard lessons over a few months, and then try to get an extensible process out of that.
DD: Wasmycartowed.com was developed by a local software developer, Scott Robin. He leveraged a couple of transactional data sets in our portal: towed and relocated vehicles. We run ETLs on those every 15 minutes and update the data in the portal, which allows Wasmycartowed to keep its data fresh. We take that approach as much as possible, automating the extract, transform and load (ETL) process to the portal, which means that developers have access to fresh data daily.
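The extract, transform and load steps can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the city's pipeline: the source field names, the date format and the in-memory `portal` dictionary standing in for the data portal are all hypothetical, and the 15-minute schedule would be handled by a separate scheduler.

```python
from datetime import datetime

# Hypothetical mapping from cryptic source columns to readable portal fields.
FIELD_NAMES = {"TOW_DT": "tow_date", "PLT_NO": "plate", "INV_ID": "inventory_id"}


def extract(source_rows):
    """Pull rows from the transactional system (here, any iterable of dicts)."""
    return list(source_rows)


def transform(rows):
    """Rename fields, normalize plates and convert dates to ISO format."""
    cleaned = []
    for row in rows:
        record = {FIELD_NAMES[k]: v for k, v in row.items() if k in FIELD_NAMES}
        record["plate"] = record["plate"].strip().upper()
        parsed = datetime.strptime(record["tow_date"], "%m/%d/%Y")
        record["tow_date"] = parsed.date().isoformat()
        cleaned.append(record)
    return cleaned


def load(portal, records):
    """Upsert by inventory_id so repeated runs do not create duplicates."""
    for record in records:
        portal[record["inventory_id"]] = record
    return len(records)


def run_etl(source_rows, portal):
    return load(portal, transform(extract(source_rows)))


portal = {}  # stand-in for the published data set
source = [{"TOW_DT": "03/05/2012", "PLT_NO": " abc123 ", "INV_ID": "1"}]
run_etl(source, portal)
run_etl(source, portal)  # a second scheduled run leaves one record, not two
```

Loading by key is the design choice that lets the job run unattended on a schedule: each run refreshes existing records instead of appending copies.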
BG: In Chicago, it’s really a two-way discussion. We put out data that we think is useful or that we found in our FOIA logs is commonly asked for. But at the same time, people approach us. They’ll hit me on Twitter or I’ll talk at a meetup or they’ll email Danielle and say, “We would love to do X.” And then that becomes a project for us. And, like all the other things, we want to do it sustainably. We hook into transactional systems, we update automatically and then they’re able to access it via the API and have a useful city service. Take Sweeparound.us [http://sweeparound.us/]: you go there, you put in your address and the night before the street sweeper comes, it sends you a text message and an email. That is very useful for me. I’ve been ticket-free since that website was put up.
BG: We have a lot more data we want to get out the door. Some of the things you’ll see in the near term include a lot more 311 data. And you’re going to see it on a couple of fronts. First, in the data portal, it will be accessible via API [application programming interface]. Two, we have Code for America fellows here, and they’re working on our Open 311 effort. So those are some of the near-term, big data things about the community that are coming out.
Down the line, though, we’re talking about advanced analytics. It’s fine to release data and have it available, but how you use it — to make us do our business better, perform better for our residents, perform it at less cost — is critical. And that goes back to the third tier of what I do, which is the predictive analytics and the data mining. I’m tasked with saying, “How can we deliver things in a smarter, better way?” If we’re able to find certain patterns — take 311, for example. Potentially there are leading indicators for certain activities. For example, there’s one part of the city where, when the alley lights go out then the garbage cans disappear. And it just happens in this one area. If you look at spatial, temporal data, you find that if you can take an early intervention in that location, you can prevent a service problem. We can apply algorithms to that data, and we’re able to get to the model where we’re not dealing with things in batch, but dealing with things based on spatial and focused attributes. So that’s where the mayor wants us to go. It all comes down to smarter, better, across the board.
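To make the idea of a leading indicator concrete, the sketch below measures, per area, how often one type of service request is followed by another within a set number of days. It is a simple co-occurrence count, not the city's model; the area labels, request type names, dates and 14-day window are all assumptions for illustration.

```python
from collections import defaultdict
from datetime import date, timedelta


def follow_rate(requests, lead_type, follow_type, window_days=14):
    """For each area, the share of lead_type requests that are followed by a
    follow_type request in the same area within window_days."""
    by_area = defaultdict(list)
    for area, day, kind in requests:
        by_area[area].append((day, kind))

    window = timedelta(days=window_days)
    rates = {}
    for area, events in by_area.items():
        leads = [d for d, k in events if k == lead_type]
        follows = [d for d, k in events if k == follow_type]
        if not leads:
            continue  # nothing to measure in this area
        followed = sum(
            1 for lead in leads
            if any(timedelta(0) < f - lead <= window for f in follows)
        )
        rates[area] = followed / len(leads)
    return rates


# Hypothetical requests: (area, date, request type).
requests = [
    ("A", date(2012, 1, 1), "alley_light_out"),
    ("A", date(2012, 1, 5), "garbage_cart_missing"),
    ("A", date(2012, 2, 1), "alley_light_out"),
    ("A", date(2012, 2, 10), "garbage_cart_missing"),
    ("B", date(2012, 1, 1), "alley_light_out"),
    ("B", date(2012, 3, 1), "garbage_cart_missing"),
]
rates = follow_rate(requests, "alley_light_out", "garbage_cart_missing")
```

An area with a rate well above the others would be a candidate for early intervention, although a count like this shows association only and would need far more data and validation before driving decisions.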