Finding the Rat and the Other Challenges of Predictive Modeling

Case studies from Washington D.C., Boston, New York City, and San Jose highlight the potential advancements and potential pitfalls of using predictive modeling to improve city services, and offer a few common considerations.


Jess Weaver, Data-Smart City Solutions

This story was originally published by Data-Smart City Solutions.

Ever heard of a “rat ride-along”? Neither had Peter Casey, Senior Data Scientist in The Lab @ DC and member of the Office of the Chief Technology Officer. But carpooling with D.C.’s rodent control team was all in a day’s work for a data scientist tasked with building and testing the technology to predict which of D.C.’s neighborhoods would be most likely to experience rodent outbreaks, a major public health concern.

In fact, “ride-alongs” are common practice among the technology teams that build predictive models that anticipate – and optimize response to – urban concerns, be it restaurant health code violations, fires, substandard buildings, or housing discrimination. Case studies from Washington D.C., Boston, New York City, and San Jose highlight the potential advancements and potential pitfalls of using predictive modeling to improve city services, and offer a few common considerations.

A PROBLEM IS OFTEN BIGGER (OR AT LEAST DIFFERENT) THAN REPORTED.

D.C., the country’s so-called “rat capital,” saw a 133 percent increase in requests for rodent abatement over the last few years, according to Casey. Unclear, however, was whether the drastic increase resulted from a genuine explosion in the city’s rodent population or merely greater use of D.C.’s 311 reporting system.

The city could have simply targeted rodent control’s response to neighborhoods with the highest concentration of reports. However, areas with high reporting levels often simply have more “squeaky wheels” (informed and concerned citizens more likely to report via 311), so Casey and his team at The Lab @ DC wanted to look more holistically at the city’s rodent problem.

In addition to the 311 data, the city considered a myriad of factors in developing its predictive model, including population density, zoning, building age, business licenses (in particular food vendors), and the presence of impervious spaces, which inhibit rats’ ability to burrow. Analysts initially added construction sites to the model, but through field testing (such as Casey’s “rat ride-along”), the team was able to eliminate the ultimately unproven assumption that construction exacerbates the city’s rodent issue. By analyzing the issue beyond the limited scope of citizen reports, the city was able to direct abatement services to the areas most in need.

For both D.C. and Boston, a diversity of data sources, validated through regular testing, contributed to the strength of their respective predictive models. Like the rodent abatement initiative in D.C., Boston’s efforts to predict the restaurants most likely to be in violation of the health code considered 311 reports as a critical source of data, but the team also took a more comprehensive approach to analyzing the problem. Analyzing Yelp reviews, as well as established indicators such as the last date of inspection and the history (and severity) of past infractions, ensured that Boston’s model considered both citizen-driven response and best practices in data-driven decision-making.

IF IT’S NOT PART OF THE WORKFLOW, IT’S NOT WORKING.

Innovation teams can build the most sophisticated analytics programs, noted Michelle Tat, Chief Data Scientist for the City of Boston, but if they’re not seamlessly integrated into an agency’s workflow, they’re essentially superfluous. With a clear understanding that the predictive model for the restaurant inspection initiative was initially informative rather than directive, Tat and her team created a weekly automated spreadsheet with a list of restaurants prioritized for high likelihood of violation. Her team, in fact, wasn’t sure that inspectors were using the model until it broke and they were inundated with a flurry of requests asking why the system hadn’t updated.
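A weekly prioritized list of this kind can be sketched in a few lines. The example below is only an illustration, not Boston’s actual pipeline: the function name build_priority_list, the restaurant names, and the scores are all hypothetical, and it assumes a model has already produced a violation likelihood for each restaurant.

```python
import csv
import io


def build_priority_list(scores, top_n=3):
    """Return CSV text listing the top_n restaurants by predicted likelihood.

    `scores` maps a restaurant name to a model-predicted probability of a
    health code violation (hypothetical values, assumed to come from a model).
    """
    # Sort from highest to lowest predicted likelihood and keep the top_n.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_n]

    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["rank", "restaurant", "predicted_likelihood"])
    for rank, (name, likelihood) in enumerate(ranked, start=1):
        writer.writerow([rank, name, f"{likelihood:.2f}"])
    return buffer.getvalue()


# Hypothetical scores for illustration only.
example_scores = {
    "Restaurant A": 0.12,
    "Restaurant B": 0.81,
    "Restaurant C": 0.47,
    "Restaurant D": 0.66,
}
weekly_csv = build_priority_list(example_scores, top_n=3)
print(weekly_csv)
```

Writing the output as a plain spreadsheet-style file, rather than a dashboard, mirrors the point Tat makes: the list has to arrive in a form inspectors already use.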

For Craig Campbell, Special Advisor to the Mayor’s Office of Data Analytics (MODA) in New York City, capacity building isn’t just about workflow integration within a single department. MODA also invests considerable resources in making its code available to all agencies through an open source library, an effort to bolster transparency and let agencies replicate solutions, with adaptations ranging from fire prevention projects to protecting tenants’ rights.

According to Tat, Campbell, and Lauren Haynes, former Associate Director of the Center for Data Science and Public Policy at the University of Chicago, sometimes the greatest value an innovation team can offer is simply gathering the appropriate data sources and saving departments time. 311 data, as Tat noted, isn’t one dataset in the city of Boston: it’s 1,100, and curating the suitable sets is a tremendous value-add for time- and resource-strapped service delivery departments. In San Jose, a major hurdle in developing a valuable building code violation model for multi-story units, noted Haynes, was simply curating the sources to generate a list of eligible buildings, which required liaising with multiple departments to secure the necessary data. Once in place, however, relevant departments can access data as needed, streamlining the processes of collection, analysis, and implementation.

CONSIDER COLLECTIVE PRIORITIES OVER INDIVIDUAL BAD ACTORS.

Data without context is dangerous, especially if the stakes are high. In the case of predicting building code infractions in San Jose, the likelihood of a site having a code infraction did not necessarily correlate with the severity of the infraction itself. A staircase without handrails, for instance, might have been uncovered before a broken smoke alarm; however, the latter clearly deserves priority attention.

As Haynes and her team were developing the predictive model, therefore, it was essential to use both in-person site visits and inspection notes to ensure that the program learned how to distinguish the severity of violations, enabling inspectors to respond to the most egregious first. Without that critical nuance, the model would have failed to prioritize violations correctly and, as a result, would have poorly optimized inspectors’ responses.
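The article does not describe how the San Jose model encoded severity. One simple way to illustrate the distinction between likelihood and severity is an expected-risk score that multiplies the two. The sketch below assumes a hypothetical 1-to-5 severity scale and made-up sites; the class and function names are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class Building:
    address: str
    violation_likelihood: float  # model-predicted probability, 0 to 1 (hypothetical)
    expected_severity: int       # assumed scale: 1 (minor) to 5 (life-safety)


def rank_by_risk(buildings):
    """Order buildings by likelihood multiplied by severity, highest first."""
    return sorted(
        buildings,
        key=lambda b: b.violation_likelihood * b.expected_severity,
        reverse=True,
    )


# Hypothetical sites for illustration only.
buildings = [
    Building("Site A", 0.90, 1),  # very likely, but minor (e.g. a missing handrail)
    Building("Site B", 0.40, 5),  # less likely, but severe (e.g. a broken smoke alarm)
    Building("Site C", 0.60, 2),
]

ranked = rank_by_risk(buildings)
print([b.address for b in ranked])
```

Ranking on likelihood alone would put Site A first; weighting by severity moves Site B to the top, which is the kind of reordering the passage describes.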

In New York City, a housing analytics program raised different questions around prioritization. MODA worked with the Commission on Human Rights (CHR) to predict the landlords most likely to discriminate against holders of subsidized housing vouchers. Analysts cross-referenced areas where landlords were most likely to engage in these discriminatory practices – often characterized by low crime rates, high-performing schools, and suspiciously low utilization of affordable housing vouchers in spite of available housing stock – with landlords who owned a significant amount of property. In terms of optimizing response for maximal impact, Campbell explained, rather than ticking through a list of bad actors, the city targeted the largest landlords, creating a chilling effect for other landlords across the city. Gathering the data was only half of the challenge in optimization, in other words. The other half was aligning responses to the worst cases with a communications strategy that would catalyze that chilling effect.
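The cross-referencing step can be pictured as a filter followed by a sort. The sketch below is not MODA’s code: the area labels, landlord names, unit counts, and the function prioritize_landlords are hypothetical, and it assumes the flagged areas have already been identified by a separate analysis.

```python
def prioritize_landlords(flagged_areas, holdings):
    """Total each landlord's units inside flagged areas, largest first.

    `holdings` is a list of (landlord, area, units) tuples; all values here
    are hypothetical.
    """
    totals = {}
    for landlord, area, units in holdings:
        # Only count property located in areas flagged by the earlier analysis.
        if area in flagged_areas:
            totals[landlord] = totals.get(landlord, 0) + units
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical inputs for illustration only.
flagged_areas = {"Area 1", "Area 3"}
holdings = [
    ("Landlord X", "Area 1", 120),
    ("Landlord Y", "Area 2", 400),  # large portfolio, but outside flagged areas
    ("Landlord X", "Area 3", 80),
    ("Landlord Z", "Area 3", 150),
]

targets = prioritize_landlords(flagged_areas, holdings)
print(targets)
```

Sorting by portfolio size inside flagged areas, rather than by a per-landlord score, reflects the choice Campbell describes: start with the largest landlords instead of working down a list of individual bad actors.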

Data-driven decision-making can uniquely optimize governments’ responses to complex public challenges – whether in public health, safety, or civil rights. Insights are only truly actionable, however, when the data is gleaned from the right variety of sources and aligned with an effective strategy for execution. And, as data innovation leaders from across the country agree, the efficacy of the models behind data-driven decision-making must be regularly tested and iterated. So urban technologists, better strap in… you may be headed for a rat ride-along.