Finding the Right Fit: How to Scope a Data Project

Before implementing analytics techniques to pull insights from data, governments need to identify those places where data can make the biggest difference.


Erica Pincus, Data-Smart City Solutions

This story was originally published by Data-Smart City Solutions.

Too often, teams dive into a data project only to realize a few months later that they are either solving the wrong problem or lack the data they need to reach an answer. Policymakers, researchers, and the media tend to emphasize how to execute data projects to produce results, but the prior phase, scoping the project, is equally important to its success.

Before implementing analytics techniques to pull insights from data, governments need to identify those places where data can make a difference. At the Civic Analytics Network's inaugural Summit on Data-Smart Government, Lauren Haynes, former Associate Director of the Center for Data Science and Public Policy (DSaPP) at the University of Chicago, introduced the concept of data scoping. In her session, "How to Scope Data Projects," Haynes explored how cities can identify areas ripe for analytics in order to maximize the value of data.

Haynes' team scopes about 60 data projects each year for their Data Science for Social Good (DSSG) Fellows, who partner with nonprofits and government organizations to complete data science projects. In assessing what makes a good DSSG project, Haynes emphasized the need to identify problems that are solvable, challenging, and socially impactful, and that involve both a capable, committed partner and relevant, available data. After accounting for these initial factors, DSSG then works through a project scoping methodology that involves assessing available data via a data maturity framework and surfacing actionable questions through the problem formulation practices described below.

DATA MATURITY FRAMEWORK

Haynes outlined three elements of data projects that project leaders must develop prior to implementation: problem definition, data readiness, and organizational readiness. DSaPP uses a Data Maturity Framework to inform these three areas. To define the problem, Haynes recommends that project leaders start with the organization's core business problems by asking, "What is keeping the Mayor up at night?"

Regarding readiness, the Data Maturity Framework offers scorecards that DSSG uses to assess and improve data, tech, and organizational readiness. According to the scorecard, an organization leading in data and tech readiness would store data that is machine readable, in a standard open format, available through an API, and collected in real time without errors or missing information. Regarding organizational readiness, a leading organization would have a culture of data (e.g., leadership would demand data to justify programmatic decisions), and would have policies in place for the use, transfer, and sharing of data internally and externally. To learn more, organizations can download the DSaPP team's framework materials and complete its data maturity framework survey.
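To make the "leading" criteria for data and tech readiness concrete, they can be treated as a simple checklist. The Python sketch below is illustrative only: the class name, field names, and all-or-nothing rule are assumptions made for this example, not the DSaPP scorecard itself.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class DataTechReadiness:
    """Hypothetical checklist of the 'leading' criteria named in the article."""
    machine_readable: bool
    standard_open_format: bool
    available_via_api: bool
    collected_in_real_time: bool
    free_of_errors_and_gaps: bool


def unmet_criteria(readiness: DataTechReadiness) -> List[str]:
    """Return the names of the criteria the organization does not yet meet."""
    return [f.name for f in fields(readiness) if not getattr(readiness, f.name)]


def is_leading(readiness: DataTechReadiness) -> bool:
    """Assumed rule for this sketch: 'leading' means every criterion is met."""
    return not unmet_criteria(readiness)


# Made-up example: clean, open data that is only exported in periodic batches.
example = DataTechReadiness(
    machine_readable=True,
    standard_open_format=True,
    available_via_api=False,
    collected_in_real_time=False,
    free_of_errors_and_gaps=True,
)
gaps = unmet_criteria(example)
```

Listing the unmet criteria, rather than producing a single score, keeps the focus on what the organization would need to improve before the project starts.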

DATA MATURITY FRAMEWORK EXAMPLE: LEAD POISONING PROJECT WITH THE CITY OF CHICAGO

Putting this framework into action, the DSaPP team worked with the City of Chicago to use data to predict homes at risk of lead poisoning. Traditionally, children do not receive blood-lead level tests until they have already been poisoned. Seeking to change this process, the team used Census data and information on blood-lead level tests, home lead inspections, and property assessments to create risk scores to predict whether a household with a child is likely to have a lead paint hazard. Armed with this information, the city can address the risk before serious health problems arise.

When the team first started, they set out to predict which homes were at risk of lead poisoning. But the fellows realized that this was not in fact the relevant question, because many homes may have lead but pose little risk of contact or poisoning. Children, however, are more likely to have direct contact with lead by licking windowsills or walls covered with lead paint. In the problem definition phase, the team therefore settled on predicting the risk of a child experiencing lead poisoning, so that the city could intervene before poisoning occurs.

Once the team defined the problem, they used the scorecards to assess the current state of the Chicago Department of Public Health (CDPH)'s data, tech, and organizational readiness to better inform their project scoping and ensure CDPH was sufficiently ready to support the project. For example, the team knew they would need individual-level data tied to specific homes, so it was important they established the readiness and availability of that data upfront.

The team encountered a few specific data challenges along the way. For example, the team planned to use birth certificates from CDPH to identify children, but they soon discovered that the city does not receive certificates from the state until two years after birth, at which point at-risk children would already be poisoned. The state was unable to deliver birth certificates faster, so CDPH decided to use records from the Special Supplemental Nutrition Program for Women, Infants, and Children (WIC) to identify which households had at-risk children. Children in WIC are often at higher risk of lead poisoning, so the updated data likely captured many of the highest-risk children in the city, sometimes before the child was even born, as some pregnant mothers apply for WIC services.

PROBLEM FORMULATION

Once an organization has defined a problem and established sufficient data and organizational readiness to scope a data project to address it, the organization can dive deeper into problem formulation by identifying its goals and the actions it can take to achieve those goals. Then, as Haynes explained, the organization can break those actions into smaller units consisting of specific questions and sub-actions, and identify relevant data sources it either possesses or needs to gather. Using that information, the organization can identify what analysis needs to be done. This is a more specific scoping process than the upfront problem definition process used to set broader context. While the problem definition phase focuses on the root cause of a problem (what "better" looks like, and what will happen if the organization does not do something differently), the problem formulation phase focuses on what actions the organization must take to actually do something differently.

To illustrate this process, Haynes referred to a politician's goal of attaining a certain number of votes to win election. That goal can be broken down into the actions of voter registration, persuasion, and turnout. Within each of those actions, the politician's team must answer the questions of "who?" and "how?": for example, who do we need to register to vote, and how can we effectively register those people? Data analysis can help answer those questions, which will ultimately enable the politician to achieve the overarching goal of attaining the number of votes needed for election.

After you have established a series of actions, Haynes recommends contemplating "For how many of those actions do you have data?" and "How many of those actions do you own? Over how many of them do you have control?" The answers help to narrow the scope of the project. For instance, in the election example, it would be relatively easy to scope a relevant data project, as there is a wealth of publicly available data related to voter turnout, voter registration, and best practices in persuasion, and there are direct steps a candidate could take to affect each of those actions. In other cases, a team may lack the data or ownership needed to affect all of the actions required to achieve its overall goal, in which case the team should narrow the project's scope to the actions for which it does have data and control.
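The two narrowing questions can be read as a filter over the list of candidate actions. The Python sketch below is one hedged way to picture that step. The Action class, its field names, and the second, partially scoped list are hypothetical, and requiring both data and ownership is this sketch's interpretation of Haynes' questions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Action:
    """Hypothetical record for one action identified during problem formulation."""
    name: str
    has_data: bool  # "For how many of those actions do you have data?"
    owned: bool     # "How many of those actions do you own?"


def in_scope(actions: List[Action]) -> List[str]:
    """Keep only the actions the team has data for and control over."""
    return [a.name for a in actions if a.has_data and a.owned]


# The election example from the article: data and control exist for every action.
election = [
    Action("voter registration", has_data=True, owned=True),
    Action("persuasion", has_data=True, owned=True),
    Action("turnout", has_data=True, owned=True),
]

# A made-up project in which only one action passes both questions.
partial = [
    Action("action A", has_data=True, owned=True),
    Action("action B", has_data=False, owned=True),
    Action("action C", has_data=True, owned=False),
]

election_scope = in_scope(election)
partial_scope = in_scope(partial)
```

In the election case nothing is filtered out, which matches the article's point that the project is relatively easy to scope. In the made-up case, the project narrows to a single action.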

It is important to note that this is not a static analytic process. On the contrary, Haynes explained that scoping is an iterative process, in which problem formulation leads to deeper understanding of the problem via data, which leads to further analysis and model validation, followed by deployment, which can change the initial problem or raise a new one.

PROBLEM FORMULATION EXAMPLE: SANERGY

To make the problem formulation process concrete, Haynes referenced Sanergy, a company based in Nairobi, Kenya that installs low-cost, high-quality toilets in impoverished areas and converts waste into organic fertilizer and other useful end products.

One of Sanergy's innovative toilets. Source: Flickr.com/DIVatUSAID

Collecting waste from the toilets involves intense manual labor. The toilets are located far from paved roads and in areas where the streets often change, so Sanergy employees walk from toilet to toilet, collecting and carting solid and liquid waste as they go. This intense labor limits how much waste Sanergy can collect, and thus its ability to bring sanitary waste disposal to as many people as possible. A team of DSSG Fellows worked with Sanergy to help them leverage data to solve their core business challenge: scaling with limited resources.

With the goal of scaling and the action of increased waste collection in mind, the team zeroed in on the question of how often waste needs to be collected from different toilets. First, the team considered potential constraints and levers for change, which led them to the questions of whether they could change the toilet waste collection routes, and whether they could hire more employees to collect the waste. In the end, the problem formulation centered on whether the toilet would fill that day (and hence be unusable). If the answer was no, then Sanergy could spend time emptying other toilets that would fill. If the answer was repeatedly yes, then it might be best to open another toilet. Using that approach, the DSSG team formulated the problem of whether they could predict which toilets were likely to need emptying, and how Sanergy could cluster the toilets to collect the waste as efficiently as possible.

Sanergy had data capturing every toilet they visited each day, and the total pounds of solid and liquid waste collected at each facility each day. DSSG was able to use that data to predict which toilets were likely to be full on a given day. In that way, the DSSG team found that there were more efficient ways the waste could be collected. For example, churches were less busy on days other than Sundays and schools were not busy during the summers, so the Sanergy team could commit fewer resources to collecting waste from churches during the week and schools during the summer.
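The idea of predicting full toilets from daily collection records can be illustrated with a deliberately naive baseline. The Python sketch below is not the DSSG team's model, which the article does not describe. It assumes that the average weight collected on a given weekday approximates how full a toilet gets on that weekday, and the records, toilet ID, and capacity threshold are all invented.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

# Invented records: (toilet_id, collection date, pounds of waste collected).
RECORDS = [
    ("church-1", date(2024, 1, 7), 40.0),   # Sunday
    ("church-1", date(2024, 1, 8), 5.0),    # Monday
    ("church-1", date(2024, 1, 14), 44.0),  # Sunday
    ("church-1", date(2024, 1, 15), 7.0),   # Monday
]

CAPACITY_LBS = 35.0  # assumed threshold at which a toilet counts as full


def weekday_averages(records):
    """Average pounds collected per (toilet, weekday); Monday is 0, Sunday is 6."""
    by_key = defaultdict(list)
    for toilet_id, day, pounds in records:
        by_key[(toilet_id, day.weekday())].append(pounds)
    return {key: mean(values) for key, values in by_key.items()}


def likely_full(toilet_id, day, averages, capacity=CAPACITY_LBS):
    """Guess whether a toilet will fill on `day` from its weekday average."""
    expected = averages.get((toilet_id, day.weekday()))
    if expected is None:
        return True  # no history for this weekday: schedule a visit to be safe
    return expected >= capacity


averages = weekday_averages(RECORDS)
full_on_sunday = likely_full("church-1", date(2024, 1, 21), averages)
full_on_monday = likely_full("church-1", date(2024, 1, 22), averages)
```

With these invented numbers, the church toilet is flagged for collection on Sunday but not on Monday, the kind of weekly pattern the article describes. A real model would need to account for much more, such as seasonality, location, and missed collections.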

Using such insights, the DSSG team found more efficient ways to collect the waste, allowing Sanergy to operate more toilets. With the findings from this well-scoped data project, Sanergy could operate as many as 2.5 times more toilets, serving 45,000 additional people with the same resources they had previously been expending.

CONCLUSION

Using a data maturity framework to assess and improve data and organizational readiness is an important first step in approaching data projects. By using such a framework combined with a problem formulation methodology rooted in identifying goals central to key business problems, actions to achieve those goals, related questions, and the data needed to address those questions, a city can scope its data projects more effectively. This will help the city to shift from using data in ways that simply reinforce current actions, toward using data to strengthen and improve upon the city's work to implement its mission. As Haynes said, quoting one of DSaPP's partners, "We are used to using data to justify funding decisions. Now we can use data to improve what we do."