Editor’s Note: Steve DuScheid is marketing director of Maponics, a developer of polygonal map data, such as neighborhood boundaries, ZIP codes and school attendance zones.
Every year, federal and state government agencies collect, analyze and publish an enormous amount of data — directly and through grants to universities and foundations. Researchers and policymakers often segment this data by geographic area to compare regions, analyze trends and draw conclusions. One challenge to effectively grouping data by geography is finding the right level of granularity suited to answering particular questions. Too often, researchers simply use what’s readily available or must be satisfied with the level of geography inherent in the processes or organizations used to collect it.
Some common geographic entities used to segment and analyze data include counties, ZIP codes and U.S. Census Bureau geography (e.g., block groups).
While there are real benefits to using these defined areas (including wide availability, broad geographic coverage, and the ability to link and compare multiple data sets), none of them truly reflects social and cultural boundaries at the local level. Therefore, they may not answer fundamental research questions or address key factors for policy decisions. ZIP codes and similar entities were defined to facilitate and administer government operations and services, and while some may take population characteristics into account, their borders aren’t meaningful to local citizens.
Standard geographic entities will always be important in how researchers analyze data and how policymakers draw conclusions. But with the availability of new geographic data sets and the growing volume of geotagged data, it’s now possible for researchers to consider questions in new ways that align data to the geographic areas most relevant to answering them.
Below are some of the pros and cons of using the standard geographic entities in research, along with some alternatives that offer new ways to look at data.
County. There are many data sets collected and managed at the county level and made available to federal, state and local government agencies. There are many reasons for this — not least of which is the established infrastructure in place within county governments. Also, data at the county level is manageable to work with because there are only about 3,100 counties in the U.S. But counties are far too large (averaging more than 3,000 square miles) and too varied in population (from as few as 45 to as many as 9 million people) to get at many local socio-economic questions. Population groups within counties are often too diverse for researchers to characterize behaviors or outcomes.
ZIP code. Zone Improvement Plan codes were created by the U.S. Post Office Department in 1963 to improve mail delivery service. ZIP codes are made up of carrier routes, which are also designed to optimize mail delivery. Researchers are drawn to ZIP codes for obvious reasons: they are essentially ubiquitous in databases and they can be easily linked to households and related demographics.
Because ZIP codes were so prevalent for data collection and aggregation, beginning with the 2000 Census, the U.S. Census Bureau compiled and released a new set of geographic areas called ZIP Code Tabulation Areas (ZCTAs) intended to align census-tabulated data to ZIP code areas.
While ZIP codes and ZCTAs are generally easy to use, the geographic areas that they represent are a function of process — not people. Other than knowing a ZIP code to address a piece of mail, people don’t use or relate to them and certainly don’t place any cultural significance on their boundaries.
Census. The primary way researchers organize and analyze data is by the geographic entities defined by the Census. This is because when it comes to demographics, almost all data — whether published directly by the Census or by private companies — originates from the core decennial Census dataset. In terms of small-area analysis, the following Census geographic entities are often used (along with the number of each entity as of the 2010 Census): blocks (11.1 million), block groups (220,000) and census tracts (65,000).
Census geography was developed primarily to facilitate, execute and tabulate the decennial census. As a result, it not only covers the entire U.S. and its territories but also is organized into a clean hierarchy, with larger areas (e.g., counties) composed of a set of smaller areas (census tracts). The Census boundaries also largely obey administrative entities, ensuring, for instance, that block groups don’t cross county lines. And while Census entities are designed to be relatively homogeneous with respect to their population characteristics, they are still derived through an administrative process and are not determined organically by the people who live in them. As a result, analysis performed strictly by these geographic units is limited in terms of how well it represents populations segmented according to locally defined boundaries.
When examining cultural and social trends at the local level, neighborhoods are typically the geographic areas that best reflect how local residents think about the places where they live, work and play. People don’t think about the area around them in terms of ZIP Codes or census tracts — in fact, very few people have any idea where these begin and end in the area immediately surrounding their homes and communities. But people can almost certainly identify and describe their neighborhood as well as the surrounding ones. This is, of course, because neighborhoods are social constructs that reflect the history, values and culture of the people who live in them.
In fact, research often cites statistics, characteristics and trends by neighborhood. But in reality, the delimiter used is almost always some kind of neighborhood surrogate, like a census tract. When true neighborhood boundaries are overlaid onto census tracts for the same area, it’s clear that the correspondence is far from one-to-one.
So, for researchers to adjust U.S. Census geography to conform to areas local citizens identify with, they would need to manually aggregate block groups or census tracts together to align with what people on the ground would consider true neighborhood boundaries. In other cases, census tracts would have to be split to accurately reflect true neighborhood boundaries. For research purposes, neighborhood boundaries would need to be determined and then redrawn. This may be possible at a very small scale but is generally not feasible for larger geographic areas due to the time and expertise needed. At the very least, researchers would need to somehow translate tract numbers to neighborhood names — no trivial task. It isn’t generally meaningful when illustrating a point to say something like, “… as we can see from the results in census tracts 36061006300 and 36061005600 …”
An argument can be made that in small areas, there won’t be a significant statistical difference between using census tracts or block groups compared to true neighborhoods. But it really depends on the area of study. And in many instances, alternate geography can be used to augment traditional methods. After all, looking at intractable problems and policy questions in new ways is the only way to come up with new solutions and ideas.
So how can data be tagged, aggregated and analyzed by neighborhood?
Nationwide Neighborhood Boundaries Data Set. In recent years, geographic data sets have been developed to map tens of thousands of neighborhoods across the U.S. and abroad. Neighborhoods are informal in nature and don’t necessarily follow administrative boundaries or physical features. And while not all local citizens would agree on the exact borders for any given neighborhood, multiple sources can be used to represent a consensus view of the boundaries.
Other Alternate Geography for Small Area Analysis. In addition to neighborhood data sets, there are other alternatives. While neighborhoods are a recognized geographic unit in urban areas, other spaces are important across the suburban landscape. In terms of residential real estate, much of the development in the U.S. during the last half century has been organized around subdivisions — which can include everything from a few homes within a gated community to a development with hundreds of properties. Attributes tied to subdivisions impact everything from quality of life to housing values.
A common research topic is education. Whether stratifying a sample by education level or examining the impact of funding levels on student performance, the relationship between numerous variables and education can be significant. In terms of geography, when looking at the public education system, researchers can use school district boundaries from the U.S. Census. But school districts often cover large areas (nearly 300 square miles on average) and have heterogeneous populations — which can make drawing conclusions about data aggregated by school district difficult.
An alternative geographic entity, and one that is significant for many research questions, is the set of areas that define which households attend specific public schools. Until recently, these attendance zones, or catchment areas, were available only from local school authorities. But there is now detailed attendance zone data available for schools covering more than 70 percent of the U.S. student population.
There are two primary approaches to conducting analysis based on the alternate geographic entities discussed above. Direct methods simply add attributes to data records to assign the proper geographic entity, while indirect methods translate data organized by standard entities to the alternatives.
Direct. For studies that include source data collection (versus using pre-existing data sets), researchers can simply tag data points with the appropriate alternate geography as they are collected. Also, any data that can be geocoded (basically, data with an address or even just a ZIP code) or that is already geotagged (has a latitude/longitude associated with it) can be related directly to any type of geographic entity, including the alternate areas discussed previously. For example, the addresses of a set of health clinics can be geocoded, and once each latitude/longitude is determined, each clinic’s location can be resolved to the boundary it falls within. With the proliferation of GPS-enabled devices, there is now a massive amount of geotagged data available. Everything from point-of-sale data to individual tweets is tagged with a lat/lon attribute and can be resolved to, and then analyzed by, virtually any geographic entity.
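To make the direct method concrete, here is a minimal sketch of resolving geotagged points to the boundary each falls within, using a standard ray-casting point-in-polygon test. The neighborhood names, square boundaries and clinic coordinates are invented for illustration; real boundary data would have far more complex shapes, and a production workflow would more likely rely on a GIS library than on hand-written geometry.

```python
from collections import Counter


def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: True if (lon, lat) lies inside the polygon.

    `polygon` is a list of (lon, lat) vertices; the last vertex connects
    back to the first. Points exactly on an edge may land either way.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            # Longitude at which the edge crosses that horizontal line.
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def tag_points(points, boundaries):
    """Return the matching boundary name for each point, or None."""
    tags = []
    for lon, lat in points:
        match = None
        for name, polygon in boundaries.items():
            if point_in_polygon(lon, lat, polygon):
                match = name
                break
        tags.append(match)
    return tags


# Hypothetical neighborhoods drawn as simple squares (assumption).
neighborhoods = {
    "Riverside": [(0, 0), (2, 0), (2, 2), (0, 2)],
    "Hilltop": [(2, 0), (4, 0), (4, 2), (2, 2)],
}
# Hypothetical geocoded clinic locations as (lon, lat) pairs (assumption).
clinics = [(0.5, 1.0), (3.0, 1.5), (1.5, 0.5), (5.0, 5.0)]

tags = tag_points(clinics, neighborhoods)
counts = Counter(tags)  # clinics per neighborhood; None means unmatched
```

Once each record carries a neighborhood tag, any ordinary grouping or counting step can summarize the data by neighborhood rather than by ZIP code or census tract.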
Indirect. In many cases, researchers must combine one or more pre-existing data sets or join collected data to demographics and other statistics that are only available in standard Census geographic areas. In these cases, it’s often still possible to use a variety of statistical and spatial processing methods to transpose data from Census areas to alternatives that are more meaningful for evaluation. For example, if basic demographics are needed as part of data analysis and the data is only available by block group, this data can be transposed to neighborhood areas using several techniques. One approach would be to take the geographic center-point (i.e., centroid) of the block groups and determine which neighborhoods they fall within and aggregate the data accordingly. Or, if more precision is required, the overlay of two sets of geographic entities can be calculated to assign demographic values based on overlay proportions.
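The two indirect techniques described above can be sketched side by side. To keep the geometry trivial, this example assumes every area is an axis-aligned rectangle given as (xmin, ymin, xmax, ymax), and the overlay method assumes population is spread evenly within each block group. Both are simplifying assumptions, and all names and figures are hypothetical.

```python
def area(box):
    """Area of a rectangle (xmin, ymin, xmax, ymax); zero if degenerate."""
    xmin, ymin, xmax, ymax = box
    return max(0.0, xmax - xmin) * max(0.0, ymax - ymin)


def overlap(a, b):
    """Area shared by two rectangles."""
    shared = (max(a[0], b[0]), max(a[1], b[1]),
              min(a[2], b[2]), min(a[3], b[3]))
    return area(shared)


def centroid_assign(block_groups, neighborhoods):
    """Give each block group's whole count to the neighborhood
    that contains its center point."""
    totals = {name: 0.0 for name in neighborhoods}
    for box, population in block_groups:
        cx = (box[0] + box[2]) / 2
        cy = (box[1] + box[3]) / 2
        for name, nb in neighborhoods.items():
            if nb[0] <= cx < nb[2] and nb[1] <= cy < nb[3]:
                totals[name] += population
                break
    return totals


def area_weighted(block_groups, neighborhoods):
    """Split each block group's count across neighborhoods in
    proportion to the share of its area that overlaps each one."""
    totals = {name: 0.0 for name in neighborhoods}
    for box, population in block_groups:
        for name, nb in neighborhoods.items():
            totals[name] += population * overlap(box, nb) / area(box)
    return totals


# Hypothetical neighborhoods and block groups (assumptions).
neighborhoods = {
    "Riverside": (0.0, 0.0, 2.0, 2.0),
    "Hilltop": (2.0, 0.0, 4.0, 2.0),
}
block_groups = [
    ((0.0, 0.0, 1.0, 2.0), 1000),  # entirely inside Riverside
    ((1.0, 0.0, 2.5, 2.0), 900),   # straddles the shared border
    ((2.5, 0.0, 4.0, 2.0), 600),   # entirely inside Hilltop
]

by_centroid = centroid_assign(block_groups, neighborhoods)
by_overlay = area_weighted(block_groups, neighborhoods)
```

The straddling block group shows the trade-off: the centroid method hands all 900 residents to one neighborhood, while the overlay method divides them by area. Neither is exact, since people are rarely distributed evenly across a block group.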
There are so many ways that alternate geography can be applied to answer interesting research questions and address policy and funding decisions. Even if only a subset of the data in a given study is examined in new ways — it may provide new insights into age-old questions. Here are several examples of how research or policy decisions might be improved by looking at data in a new way.
Health Policy. The U.S. Centers for Disease Control and Prevention track the spread of infectious diseases. Geography is an important element given the nature of how infections spread among populations. In many ways, proximity is the key determinant in looking at concentrations and movement of contagions. Proximity is an easy variable to consider in analysis. A simple radius approach can be used to draw virtual perimeters around infection clusters.
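As an illustration of the simple radius approach, the sketch below uses the haversine great-circle formula to pick out the points that fall within a fixed distance of a cluster center. The coordinates, the 5 km radius and the function names are hypothetical, and the formula treats the Earth as a sphere, which is an approximation.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_radius(center, points, radius_km):
    """Return the (lat, lon) points no farther than radius_km from center."""
    center_lat, center_lon = center
    return [
        (lat, lon)
        for lat, lon in points
        if haversine_km(center_lat, center_lon, lat, lon) <= radius_km
    ]


# Hypothetical cluster center and reported case locations (assumptions).
cluster_center = (40.0, -75.0)
cases = [(40.01, -75.0), (40.5, -75.0)]

nearby = within_radius(cluster_center, cases, radius_km=5.0)
```

A circle of this kind is easy to compute but ignores how people actually group themselves, which is the limitation that neighborhood and attendance zone boundaries are meant to address.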
Of course, proximity is a function of social ties and tendencies — and neighborhoods represent a unit of geography that reflects social groupings. In this way, neighborhoods are natural population boundaries that can be useful in looking at how diseases spread. Since school-age children are also a key factor in the spread of infections, another geographic entity that can be used by epidemiologists is the school attendance zone. Adding a geographic layer that shows the exact households from which children attend public school can provide meaningful data that allow health-care professionals to understand trends at a deeper level and take corrective action more quickly.
Consumer Lending. In 1977, the Community Reinvestment Act (CRA) was passed to help ensure banks offered services and credit in all areas — including low-income regions. The CRA created a set of self-reporting requirements for banks to demonstrate compliance. Because the CRA is tied directly to geographic areas and socioeconomic data, it makes sense that regulators would dictate that banks use Census geographic units as a way to group data in compliance reporting. While true neighborhoods can’t necessarily be substituted for census tracts in regulatory reporting, they can provide an interesting way to examine trends and contrast data sets. This kind of analysis is useful for governing bodies and the financial institutions themselves. Imagine if financial products and services could be tailored and marketed based on the population characteristics and preferences of true neighborhood areas. This type of target marketing can take advantage of the social connections inherent in locally defined spaces.
Crime. Every year, thousands of studies are conducted that examine crime in the U.S. America incarcerates a higher percentage of its population than any other nation. And crime is linked to many other socio-economic variables. There is a growing trend to tag crime incidents with location data. Local communities are using this data to display crime statistics in interactive Web maps and to make citizens more aware. In large urban areas, there is so much data that showing individual incidents is overwhelming; as a result, metro areas must be divided into smaller areas with statistics summarized for each. What better way to segment and present data than in the terms local residents would use: neighborhoods? Similarly, for research conducted at the national, regional or metro level, slicing and dicing crime statistics by neighborhood offers a great way to align results to the geographic entities that reflect local cultural distinctions and norms.