Making Sense of the Census

Census 2000 data can prove valuable to state and local governments, if they know where to find it and how to manipulate it.

by Seth Grimes / August 31, 2001
State and local governments and allied community and civic groups made an extraordinary effort to ensure the highest level of participation in the 2000 U.S. census. The stakes were high: The results are used to determine congressional district boundaries and the allocation of hundreds of billions of federal program dollars.

An added, but somewhat underutilized benefit is that the results are available for governments and the public to use however they wish. In fact, census summary results should prove as useful to state and local constituencies as they do to the federal government, informing all sorts of programming and planning decisions. So what data is available, and where can it be found?

Data Products
The first census results, the populations of the states, were released in December 2000. They have been followed by other population and housing statistics based on a survey of 100 percent of American households. Congressional redistricting data sets mandated by federal law were released in March 2001, followed by Summary File 1, a series of 286 detailed summary-data tables. Summary File 2, 47 tables produced for 250 iterations of race, ancestry and ethnicity, is slated for release this fall.

Data set releases are complemented by statistical briefs, which analyze particular topics and geographic areas, and by demographic profiles, which provide a concise summary of key statistics. Complete information on data products -- data sets, briefs and profiles -- and the release schedule is available online.

Census 2000 summary data sets are released on CD, DVD and the Internet. They are available for download and for interactive search, query and mapping via the Census Bureau's American FactFinder (AFF) Web site. The AFF site, which was built by IBM Global Services under contract to the Census Bureau, is an excellent tool for exploratory analysis, offering statistical tables, thematic maps and interfaces suitable for a range of users, from schoolchildren to subject-matter experts.

AFF launched in early 1999, offering data from the Census 2000 Dress Rehearsal conducted in April 1998, from the 1997 economic census and from the early phases of the Census Bureau's new American Community Survey. In light of early experiences, the bureau and IBM have enhanced the site's appeal, designing a cleaner interface without frames or cookies, refocusing on geographic areas rather than on particular surveys or data sets and adding convenient features such as an address locator that maps street addresses to census geographic areas. The site was also rebuilt to support thousands of simultaneous users. AFF is coded as Java servlets running in the IBM WebSphere application server, accessing an Oracle 8i data warehouse. It runs on IBM RS/6000 SP clusters, one serving internal Census Bureau users and the other serving the public.

Production and Analysis
Census 2000 summary data set production poses difficult technical problems, compounded by extreme accuracy, accountability and security needs and a strict release schedule. While data set users will be able to fruitfully work with the data using common desktop software, the steps that the production team went through may be instructive for power users.

The bureau's analysis system runs on an eight-processor IBM RS/6000 M80 with 16GB of memory and a four-terabyte disk storage system. The SuperSTAR analytical software suite from Space-Time Research of Melbourne, Australia, forms the heart of the system, providing a graphical user interface for Census Bureau users to compose tables and a fast tabulation engine. Although SuperSTAR is similar to many online analytical processing tools, it may be unique in its combination of ease of use and suitability for both ad-hoc and large-scale analysis of microdata classified according to large, hierarchical dimensions. For instance, census data are summarized according to geographic hierarchies with multiple branches that include up to eight levels between the highest and lowest levels and up to 750,000 elements in the case of Texas, the state with the largest number of census geographic areas.

The analysis system uses SAS for data preparation and output processing, both of which involve extensive checks of data consistency and accuracy. Although scripting tools such as Perl or Python could fill these roles, SAS is more "data aware." In addition, the SAS macro language lends itself to writing dynamic programs driven by parameter files that describe the data products.
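To illustrate, the parameter-driven pattern can be sketched in a few lines of Python. The table names, field lists and record layout below are invented for illustration; they are not taken from the bureau's actual parameter files or SAS macros:

```python
# Hypothetical sketch of parameter-driven output generation, in the spirit
# of the bureau's SAS macro approach. Every name here is illustrative.

def build_extract(records, table_spec):
    """Project each record onto the fields named in a table spec."""
    return [{field: rec[field] for field in table_spec["fields"]}
            for rec in records]

# A parameter "file": one entry per output table.
SPECS = [
    {"name": "P1_TOTAL_POP", "fields": ["geo_id", "total_pop"]},
    {"name": "H1_HOUSING",   "fields": ["geo_id", "housing_units"]},
]

records = [
    {"geo_id": "48", "total_pop": 20851820, "housing_units": 8157575},
]

for spec in SPECS:
    print(spec["name"], build_extract(records, spec))
```

Adding an output table then means adding a parameter entry, not writing a new program, which is the appeal of the macro-driven approach the article describes.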

The Unix platform offers the highest level of reliability given a heavy, heterogeneous processing load. A data set like Summary File 1 takes about two months to compute on a fully loaded machine running SuperSTAR tabulations, Java/JDBC code building SuperSTAR databases, SAS programs, shell scripts, interactive sessions, monitoring utilities and nightly backups, all while supporting remote SuperSTAR query clients.

For census data users, understanding the data sets is the first hurdle. Refer to the data-product documentation, available online as a PDF file, to find out about statistical-table contents, data set record layout and field definitions. The product documentation also provides value lists, survey accuracy, geographic coverage and other important background information. Unfortunately, a "geographic identifier" is the only metadata -- data describing the data sets and their contents -- provided by the Census Bureau in a format suitable for direct loading to analysis tools, so users will need to build data dictionaries and extraction, transformation and loading procedures.
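A minimal extraction-and-loading sketch in Python shows the shape of that work. It assumes the Summary File convention that each comma-delimited data segment begins with five identifying fields (FILEID, STUSAB, CHARITER, CIFSN, LOGRECNO) followed by the table cells; the data-dictionary field names below are illustrative stand-ins for entries you would build from the product documentation:

```python
import csv
import io

# Hypothetical data dictionary for one segment, hand-built from the PDF
# documentation. Real segments carry dozens or hundreds of cells.
DICT = ["P001001", "P002001"]

def load_segment(fileobj):
    """Parse a data segment into records keyed by LOGRECNO, the logical
    record number that links data segments to the geographic header."""
    out = {}
    for row in csv.reader(fileobj):
        logrecno = row[4]                      # fifth field, per SF layout
        out[logrecno] = dict(zip(DICT, row[5:]))
    return out

# A one-record sample segment standing in for a real file.
segment = io.StringIO("uSF1,TX,000,01,0000001,20851820,20851820\n")
print(load_segment(segment))
# {'0000001': {'P001001': '20851820', 'P002001': '20851820'}}
```

Verify the field order and segment contents against the data-product documentation before relying on a dictionary like this.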

Power users will need to master data concepts and data set formats, and find capable analysis tools. Both SuperSTAR and SAS can handle sparse output data sets with tens of thousands of fields and millions of records. (Sparse data has a very high proportion of zero values.) But not every tool has this capability: Tables in an Oracle 9i database, for example, are limited to 1,000 fields. You can successfully work with census summary data using desktop spreadsheet and database tools -- the data sets are partitioned into segments of 256 or fewer fields to ease the way for desktop-tool users -- but you'll have to be careful to load only subsets that cover your topical and geographic interests.
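One way to keep a desktop-scale load manageable is to filter the fixed-width geographic header file to a single summary level before joining any data segments. The sketch below assumes SUMLEV sits in columns 9-11 and LOGRECNO in columns 19-25 of the geographic header record; check those positions against the documentation before using them:

```python
def logrecnos_at_level(lines, sumlev="050"):
    """Yield LOGRECNOs for geographic records at one summary level
    (e.g. "050" for counties). Column positions are assumptions taken
    from the SF1 geographic header layout -- verify before use."""
    for line in lines:
        if line[8:11] == sumlev:           # SUMLEV, assumed cols 9-11
            yield line[18:25].strip()      # LOGRECNO, assumed cols 19-25

# Two synthetic geo-header records: a state ("040") and a county ("050").
geo = [
    "uSF1  TX040" + " " * 7 + "0000001" + " " * 40,
    "uSF1  TX050" + " " * 7 + "0000002" + " " * 40,
]
print(list(logrecnos_at_level(geo)))   # ['0000002']
```

Loading only the LOGRECNOs that survive this filter keeps the subsequent segment joins well within spreadsheet and desktop-database limits.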

Many difficult usage issues are out of product-documentation scope, including establishing comparability between 2000 and 1990 results. The biggest comparability challenges are dealing with redrawn geographic boundaries, changes in racial classifications and figuring out how to map 1990 numbers to 2000 numbers. The 2000 survey was the first decennial census to allow individuals to be reported as belonging to more than one race.
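One common approach to the boundary problem is a weighted crosswalk that reallocates 1990 counts onto 2000 geography. The tract identifiers and weights below are invented for illustration; in practice the weights come from a boundary-overlap or block-level crosswalk file:

```python
# Hypothetical crosswalk: (1990 tract, 2000 tract, share of the 1990
# tract's population falling inside the 2000 tract). Invented values.
CROSSWALK = [
    ("T90-A", "T00-1", 1.0),   # 1990 tract A lies wholly in 2000 tract 1
    ("T90-B", "T00-1", 0.4),   # 1990 tract B was split between two
    ("T90-B", "T00-2", 0.6),   # 2000 tracts
]

def reallocate(counts_1990, crosswalk):
    """Apportion 1990 counts onto 2000 geography via the weights."""
    out = {}
    for src, dst, weight in crosswalk:
        out[dst] = out.get(dst, 0.0) + counts_1990[src] * weight
    return out

print(reallocate({"T90-A": 100, "T90-B": 50}, CROSSWALK))
# {'T00-1': 120.0, 'T00-2': 30.0}
```

The reallocated figures are estimates, of course: the weights assume population is spread evenly across each split, which is rarely exactly true.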

Looking Forward
Although planning for the 2010 census has already started, there's more to look forward to before the next decennial round. Notably, the Census Bureau has created the American Community Survey (ACS), a "continuous measurement" instrument designed to replace the decennial-census long form. ACS began with a 1996 demonstration survey of four localities; a 1999 survey of 31 sites kicked off a phase comparing ACS and Census 2000 results. The full program will launch in 2003 with a sample of more than three million households covering every county in the United States. Once sufficient data is accumulated, ACS will facilitate creation of small-geographic-area demographic statistics that will be of great use to local governments. And the Census Bureau will conduct the next economic census, a survey of American businesses, in 2002. Both ACS and economic census results are disseminated through American FactFinder alongside decennial census data.

Census 2000 summary data should prove invaluable to state and local organizations and to the public. There's a lot available already and more to come, yours to explore and download.
Seth Grimes, Special to Government Technology