August 31, 2001 By Seth Grimes
The analysis system uses SAS for data preparation and output processing, both of which involve extensive checks of data consistency and accuracy. Although scripting tools such as Perl or Python could fill these roles, SAS is more "data aware." In addition, the SAS macro language makes it fairly easy to write dynamic programs driven by parameter files that describe data products, and the system uses it for exactly that purpose.
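To make the idea of a parameter-driven consistency check concrete, here is a minimal sketch in Python (one of the scripting alternatives mentioned above), not the SAS macro code the system actually uses. The parameter format, the product name and the field names (`T1_TOTAL`, `T1_MALE`, `T1_FEMALE`) are all invented for illustration; a real parameter file would follow whatever layout the production system defines.

```python
import json

# Hypothetical parameter file content: each check names a total field and
# the component fields that should add up to it. All names are invented.
PARAMS = json.loads("""
{
  "product": "EXAMPLE",
  "checks": [
    {"total": "T1_TOTAL", "parts": ["T1_MALE", "T1_FEMALE"]}
  ]
}
""")


def find_inconsistencies(records, params):
    """Return (record_index, total_field) pairs where the parts do not sum to the total."""
    problems = []
    for i, rec in enumerate(records):
        for check in params["checks"]:
            parts_sum = sum(rec[name] for name in check["parts"])
            if parts_sum != rec[check["total"]]:
                problems.append((i, check["total"]))
    return problems


# Two made-up records: the second one is deliberately inconsistent.
records = [
    {"T1_TOTAL": 10, "T1_MALE": 4, "T1_FEMALE": 6},
    {"T1_TOTAL": 10, "T1_MALE": 4, "T1_FEMALE": 5},
]
problems = find_inconsistencies(records, PARAMS)
```

Because the rules live in the parameter data rather than in the function, the same code can check a different data product by swapping in a different parameter file.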
The Unix platform offers the highest level of reliability given a heavy, heterogeneous processing load. A data set like Summary File 1 takes about two months to compute on a fully loaded machine. That load includes SuperSTAR tabulations, Java/JDBC code building SuperSTAR databases, SAS programs, shell scripts, interactive sessions, monitoring utilities and nightly backups, plus support for remote SuperSTAR query clients.
For census data users, understanding the data sets is the first hurdle. Refer to the data-product documentation, available online as a PDF file, to find out about statistical-table contents, data set record layout and field definitions. The product documentation also provides value lists, survey accuracy, geographic coverage and other important background information. Unfortunately, a "geographic identifier" is the only metadata -- data describing the data sets and their contents -- that the Census Bureau provides in a format suitable for direct loading into analysis tools, so users will need to build their own data dictionaries and their own extraction, transformation and loading procedures.
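As a rough illustration of what a hand-built data dictionary and loading procedure might look like, the Python sketch below applies a small dictionary to delimited records. It assumes a comma-delimited layout, and the field names (`GEO_ID`, `TOTAL_POP`, `HOUSEHOLDS`) and sample values are invented; the real layouts and field definitions must come from the product documentation.

```python
import csv
import io

# Hypothetical data dictionary, transcribed by hand from product documentation.
# Field names, types and labels here are illustrative only.
DATA_DICTIONARY = [
    {"name": "GEO_ID", "type": str, "label": "Geographic identifier"},
    {"name": "TOTAL_POP", "type": int, "label": "Total population"},
    {"name": "HOUSEHOLDS", "type": int, "label": "Number of households"},
]


def load_records(text_stream, dictionary):
    """Parse delimited rows into typed dicts keyed by the dictionary's field names."""
    records = []
    for line_no, row in enumerate(csv.reader(text_stream), start=1):
        # Reject rows that do not match the documented record layout.
        if len(row) != len(dictionary):
            raise ValueError(
                f"line {line_no}: expected {len(dictionary)} fields, got {len(row)}"
            )
        records.append(
            {field["name"]: field["type"](value) for field, value in zip(dictionary, row)}
        )
    return records


# Made-up sample data standing in for a downloaded file.
sample = io.StringIO("A001,1200,450\nA002,830,310\n")
records = load_records(sample, DATA_DICTIONARY)
```

Keeping the dictionary as data means the same loader can serve other tables, and the field-count check catches a mismatch between the documentation and the file early.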
Power users will need to master data concepts and data set formats, and find capable analysis tools. Both SuperSTAR and SAS can handle sparse output data sets with tens of thousands of fields and millions of records. (Sparse data has a very high proportion of zero values.) But not every tool has this capability: Tables in an Oracle 9i database, for example, are limited to 1,000 fields. You can successfully work with census summary data using desktop spreadsheet and database tools -- the data sets are partitioned into segments of 256 or fewer fields to ease the way for desktop-tool users -- but you'll have to be careful to load only subsets that cover your topical and geographic interests.
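The subsetting advice can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the header names, the idea of selecting geography by identifier prefix, and the sample rows are all hypothetical, and `to_sparse` simply shows one way to hold mostly-zero rows compactly.

```python
def extract_subset(rows, header, wanted_fields, geo_prefix, geo_field="GEO_ID"):
    """Keep only rows for the chosen geography and only the requested fields."""
    field_positions = [header.index(name) for name in wanted_fields]
    geo_position = header.index(geo_field)
    return [
        [row[i] for i in field_positions]
        for row in rows
        if row[geo_position].startswith(geo_prefix)
    ]


def to_sparse(values):
    """Store only the nonzero cells of a row as {position: value}."""
    return {i: v for i, v in enumerate(values) if v != 0}


# Made-up segment: a geographic identifier followed by three count fields.
header = ["GEO_ID", "P1", "P2", "P3"]
rows = [
    ["24001", 5, 0, 0],
    ["24003", 0, 0, 7],
    ["51001", 1, 2, 3],
]

# Keep one topic (P3) for geographies whose identifier starts with "24".
subset = extract_subset(rows, header, ["GEO_ID", "P3"], "24")
sparse_row = to_sparse(rows[0][1:])
```

Filtering before loading keeps a desktop tool within its field and row limits, and the sparse form avoids storing the many zero cells.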
Many difficult usage issues fall outside the scope of the product documentation, including establishing comparability between 2000 and 1990 results. The biggest comparability challenges are dealing with redrawn geographic boundaries, handling changes in racial classifications and figuring out how to map 1990 numbers to 2000 numbers. The 2000 survey was the first decennial census to allow individuals to be reported as belonging to more than one race.
Although planning for the 2010 census has already started, there's more to look forward to before the next decennial round. Notably, the Census Bureau has created the American Community Survey (ACS), a "continuous measurement" instrument designed to replace the decennial-census long form. ACS began with a 1996 demonstration survey of four localities; a 1999 survey of 31 sites kicked off a phase comparing ACS and Census 2000 results. The full program will launch in 2003 with a sample of more than three million households covering every county in the United States. Once sufficient data is accumulated, ACS will facilitate creation of small-geographic-area demographic statistics that will be of great use to local governments. And the Census Bureau will conduct the next economic census, a survey of American businesses, in 2002. Both ACS and economic census results are disseminated through American FactFinder alongside decennial census data.
Census 2000 summary data should prove invaluable to state and local organizations and to the public. There's a lot available already and more to come, yours to explore and download.