The Census Bureau has announced a new set of standards and methods for disclosure control in public use data products. According to the Census Bureau, the new approach “marks a sea change for the way that official statistics are produced and published” and represents “the death knell for public-use detailed tabulations and microdata sets as they have been traditionally prepared.” The reason for these changes is concern about respondent confidentiality, even though the decennial census and American Community Survey (ACS) research data files have an unblemished record of confidentiality. As the Census Bureau acknowledges, there has never been a single documented case where the identity of a respondent in the ACS or decennial census has been revealed by someone outside the Census Bureau.
IPUMS is concerned that scientists, planners, and the public will soon lose the free access we have enjoyed for the past six decades to reliable public Census Bureau data describing American social and economic change. This page reports what we have learned about the new data products.
Use this form to join our mailing list for updates on the Bureau’s evolving plans and to tell us about how the proposed changes might affect your research.
DIFFERENTIAL PRIVACY IN THE 2020 CENSUS
The Census Bureau has already begun using a new disclosure avoidance system for the summary files of the 2020 Census. These data files cover a limited range of subjects, since the census asks only a few questions, but they are still one of the nation’s most used public data resources, essential for redistricting, allocation of funds, urban and regional planning, and studies of residential segregation. Given the complete coverage of the decennial census, these data provide a crucial baseline for surveys and estimates throughout each decade. They are also the only source of high-quality nationwide data for small areas, for which survey sample sizes (from the American Community Survey or other sources) are typically too small to produce reliable estimates.
The Census Bureau plans to release only "differentially private" data from the 2020 Census. These data will have intentional errors added to nearly all statistics, including even the total populations of all geographic units below the state level.
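To make the idea of intentionally added error concrete, here is a minimal sketch of the textbook Laplace mechanism applied to a few counts. It is an illustration only: the block names, populations, and the `epsilon` value are invented, and the Bureau's production system is considerably more elaborate, so this should not be read as a reproduction of it.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from a Laplace(0, scale) distribution.

    The difference of two independent exponential draws with mean `scale`
    follows a Laplace distribution centered at zero.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count, epsilon, rng):
    """Return a count with Laplace noise of scale 1/epsilon added.

    Assumes one person changes the count by at most 1 (sensitivity 1).
    """
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical block populations, invented for illustration.
true_blocks = {"block A": 12, "block B": 87, "block C": 3}

rng = random.Random(2020)
epsilon = 0.5  # hypothetical privacy-loss parameter; smaller means more noise
noisy_blocks = {
    name: noisy_count(count, epsilon, rng) for name, count in true_blocks.items()
}

for name, count in true_blocks.items():
    print(f"{name}: true={count}, published={noisy_blocks[name]:.1f}")
```

In this sketch the noise has the same expected size for every block, so the relative error tends to be largest where the true count is smallest. Raw noisy values can also be fractional or negative, which is why a real system needs further processing before publication.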
The Census Bureau justifies the new disclosure controls by citing the threat of database reconstruction, which is a technique for inferring individual-level responses from tabular data. Our analysis, however, determined that the threat of database reconstruction was minimal. The Census Bureau's attempt to reconstruct the 2010 Census from published tabulations produced incorrect results in most cases and did not perform much better than random guesses of people's characteristics. As Acting Director of the Census Bureau Ron Jarmin concluded, “The accuracy of the data our researchers obtained from this study is limited, and confirmation of reidentified responses requires access to confidential internal Census Bureau information … an external attacker has no means of confirming them.”
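The logic of database reconstruction can be sketched with a toy example. The code below enumerates every set of person records consistent with two published one-way tables for a hypothetical three-person block. The records, categories, and function names are all invented for illustration, and the brute-force search is not the method used in the Bureau's experiment.

```python
from collections import Counter
from itertools import combinations_with_replacement, product

SEXES = ("F", "M")
AGE_GROUPS = ("0-17", "18-64", "65+")


def tabulate(records):
    """Build the two one-way tables a publisher might release for a block."""
    by_sex = Counter(sex for sex, _ in records)
    by_age = Counter(age for _, age in records)
    return by_sex, by_age


def reconstruct(by_sex, by_age):
    """Return every multiset of (sex, age) records matching both tables."""
    size = sum(by_sex.values())
    cells = list(product(SEXES, AGE_GROUPS))
    return [
        candidate
        for candidate in combinations_with_replacement(cells, size)
        if tabulate(candidate) == (by_sex, by_age)
    ]


# Hypothetical confidential records for a three-person block.
true_records = [("F", "18-64"), ("M", "18-64"), ("F", "65+")]

published = tabulate(true_records)
candidates = reconstruct(*published)

print(f"{len(candidates)} candidate reconstructions:")
for candidate in candidates:
    print(candidate)
```

Here two different sets of records fit the published tables equally well, so the tables alone cannot say which one is real. Even a candidate that happens to match tells an attacker nothing about who the people are unless it can be confirmed against other information.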
To allow others to assess the impact of differential privacy on data usability, the Census Bureau has produced a series of demonstration products, each providing a different version of differentially private 2010 census data that users can compare with the originally published 2010 data. The demonstration data released in June 2021 are based on the production system for 2020 Redistricting Data, so the added errors in this demo product are representative of those in published 2020 data tables.
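A comparison of this kind can be sketched as follows. The tract names and counts are invented for illustration and do not come from any demonstration product, and `error_summary` is a hypothetical helper rather than part of any Census Bureau or IPUMS tool.

```python
from statistics import mean


def error_summary(original, demo):
    """Compare two versions of the same count table, keyed by area ID."""
    shared = sorted(set(original) & set(demo))
    abs_errors = [abs(demo[area] - original[area]) for area in shared]
    # Percent error is undefined where the original count is zero,
    # so those areas are skipped for that metric.
    pct_errors = [
        abs(demo[area] - original[area]) / original[area] * 100
        for area in shared
        if original[area] > 0
    ]
    return {
        "areas": len(shared),
        "mean_abs_error": mean(abs_errors),
        "max_abs_error": max(abs_errors),
        "mean_abs_pct_error": mean(pct_errors),
    }


# Hypothetical total-population counts, invented for illustration.
original_2010 = {"tract 1": 4200, "tract 2": 150, "tract 3": 20}
demo_2010 = {"tract 1": 4195, "tract 2": 156, "tract 3": 24}

summary = error_summary(original_2010, demo_2010)
for metric, value in summary.items():
    print(f"{metric}: {value}")
```

In this made-up example the absolute errors are all similar in size, but the percent error is far larger for the smallest area. That pattern is why summaries for small areas and small population groups deserve separate attention.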
IPUMS, along with collaborators at the University of Washington, the University of Tennessee, and NORC at the University of Chicago, received a grant from the Alfred P. Sloan Foundation to analyze the demonstration files, and other groups from Harvard and CUNY also undertook analyses. These studies investigated only earlier versions of the demo data, and not all of the studies have been publicly released, but results thus far suggest that the new disclosure avoidance system will have adverse impacts for redistricting and for many research applications.
- Feedback on the April 2021 Census Demonstration Files. Van Riper, Schroeder, and Ruggles
- Does the Quality of the Census April 28, 2021, Census Demonstration Product (with an Epsilon of 12.2) Mean that Such a Product Would Be “Fit for Use” for Redistricting? Beveridge
- The Impact of the U.S. Census Disclosure Avoidance System on Redistricting and Voting Rights Analysis. Kenny et al.
- State of Washington Feedback on the April 2021 Census Demonstration Files. Mohrman
SYNTHETIC MICRODATA FROM THE AMERICAN COMMUNITY SURVEY
The American Community Survey (ACS) microdata is by far the most intensively used dataset disseminated by IPUMS and is a core dataset across social science and health research. Common topics of analysis include poverty, inequality, immigration, internal migration, ethnicity, disability, transportation, fertility, marriage, occupations, education, and family structure.
At the April 2021 ACS Data Users conference, the Census Bureau announced that it would replace the ACS research data with “fully synthetic” data over the next three years. A week after the conference, following an uproar on Twitter, the Census Bureau backtracked, and it now says that there is no firm timeline for implementing simulated ACS data. In a December 2022 announcement, the Census Bureau indicated that it was researching the feasibility of creating synthetic microdata and an accompanying validation service, with an expected “multiyear development period, including data user review and feedback, that will extend beyond 2025.” The Bureau has not yet announced a formal process for evaluation of the change, as is required under the Administrative Procedure Act.
Although the Bureau has now extended its research timeline indefinitely, it appears to remain committed to a long-term plan to release only synthetic microdata to the public. The idea is to develop statistical models describing the interrelationships of the variables in the ACS and then construct microdata for a simulated population consistent with those models. Such modeled data capture relationships between variables only if those relationships have been intentionally built into the model. Accordingly, synthetic data are poorly suited to studying unanticipated relationships, which impedes new discovery. Most analyses currently conducted with the ACS are likely to become impossible with the shift to synthetic data. For example, the ACS makes it easy for investigators to measure ethnic intermarriage, or the impact of a partner’s education on women’s fertility. The synthetic data would likely incorporate only individual-level interrelationships among variables, in which case analysis across household members would be impossible.
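The following sketch illustrates how a relationship that is not built into the model disappears from the synthetic output. It uses a deliberately naive synthesizer that models each person separately. The education categories, the matching probability, and all function names are assumptions made for illustration, and the Bureau has not published the model it would use.

```python
import random

EDU_LEVELS = ("HS or less", "Some college", "BA or more")


def make_real_couples(n, rng, same_edu_prob=0.7):
    """Simulate couples whose education levels tend to match."""
    couples = []
    for _ in range(n):
        first = rng.choice(EDU_LEVELS)
        if rng.random() < same_edu_prob:
            second = first
        else:
            second = rng.choice(EDU_LEVELS)
        couples.append((first, second))
    return couples


def synthesize_individually(couples, rng):
    """Naive synthesizer: draw each partner independently from the pooled
    individual-level distribution, ignoring who lives with whom."""
    pool = [edu for couple in couples for edu in couple]
    return [(rng.choice(pool), rng.choice(pool)) for _ in couples]


def share_matching(couples):
    """Share of couples in which both partners have the same education."""
    return sum(first == second for first, second in couples) / len(couples)


rng = random.Random(7)
real = make_real_couples(5000, rng)
synthetic = synthesize_individually(real, rng)

real_share = share_matching(real)
synthetic_share = share_matching(synthetic)
print(f"same-education couples, simulated 'real' data: {real_share:.2f}")
print(f"same-education couples, naive synthetic data:  {synthetic_share:.2f}")
```

The synthetic file keeps roughly the same individual distribution of education, yet the tendency of partners to share an education level falls to about what chance alone would produce. An analyst working only with the synthetic file would have no way to detect that the pattern ever existed.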
The Bureau apparently recognizes that the synthetic ACS microdata will not be adequate for research. The Bureau therefore proposes a system whereby investigators would develop analyses using synthetic data, and then submit them to the Census Bureau for “validation” using real data. This would preclude exploratory analyses on the real data, and would probably be logistically infeasible.
SMALL-AREA DATA FROM THE AMERICAN COMMUNITY SURVEY
The Census Bureau previously announced that the ACS summary data would also be made "formally private" by 2025. In December 2022, following inquiries from data users, the Census Bureau reported that “the science does not yet exist to comprehensively implement a formally private solution for the ACS” and indicated that the development period would extend beyond 2025.
Updates & Research Reports
- Census Bureau Disclosure Avoidance System (DAS) Updates
- David Van Riper, "Differential Privacy and the 2020 Decennial Census"
- David Van Riper, "Differential Privacy and the Decennial Census"
- Ruggles, Steven, Catherine Fitch, Diana Magnuson, and Jonathan Schroeder. 2019. “Differential Privacy and Census Data: Implications for Social and Economic Research.” AEA Papers and Proceedings, 109: 403-08.
- Task Force on Differential Privacy for Census Data, Implications of Differential Privacy for Census Bureau Data and Research
- Census Bureau 2020 Data Release Schedule
- Ruggles, S. and Van Riper, D. 2021. "The Role of Chance in the Census Bureau Database Reconstruction Experiment." Population Research and Policy Review.
- Hauer, M. E., & Santos-Lozada, A. R. (2021). Differential privacy in the 2020 census will distort COVID-19 rates. Socius.
- Kenny, C. T., Kuriwaki, S., McCartan, C., Rosenman, E. T., Simko, T., & Imai, K. (2021). The use of differential privacy for census data and its impact on redistricting: The case of the 2020 US Census. Science Advances, 7(41).
- Santos-Lozada, A. R., Howard, J. T., & Verdery, A. M. (2020). How differential privacy will affect our understanding of health disparities in the United States. PNAS, 117(24), 13405–13412.
- Winkler, R.L., Butler, J.L., Curtis, K.J. et al. (2021). Differential privacy and the accuracy of county-level net migration estimates. Population Research and Policy Review.
- NASEM Committee on National Statistics. 2020 Census Data Products: Workshop on the Demographic and Housing Characteristics Files
- Asquith, B. et al. (2022). Assessing the Impact of Differential Privacy on Measures of Population and Racial Residential Segregation. Harvard Data Science Review, (Special Issue 2). https://doi.org/10.1162/99608f92.5cd8024e.
- Hotz, V. J. & Salvo, J. (2022). A Chronicle of the Application of Differential Privacy to the 2020 Census. Harvard Data Science Review, (Special Issue 2). https://hdsr.mitpress.mit.edu/pub/ql9z7ehf/release/7?readingCollection=63678f6d.
- Hotz, V. J. et al. (2022). Balancing data privacy and usability in the federal statistical system. Proceedings of the National Academy of Sciences, 119 (31) e2104906119.
- Census Bureau DAS Demonstration Data & Metrics
- IPUMS NHGIS Privacy-Protected Demonstration Data
- In these data files, NHGIS has linked together two versions of 2010 Census summary tables: (1) original tables from the 2010 Census Summary Files, and (2) new tables based on different vintages of the Census Bureau's differentially private demonstration data.
- The 2021-06-08 vintage includes demo data based on the final production system for 2020 Redistricting Data, so this vintage may be used to model the error distribution in published 2020 data tables.
- The 2022-03-16 and 2022-08-25 vintages include demo data based on the disclosure avoidance system designed for the Demographic and Housing Characteristics File (DHC), which corresponds to Summary File 1 from previous decennial censuses. These demo data, like the planned DHC, include tables on sex, age, race, ethnicity, household and group quarters type, and housing tenure. The DHC file is now slated for release in May 2023.
We will continue to gather relevant information for the IPUMS user community, post it here, and share it via IPUMS Twitter.