September 17, 2002
Before the United States House of Representatives
Committee on Government Reform
Subcommittee on Government Efficiency, Financial Management and Intergovernmental Relations
Mr. Chairman, Members of the Subcommittee, thank you for inviting me to testify on the "Confidential Information Protection and Statistical Efficiency Act of 2002," H.R. 5215. I ask that my written testimony be entered into the record.
I am the Project Manager of The Heritage Foundation's Center for Data Analysis (CDA). I help direct the work of researchers who routinely use a wide variety of data supplied by the federal government. In addition, the CDA has entered into licensing agreements with a few federal agencies that permit our analysts to use data that are not generally available to the public.
Although The Heritage Foundation is recognized as a conservative public policy research institution, our analysts work with those from diverse ideological perspectives on issues involving access to quality data. This is the reason why The Heritage Foundation is a member of broad-based organizations such as the Association of Public Data Users (APDU) and is an affiliate member of the Council of Professional Associations on Federal Statistics (COPAFS). It should be noted that the following testimony is my own view and does not necessarily reflect that of The Heritage Foundation, or any other organization.
Three standards for improving federal statistical
Government statistics are an indispensable component to much of the work done by policy makers. Obvious examples include economic indicators such as inflation and unemployment and budgetary estimates involving taxes and the overall level of spending. Crime, education and health care are just a few of the other public policy areas in which statistics are regularly used to better understand social problems and evaluate programs that may affect them.
Today, I would like to discuss three standards that should guide any proposal to improve America's statistical system. These standards are: (1) protection of individual identity for the respondents who provide original data, (2) production of useful, timely information for data users, and (3) independent evaluations of the data for decision-makers. These are the three I's of statistical policy: Identity protection, Information value, and Independent evaluation.
The need to improve federal statistical policy is directly related to our nation's dependence on high quality statistics. Data sharing provisions, such as those contained in H.R. 5215, can improve the quality of economic statistics produced by the government. In addition, with appropriate modifications, the identity of those providing data can be better protected by confidentiality policies such as those in H.R. 5215. However, as I will explain later, it is crucial that the language used to protect confidentiality not inadvertently and unnecessarily eliminate the type of data access that is currently available. After allowing for reasonable adjustments to protect the identity of respondents, the public should have access to the greatest amount of data possible. In addition, data should be provided in a form that allows nongovernment researchers to provide alternative interpretations of information produced by the government's statisticians.
Two of the principles cited above have been applied in H.R. 5215. The sections concerning statistical efficiency contained in Title 2 are examples of measures that can enhance the value of information by improving the accuracy and timeliness of economic data. I have left the more detailed discussion of these issues to the economists and information providers who work daily with these data. My testimony will focus primarily on the identity protection aspects of Title 1. I will also discuss the importance of data access to nongovernment researchers.
Standard 1: Identity protection
Given the importance of numbers to government decision-makers, it is perhaps surprising that the federal statistical system is so fragmented and confusing. Individual agencies have been added to the U.S. statistical system over a period of many years and for different legislative reasons. Over 70 agencies participate in the collection, preparation, and dissemination of data collected from administrative records, surveys and censuses. While some agencies routinely generate wide-ranging products (e.g., the Bureau of the Census) others focus on more specific areas. In addition, statistics are produced as by-products in data collection associated with administrative tasks (e.g., the Internal Revenue Service).
The growth of America's statistical system has produced not only a confusing set of statistical agencies but also an inconsistent set of laws and policies designed to protect the confidentiality of respondents who supply the government with data.1 Some of the interagency coordination problems among the Bureau of the Census, the Bureau of Labor Statistics, and the Bureau of Economic Analysis would be reduced by changes such as those in Title 2 of H.R. 5215.
In addition, the legislation provides a new set of definitions and protections of confidentiality that would apply throughout the government. Protections such as these are important because the federal statistical system faces a serious problem of declining public trust in government, specifically trust that a respondent's identity will be kept confidential and that respondents will not be harmed by the information they supply. A uniform policy to protect the confidentiality of data providers is basic to the development of high-quality data. Unless respondents can be assured that the data they provide to the government for statistical purposes will not be used against them through regulations or other enforcement efforts, they will either not provide data or they will report inaccurate information. In either case, the effect is to create measurement biases and errors.
Unfortunately, Congress is not actively considering any proposal that would replace the current system with a coherent and comprehensive set of rules for the protection of confidentiality. Nevertheless, standards such as those in H.R. 5215 provide a framework for resolving these differences in the future. An important first step is to clearly distinguish between statistical and administrative data.
The government collects a vast amount of administrative data in conjunction with federally funded programs. With appropriate safeguards, these data can be used for research purposes. For example, administrative data can be used to determine whether federal job training programs are effective in raising the incomes of workers. However, data collected for statistical purposes should rarely, if ever, be used for administrative reasons.
Those who provide data to statistical agencies should not have to worry that the government will use their individual responses to decrease a monthly benefit check, increase their tax liability, or impose a fine for violating a government regulation. Confidentiality protections that clearly distinguish between statistical and nonstatistical purposes, such as those found in H.R. 5215, will help reinforce this important difference.
Statistical agencies must also protect the identity of individuals who provide data that may eventually be released to the public. Agencies protect confidentiality by modifying or suppressing data that could be used to directly or indirectly identify an individual respondent. Items such as names, addresses and identifying codes such as social security numbers are removed from publicly available databases.
In addition, reasonable steps are taken to ensure that statistical disclosure does not occur. Statistical disclosure can occur if the information that is released is so detailed that analysts can, with a high degree of probability, associate the information with a specific person or business. Statistical agencies use procedures to alter data in order to reduce the chance that this type of disclosure will occur. Examples of these adjustments include cell suppression, the random modification of data, and the use of topcoding.2 The effect is to produce a database that is similar to the original file but with anonymous information. Data in this form limit the risk that the identity of respondents can be exposed through indirect means. Provisions for protecting individual identities can be found in plans such as H.R. 5215, which prohibit the release of data in a form that could reasonably be expected to either directly or indirectly yield the identity of a respondent.
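Two of the adjustments named above, topcoding and cell suppression, can be illustrated with a minimal Python sketch. It is not any agency's actual procedure. The function names, the income cap of 150,000, and the minimum cell count of 3 are all hypothetical values chosen for illustration; real disclosure-limitation rules are more elaborate.

```python
from collections import Counter


def topcode(values, cap):
    """Topcoding: report any value above `cap` as `cap` itself,
    so that an unusually large value cannot single out a respondent."""
    return [min(v, cap) for v in values]


def suppress_small_cells(categories, min_count):
    """Cell suppression: tabulate the categories, then withhold (None)
    any cell whose count falls below `min_count`."""
    counts = Counter(categories)
    return {cat: (n if n >= min_count else None) for cat, n in counts.items()}


# Hypothetical microdata: one very high income and one rare occupation.
incomes = [18_000, 42_000, 55_000, 1_250_000]
occupations = ["teacher", "teacher", "teacher", "astronaut"]

print(topcode(incomes, cap=150_000))
print(suppress_small_cells(occupations, min_count=3))
```

In both cases the released file stays close to the original, but the records most likely to identify a respondent indirectly are blurred or withheld. The cost is that some information, here the true top income and the rare occupation count, is no longer available to users.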
Standard 2: Information value
Although necessary, procedures that protect confidentiality also tend to reduce the amount and the value of data that can be released. Technical adjustments to the data by statistical agencies reduce the usefulness of the data that are available to the public and to researchers. It is vital that the methods adopted to protect individual identity do not inadvertently or unnecessarily reduce the amount of information available to the public.3
It is important that a distinction be made between a respondent's identity and the data the respondent provides. Individual-level data are often referred to as microdata files because they contain information about individual persons, families, business entities or some other individual unit. They include items such as age, race, sex, education levels, income and expenses. Examples of these files include the Current Population Survey, the Consumer Expenditure Survey, and the Survey of Consumer Finances. These files provide the basis for much of the social and economic research conducted by analysts in academic institutions and in public policy organizations. This research depends on convenient access to individual-level data.4
Provisions to protect confidentiality are intended to shield the identity of the respondent but not suppress all data at the individual level. It is not necessary to adopt such extreme forms of data suppression as those found in H.R. 5215. As currently written, this bill states that agencies cannot disclose data that are in "identifiable form." The bill further defines data in "identifiable form" to mean the representation of information that permits information about a specific respondent to be reasonably inferred through either direct or indirect means. This method of protecting confidentiality precludes the disclosure of all individual-level information that respondents provide despite the use of safeguards that protect the identity of the respondents. Denying researchers access to all the individual-level data would drastically reduce the value of publicly available information and undermine the quality of important research performed in the United States.
The problem with the approach taken in H.R. 5215 arises because it does not clearly distinguish between the identity of the individual respondent and the information they provide. Protection of confidentiality requires that the identity of the individual be kept confidential. However, other information that is currently available to researchers should remain accessible. Confidentiality protections such as those in H.R. 5215 should be modified so it is clear that they protect the identity of respondents.
Data providers often refer to a tension between the protection of individual identity and the degree of information usefulness. On the one hand, government statisticians want to reassure respondents who provide data. On the other hand, they would like to fulfill legitimate requests for data by users. The tension is often depicted by statisticians in a graph where the risk of disclosure is measured on one axis and the amount of information provided is measured on the other axis.5 The graph shows a trade-off in which a lower level of disclosure risk leads to a reduction in the amount of information that can be provided. The goal is to strike a balance that provides reasonable protections for confidentiality and the greatest amount of useful data. Although helpful, graphs that plot only disclosure risk and the usefulness of data omit the role that data play in protecting our form of government.
Standard 3: Independent evaluation
Although providing valuable data is a very important standard, it is not enough for government statisticians to view data access solely in terms of the amount of data they provide to the public. In addition, the data should be sufficient so that researchers outside the government can respond effectively to government proposals - either to validate or to challenge them. To function properly, the U.S. government depends on the ability of potentially opposing interests to influence the decision-making process and thereby reach a more informed and reasoned outcome. The U.S. system of government was designed with checks and balances, and depends for its effectiveness on the free flow of information.
There is a subtle but critical difference between a standard for the quality of information that is provided and a standard that deals with the form in which it is provided. Government statisticians may supply the public with a large quantity of valuable data but this information typically comes packaged in numerical aggregations and generalized categories. If nongovernment researchers are to provide an independent evaluation of official government data, they must have access to information that is similar to that used by government statisticians. Without this access, a basic U.S. principle of open government, reflected in the U.S. Constitution and in many laws, most notably the Freedom of Information Act (FOIA), will be violated. The U.S. government was designed to be of and for the people, not to be run by an elite with the unique ability to choose how data are to be categorized, processed, and released.
A few examples may help clarify why the distinction between the amount and form of data accessibility makes a difference. I have selected two studies conducted by Heritage's data center and ask that they be included in the record.6 Although these are Heritage publications, I must point out that public policy analysts commonly produce this type of research and I could have selected from a large number of studies from individuals associated with universities and nonprofit organizations.
The first report is an analysis of the distribution of income in the United States. The authors of this study identify four weaknesses with the official measurements of income inequality used by the Census Bureau. For example, the quintiles that Census uses to divide income do not contain an equal number of people. In addition, the conventional Census figures do not take into account the effects of taxation and omit many types of cash and non-cash income. Because the underlying Census data are publicly available, Heritage analysts were able to make the adjustments they believed were appropriate to recompute the distribution of income. The revised analysis shows a more even distribution of income than that contained in official Census reports.
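The point about unequal quintiles can be illustrated with a short Python sketch. It assumes that the official quintiles divide households, not persons, into fifths. The ten households below are hypothetical, and the person-level split shown is only one simple way to form equal-population fifths; it is not presented as the method used in the Heritage report.

```python
def split_into_fifths(items):
    """Split an already sorted list into five consecutive groups of equal length."""
    n = len(items)
    return [items[i * n // 5:(i + 1) * n // 5] for i in range(5)]


def persons_in_household_quintiles(households):
    """Rank households by income, split the households into fifths,
    and count the persons that fall in each fifth."""
    ordered = sorted(households, key=lambda h: h[0])
    return [sum(size for _, size in group) for group in split_into_fifths(ordered)]


def persons_in_person_quintiles(households):
    """Assign each person the income of his or her household, rank persons,
    and split the persons into fifths."""
    persons = sorted(income for income, size in households for _ in range(size))
    return [len(group) for group in split_into_fifths(persons)]


# Hypothetical (household income, household size) pairs.
households = [
    (10_000, 1), (15_000, 1), (20_000, 1), (25_000, 1), (40_000, 2),
    (45_000, 2), (60_000, 3), (70_000, 3), (90_000, 3), (120_000, 3),
]

print(persons_in_household_quintiles(households))
print(persons_in_person_quintiles(households))
```

In this made-up example the top household quintile holds three times as many people as the bottom one, so comparing the income shares of the two groups also compares groups of very different sizes. A recomputation of this kind is possible only when the underlying household-level records are available to outside researchers.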
A second Heritage report asked what share of child poverty can be attributed to the growth of single parenthood since the 1960s. As with the previous study, analysts used data in a form similar to that available to Census statisticians. The report notes that "The March 2001 [Current Population Survey] supplement, also known as the annual demographic file, includes extensive questions on family demographic characteristics and previous year income that make it useful for social analyses, such as this one."7 Heritage analysts utilized the Census data to estimate the effects that marriage rates have on poverty. They were also able to use an expanded definition of income that counts the Earned Income Tax Credit and food stamps as part of a family's resources for determining whether the family is poor.
Examples of similar research can be found in Heritage reports on education, taxation and the Social Security system. More important, other public policy analysts who have divergent political perspectives rely on the same type of data. Although statistical agencies often state that they are committed to providing access that allows for independent evaluations, there are few regulations or laws that require them to do so. Authors and sponsors of federally funded program evaluations seem particularly reluctant to release their data sets to independent researchers.8 Requiring public access to program evaluation data encourages government evaluators to apply more rigorous methods than would otherwise be the case. If we are to have open and informed debate on public policy issues, it is vital that all researchers have access to data that permit them to challenge the government's official reports and to offer alternative perspectives.
What Congress Should Do
To implement the three statistical standards described in this testimony, Congress should:
1. Joe Cecil, Senior Research Associate at the Federal Judicial Center, notes that "Records maintained by U.S. federal agencies are governed by a web of federal statutes that are 'inconsistent at best and chaotic at worst' (Commission on Federal Paperwork, 1977). The exchange of statistical information must conform to standards that often were designed to guard against administrative abuses, standards that may be inappropriate for records used only for statistical purposes. As a result, researchers who seek information maintained by federal agencies often must recast their request for access in terms of a regulatory scheme that does little to anticipate the special characteristics of statistical data." See Joe S. Cecil, "Confidentiality Legislation and the United States Federal Statistical System," Journal of Official Statistics, Vol. 9, No. 2, 1993, p. 519.
2. For review of the adjustments that statistical agencies employ and the possible effects they may have on the usefulness of data see articles in Pat Doyle, Julia I. Lane, Jules J.M. Theeuwes, and Laura V. Zayatz, editors, Confidentiality, Disclosure, and Data Access: Theory and Practical Applications for Statistical Agencies (New York: Elsevier Science, 2001).
3. Some agencies are allowed to provide data to external researchers through data licensing or use agreements. These licenses extend the legal responsibilities for handling confidential data to the external researcher. They can be an effective means of preserving respondent confidentiality without significantly affecting the quality of research that can be performed off-site by nongovernment analysts. For a review of licensing arrangements see: Marilyn M. Seastrom, "Licensing," pp. 279-289, in Doyle, et al., editors, Confidentiality, Disclosure, and Data Access: Theory and Practical Applications for Statistical Agencies. Other alternatives, such as making researchers special sworn employees, are much less effective in providing data access. The access provided is time-consuming to obtain, costly, temporary and must be carried out at a remote site. In addition, special requirements often limit the research to those subjects that further the mission of the statistical agency.
4. This issue was considered by the members of The Panel on Confidentiality and Data Access of the Committee on National Statistics. They warn that efforts by statistical agencies to protect confidentiality could significantly reduce the value of the data. "Because of legitimate concerns about the possibility of disclosure of individual information, statistical agencies have limited the amount of detailed data provided to nongovernment users in tabulations and public-use microdata files. This lack of detail restricts the ability of users to do analyses that could contribute to the understanding of significant economic, social, and health problems." The panel recommended that "Statistical agencies should continue widespread release, with minimal restrictions on use, of microdata sets with no less detail than currently provided." See George T. Duncan, Thomas B. Jabine, and Virginia A. de Wolf, editors, Private Lives and Public Policies: Confidentiality and Accessibility of Government Statistics (Washington, D.C.: National Academy Press, 1993), p. 7.
5. See, for example, various papers in: Doyle, et al., Confidentiality, Disclosure, and Data Access: Theory and Practical Applications for Statistical Agencies.
6. See the attached reports: Robert Rector and Rea S. Hederman, "Income Inequality: How Census Data Misrepresent Income Distribution," The Heritage Foundation, Center for Data Analysis Report, September 29, 1999, and Robert Rector, Kirk A. Johnson, and Patrick F. Fagan, "The Effect of Marriage on Child Poverty," The Heritage Foundation, Center for Data Analysis Report, April 15, 2002.
7. Rector, Johnson, and Fagan, "The Effect of Marriage on Child Poverty," p. 3.
8. For example, The National Job Corps Study, funded by the Department of Labor (DOL) and authored by Mathematica Policy Research (MPR), was published in July 2001. The DOL and MPR have denied requests to release the data used for the study. In addition, the Community Oriented Policing Services (COPS) refused a FOIA request by The Heritage Foundation to release data from the National Evaluation of the Effect of COPS Grants on Crimes from 1994 to 1999.
The Heritage Foundation is a public policy, research, and educational organization operating under Section 501(c)(3). It is privately supported, and receives no funds from any government at any level, nor does it perform any government or other contract work.
The Heritage Foundation is the most broadly supported think tank in the United States. During 2001, it had more than 200,000 individual, foundation, and corporate supporters representing every state in the U.S. Its 2001 contributions came from the following sources:
Investment Income 1.60%
Publication Sales and Other 2.84%
The top five corporate givers provided The Heritage Foundation with less than 3.5% of its 2001 income. The Heritage Foundation's books are audited annually by the national accounting firm of Deloitte & Touche.
Members of The Heritage Foundation staff testify as individuals discussing their own independent research. The views expressed are their own, and do not reflect an institutional position for The Heritage Foundation or its board of trustees.
Ralph A. Rector, Ph.D. is Research Fellow and Project Manager at The Heritage Foundation's Center for Data Analysis (CDA). The CDA conducts research and publishes empirical studies on issues such as education, crime, welfare, and public finance. Rector directs CDA research and development activities, including the development of new computer software and databases. He serves on the Board of Directors of the Council of Professional Associations on Federal Statistics (COPAFS). Before joining Heritage, he worked in the Tax Policy Economics Group at Coopers & Lybrand, L.L.P., where he supervised the construction of microsimulation models used to analyze the impact of tax reform on businesses and individuals. He has managed projects involving the use of large-scale relational databases and economic models. He has also served as a tax analyst and revenue estimator at the state and federal levels. Rector holds a Ph.D. in economics from George Mason University.