June 9, 2000 | Center for Data Analysis Report on Education
Do small classes make a difference in the academic achievement of elementary school students? From the attention given this subject by politicians, it would be reasonable to assume that class size has been shown to be essential to good academic outcomes. Congress, for example, allocated $1.3 billion for the "Class Size Reduction" provision of the Elementary and Secondary Education Act (ESEA) in fiscal year 2000. The Clinton Administration has requested even more funding for FY 2001. 1 And there are proposals to pump large sums of money into efforts to increase the number of teachers in public elementary schools in order to decrease the ratio of students to teachers. 2
This report uses data from the 1998 National Assessment of Educational Progress (NAEP) reading examination to analyze the effect of class size on academic achievement. The NAEP provides the most comprehensive database on educational outcomes available to researchers. Among the major findings of this analysis of NAEP data are that:
Most Americans believe that educating children in smaller classes would improve educational outcomes. Indeed, according to an NBC News/Wall Street Journal poll taken in March 1997, some 70 percent of adults believe that reducing class size would lead to significant academic improvements in public schools. 3
But elementary and secondary school class sizes have fallen steadily over the past few decades. In 1970, public schools averaged 22.3 students per teacher nationwide. By the late 1990s, however, public schools averaged about 17 students per teacher, due to a combination of demographic trends and conscious policy decisions to lower the ratios. 4
Over the same period, however, academic achievement, as measured by the NAEP exam, stayed relatively constant. Achievement for all three grades (fourth, eighth, and twelfth) that take the NAEP tests may vary slightly from year to year, but as shown in Chart 1, the average score on the reading test has changed very little over the past 25 years. At face value, this record of "stability" may not be sufficient evidence to conclude that the decline in class size has had no influence on test scores. It does, however, illustrate the trend in academic achievement over time in America's schools. 5
The academic literature on the impact of low class size on academic achievement has been decidedly mixed. One of the most frequently cited reports on class size is Frederick Mosteller's study of young elementary school students in Tennessee. 6 Mosteller found a significant difference in achievement between the students in classes of 15 students per teacher and those in classes of 23. Recently, however, University of Rochester economist Eric Hanushek has questioned the results of this study, noting that "the bulk of evidence...points to no systematic effects of class size reductions within the relevant policy range." 7 Of the studies that do demonstrate some statistically significant gains in achievement, most generally involve substantial reductions in class size. 8 However, none of the current national policy proposals would massively shrink class size. 9 Clearly, more research is needed on this subject.
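The arithmetic behind footnote 9 (how little 100,000 additional teachers would move the national student-teacher ratio) can be checked with a short sketch. The student and teacher counts are the rounded figures quoted in that footnote; the variable names are illustrative. Because the inputs are rounded, the printed values may differ in the last digit from the footnote's 16.8 and 3.5 percent, which were presumably computed from unrounded counts.

```python
# Rounded figures quoted in footnote 9 (Digest of Education Statistics, 1997).
students = 46.8e6        # public school students
teachers = 2.8e6         # public school teachers
new_teachers = 100_000   # proposed additional teachers

ratio_before = students / teachers
ratio_after = students / (teachers + new_teachers)

# Percent decline in the student-teacher ratio after the new hires.
pct_drop = (ratio_before - ratio_after) / ratio_before * 100

print(f"before: {ratio_before:.1f} students per teacher")
print(f"after:  {ratio_after:.1f} students per teacher")
print(f"drop:   {pct_drop:.1f} percent")
```

Note that the percent drop depends only on the teacher counts, since the number of students cancels out of the calculation.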
How to Interpret These Findings
This report contains the results of statistical tests that use NAEP data to explain differences in reading test scores. These statistical tests isolate the independent effects of a number of factors on reading scores (such as the education of parents) in order to determine whether class size matters to these test scores. The statistical tests (or correlations) cover data on a wide array of school children, as defined by their race, income, and other socioeconomic characteristics. Because the statistical model used here includes these socioeconomic characteristics, the reader can interpret these findings as applicable to each of these groups of students. Thus, the findings about class size and reading scores apply as much to upper-income as to lower-income students, to blacks as to whites, to girls as to boys, and so forth.
These correlations suggest that there is a statistical relationship between each factor and achievement in reading, but they do not establish that these independent factors cause differences in academic achievement.
The variables in the model came from the NAEP database and do not include everything that might have an effect on academic achievement, such as the methods used to teach reading. These factors may be much more important in general, or for a particular child, than the factors recorded in the NAEP data. Moreover:
The author used the 1998 NAEP reading database to measure the influence of class size on academic achievement. The NAEP, first administered in 1969, is an examination that measures academic achievement in a variety of fields, such as reading, writing, mathematics, science, geography, civics, and the arts. Currently, the NAEP is administered to fourth, eighth, and twelfth grade students, with the main tests in math and reading given alternately every two years. For example, reading was tested in 1998; math was assessed in 1996 and 2000.
The NAEP is actually two tests, a nationally administered test and state-administered tests. Over 40 states participate in the separate state samples used to gauge achievement within those individual jurisdictions. For the purposes of this study, only 1998 national reading data were used.
The most significant benefit of using the NAEP data is that, in addition to test scores in the subject area, it includes an assortment of background information for the students taking the exam, their main subject-area teacher, and their school administrator. Responses from the teachers and school administrators are linked to the student's information, which yields a rich database of information. The background questions include:
By incorporating this background information into their assessment of the NAEP data, researchers can gain considerable insight into the factors that explain the differences found in NAEP scores among children.
This analysis looked at academic achievement by analyzing six factors: class size, race and ethnicity, parents' educational attainment, number of reading materials in the home, free or reduced-price lunch participation, and gender. Using regression analysis, Heritage analysts can isolate the effect of each factor. The Heritage analysis uses a jackknifed ordinary least squares model 10 and looks at the effects of these factors on the NAEP 1998 nationwide sample of public school children. 11
Frederick Mosteller explains why small classes boost achievement: "Having fewer children in class reduces the distractions in the room and gives the teacher more time to devote to each child." 12 The average time a teacher can spend with each child, then, appears to be important in the learning process. To address class size, this analysis studies the NAEP data in two different ways (statistical models). The first compares the academic outcomes of children in the smallest classes (20 or fewer students per teacher) with those of all other students. The second only compares the children in these small classes with those in large classes (at least 31 students per teacher).
Race and Ethnicity
Many studies and reports have demonstrated that over time, African-American and Latino students tend to perform more poorly on standardized tests than do white students (although the gap has generally narrowed over the past 25 years). 13 There are a number of potential explanations for this trend. 14 Because strong differences exist in academic achievement among the races, the variables of race and ethnicity are included in the analysis.
Many researchers have noted that the educational attainment of a child's parents is a good predictor of the child's academic achievement. Parents who are college educated, for instance, could be better equipped to help their children with homework and with understanding concepts than are those who have less than a high school education, all other things being equal. Because the education level of one parent is often highly correlated with the other's, only a single variable is included in the analysis.
Number of Reading Materials in the Home
The presence of books, magazines, encyclopedias, and newspapers generally indicates a dedication to learning in the household. Researchers have determined that these reading materials are important aspects of the home environment. 15 The analysis thus includes a variable controlling for the number of these four types of reading materials found at home.
Income is often a key predictor of academic achievement because low-income families seldom have the resources to purchase extra study materials or tutorial classes that may help their children perform better in school. While the NAEP does not collect data on household income, it does collect data on participation in the school free and reduced-price lunch program that are used here. 16
Empirical research has suggested that girls tend to perform better in reading and writing, while boys perform better in the more analytical subjects of math and science. 17 Many authors have expounded on this idea, 18 yet the data on male-female achievement gaps are often inconsistent. In 1998, for example, young men scored higher than young women on both the verbal and quantitative sections of the Scholastic Assessment Test (SAT). Some writers have noted that this may be because of a fundamental bias against females in the educational system. 19 Another explanation, however, is that the test results reflect a selection bias in which proportionally more "at-risk" females than males opt to take the SAT. 20 To account for this factor, the analysis includes a variable for gender.
These six factors formed the basis of two statistical models 22 that were applied to the NAEP's 1998 nationwide sample of public school children who took the reading test. 23 As noted above, the first model compares the data for children in small classes (20 or fewer students per teacher) to all other students. The second model compares data only for students reported to be in either small or large classes (classes with 31 or more students per teacher). By determining whether or not an achievement difference exists between the smallest and largest classes in America, the second model addresses the contention that there may be differences in achievement as the class size gap widens.
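The difference between the two models comes down to how the class size variable is coded and which students remain in the sample. The following is a minimal sketch of that coding, using the thresholds stated in the text (20 or fewer for small, 31 or more for large). The function name and the use of None to mark an excluded student are illustrative assumptions, not the report's actual variable definitions.

```python
def small_class_indicator(class_size, model):
    """Code the class size variable for one student.

    Returns 1 for a small class, 0 for the comparison group, and
    None when the student is left out of the model's sample.
    """
    if model not in (1, 2):
        raise ValueError("model must be 1 or 2")
    if class_size <= 20:
        return 1                      # small class in both models
    if model == 1:
        return 0                      # model 1: compared with all other students
    # model 2: only large classes (31 or more) serve as the comparison group
    return 0 if class_size >= 31 else None


for size in (18, 25, 33):
    print(size, small_class_indicator(size, 1), small_class_indicator(size, 2))
```

Under this coding, a student in a class of 25 counts toward the comparison group in the first model but drops out of the second model entirely.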
Chart 2 and Chart 3 show the percent change in fourth and eighth grade reading scores attributable to the factors in the first model, compared with a base case, while Chart 4 and Chart 5 show the percent change in the second model. 24 Here, the base case is defined as a child with the following characteristics:
The estimates of the base case are reported in Table 1 for both models. These are the scores that a hypothetical base case individual would earn out of a maximum possible NAEP score of 500. Chart 2 through Chart 5 show the positive or negative percent changes for each variable, holding constant all other variables in the model.
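One plausible way to read the percent changes in the charts is as each variable's regression coefficient divided by the base case score. The sketch below assumes that definition, which the text does not spell out. The numbers are hypothetical placeholders chosen for easy arithmetic; they are not values from Table 1 or the charts.

```python
def percent_change(coefficient, base_case_score):
    """Percent change in the predicted score relative to the base case."""
    return coefficient / base_case_score * 100.0


# Hypothetical values for illustration only; not taken from the report.
hypothetical_base_score = 200.0   # predicted score for the base case child
hypothetical_coefficient = 5.0    # score points associated with one factor

change = percent_change(hypothetical_coefficient, hypothetical_base_score)
print(f"{change:+.1f} percent relative to the base case")
```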
In the first model, the analysis of the data on children in all class sizes shows no significant difference in reading test scores attributable to class size, holding all other variables constant. 25 As seen in Chart 2, NAEP scores of fourth grade children whose parents attended some college are 2.2 percent higher than scores for children whose parents have a high school education or less. Most important, moving from a class size above 20 down to 20 or fewer reduces NAEP scores by 0.8 percent, but this effect is statistically indistinguishable from no influence. While it may seem logical that lower class sizes would have a positive influence on achievement, the NAEP data do not support that conclusion. The second model, comparing children in small classes to those in large classes, reaches a similar conclusion. Again, class size does not have a meaningful impact on academic achievement.
For eighth graders, the class size variable is significant when comparing children in small and large classes. The results of the comparison are counterintuitive since the coefficient has a negative sign. Holding other variables constant, this means that eighth grade children in small classes do worse on the NAEP reading exam than do those in large classes. The magnitude of the effect is not trivial; in this model, a child in a small class would score 1.7 percent lower than the base case child. The variable is barely significant statistically, 26 however, and should be treated with suspicion.
Both fourth and eighth grade girls score slightly higher than do boys on the reading exam, which bolsters recent evidence on gender differences in academic achievement. Girls on average, notes American Enterprise Institute W. H. Brady Fellow Christina Hoff Sommers, "get better grades, are more engaged academically, and are now in the majority in higher education." 27 The results here support the contention that schools are not shortchanging girls. 28
Class size has little or no effect on academic achievement, according to this analysis of 1998 NAEP data. It is quite likely, in fact, that class size as a variable pales in comparison with the effects of many factors not included in the NAEP data, such as teacher quality and teaching methods. Observes Irwin Kurz, principal of the highly successful P.S. 161, a public school in Brooklyn, New York, that serves poor children and has an average class size of 35, it is "[b]etter to have one good teacher, than two crummy teachers any day." 29
Kirk A. Johnson, Ph.D., is a Policy Analyst in the Center for Data Analysis at The Heritage Foundation.
Table 2 and Table 3 report the results of an analysis of NAEP data using two statistical models. Table 2 shows the coefficients and significance tests for the first model, which compares data for all public school children in the analysis, while Table 3 reports the results for the model that compares only small classes (20 or fewer students per teacher) to large classes (31 or more students per teacher). As shown in these tables, most variables are statistically significant. 30 Contrary to conventional wisdom, the class size variable either is not significant or has the wrong sign on the coefficient.
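The significance tests in these tables rest on comparing each coefficient with its standard error. The sketch below shows the general idea under a simplifying assumption: it uses a large-sample normal approximation for the two-sided p-value rather than the exact reference distribution the report's software would have used. The coefficient and standard error are hypothetical.

```python
import math


def two_sided_p_value(coefficient, standard_error):
    """Return the test statistic and an approximate two-sided p-value.

    Assumption: the statistic is compared with a standard normal
    distribution, which is only a large-sample approximation.
    """
    t_stat = coefficient / standard_error
    # Standard normal CDF evaluated at |t| via the error function.
    cdf = 0.5 * (1.0 + math.erf(abs(t_stat) / math.sqrt(2.0)))
    return t_stat, 2.0 * (1.0 - cdf)


# Hypothetical coefficient and standard error, for illustration only.
t_stat, p_value = two_sided_p_value(-3.0, 1.5)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
print("significant at the 5 percent level:", p_value < 0.05)
```

A coefficient whose p-value sits just under 0.05, as in this made-up case, is the kind of borderline result footnote 26 cautions about.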
In this analysis, there are two statistical issues to consider. First, the NAEP exam is a long test and therefore is not administered in its entirety to all children. Rather, different parts are given to different children. Certain students will do better on certain portions of the test than others. Consequently, a "true" score must be estimated, or imputed, from the incomplete information. The NAEP estimates five plausible composite reading scores and recommends that researchers use all five in any analysis. The Heritage analysis described here follows the guidelines specified by the Educational Testing Service (which works closely with the National Center for Education Statistics in developing the data file) for incorporating all five reading scores into the analysis. 31
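The procedure for handling the five plausible scores (fit the same model once per plausible value, then average the coefficients, as footnote 31 describes) can be sketched as follows. This is a deliberately simplified illustration: it uses a single predictor, ignores sampling weights, and runs on made-up scores. The function names are hypothetical and the report's actual models contain several variables.

```python
from statistics import mean


def ols_slope(x, y):
    """Slope from a one-predictor ordinary least squares regression."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx


def average_over_plausible_values(x, plausible_values):
    """Fit the model once per plausible value and average the slopes."""
    return mean(ols_slope(x, pv) for pv in plausible_values)


# Made-up data: 1 = small class, 0 = other; five plausible scores per child.
small_class = [1, 1, 0, 0]
plausible_scores = [
    [210, 214, 216, 220],
    [212, 212, 215, 217],
    [208, 216, 214, 220],
    [211, 213, 217, 217],
    [209, 215, 215, 219],
]

print(average_over_plausible_values(small_class, plausible_scores))
```

With a single 0/1 predictor, each slope is simply the difference in mean scores between the two groups, which makes the averaging step easy to verify by hand.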
Second, the NAEP utilizes a complex sample design, oversampling children with certain characteristics. 32 Each child, then, is given a unique weight, which is calculated from the probability of being selected from the population at large (in this case, from the U.S. population of fourth or eighth graders in public schools). The NAEP's sample design requires a complex modeling technique, which the Heritage model employs. 33
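The role of the sampling weights and replicate weights can be illustrated with a weighted mean in place of a full regression. The sketch assumes one common form of replicate-weight jackknife, in which the variance is the sum of squared differences between each replicate estimate and the full-sample estimate; the report does not state the exact formula, so treat that as an assumption. The data and the two replicate weight sets are made up, whereas the NAEP file supplies 62. The model count at the end reflects one reading consistent with the 315 figure in footnote 33: one full-sample fit plus 62 replicate fits for each of the five plausible values.

```python
def weighted_mean(values, weights):
    """Estimate a mean using sampling weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def jackknife_variance(values, full_weights, replicate_weights):
    """Replicate-weight jackknife variance of the weighted mean.

    Assumed form: sum of squared deviations of each replicate
    estimate from the full-sample estimate.
    """
    full_estimate = weighted_mean(values, full_weights)
    return sum(
        (weighted_mean(values, rw) - full_estimate) ** 2
        for rw in replicate_weights
    )


# Made-up scores and weights, for illustration only.
scores = [200.0, 220.0]
full_weights = [1.0, 1.0]
replicates = [[2.0, 0.0], [0.0, 2.0]]  # NAEP provides 62 such sets
variance = jackknife_variance(scores, full_weights, replicates)
print("jackknife variance:", variance)

# Model count under the assumed reading of footnote 33.
N_REPLICATES = 62
N_PLAUSIBLE_VALUES = 5
models_fitted = (N_REPLICATES + 1) * N_PLAUSIBLE_VALUES
print("models fitted:", models_fitted)
```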
2. The White House, "President Clinton Highlights Education Reform Agenda with Roundtable on What Works," press release, May 4, 2000, at http://www.pub.whitehouse.gov/urires/I2R?urn:pdi://oma.eop.gov.us/2000/5/5/8.text.1. See also The White House, "Remarks by the President in Roundtable on Reforming America's Schools," May 4, 2000, at http://www.pub.whitehouse.gov/uri-res/I2R?urn:pdi://oma.eop.gov.us/2000/5/5/14.text.1.
3. Hart and Teeter Research Companies, NBC News/Wall Street Journal Poll, March 1997; see Question 108, as cited in Eugene M. Lewit and Linda Schuurmann Baker, "Class Size," The Future of Children, Vol. 7 (1997), pp. 112-121.
4. U.S. Bureau of the Census, Statistical Abstract of the United States 1998 (Washington, D.C.: U.S. Government Printing Office, 1998), using data from National Center for Education Statistics, Digest of Education Statistics (1974 to 1998), published 1998.
5. University of Rochester economist Eric Hanushek does argue that if large class sizes are a problem today, they must have been a more serious problem in the past. See Eric Hanushek, "The Evidence on Class Size," in Susan E. Mayer and Paul Peterson, eds., Earning and Learning: How Schools Matter (Washington, D.C.: Brookings Institution, 1999), pp. 131-168.
7. Eric Hanushek, "Some Findings from an Independent Investigation of the Tennessee STAR Experiment and from Other Investigations of Class Size Effects," Educational Evaluation & Policy Analysis, Vol. 21 (1999), p. 144.
9. One of President Clinton's social policy objectives is the funding of 100,000 new teachers; these 100,000 new teachers will not significantly change the nationwide student-teacher ratio. According to the Digest of Education Statistics, there were some 46.8 million public school students and 2.8 million teachers in 1997, rendering a student-teacher ratio of 16.8 to 1. If these 100,000 teachers were hired tomorrow, it would only cause the national student-teacher ratio to drop by 3.5 percent. Mosteller's research (absent the Hanushek critique) suggests that class sizes would have to drop by one-third before significant gains in academic achievement would be found.
10. Ordinary least squares is a general statistical regression technique that is often used by researchers. See Michael Lewis-Beck, Applied Regression: An Introduction (Beverly Hills, Cal.: Sage Publications, 1980), from Sage Publications' Quantitative Applications in the Social Sciences, Series No. 07-022. A jackknife is a complex resampling technique that is designed to accurately estimate statistical significance from surveys such as the NAEP that employ a complex sampling methodology. See Appendix A for the results and more information on the jackknifed ordinary least squares model.
13. For an analysis of the long-term achievement gap, see U.S. Department of Education, Report in Brief: NAEP 1996 Trends in Academic Progress (Washington, D.C.: U.S. Government Printing Office, 1997), Figure 2, p. 14.
18. For a brief discussion of this point of view, see Thomas Hancock et al., "Gender and Developmental Differences in the Academic Study Behaviors of Elementary School Children," Journal of Experimental Education, Vol. 65 (1996), pp. 18-39.
21. See, for example, Kirk A. Johnson, "Comparing Math Scores of Black Students in D.C.'s Public and Catholic Schools," Heritage Foundation Center for Data Analysis Report No. CDA99-08, October 7, 1999.
26. In technical terms, the t-test on the class size coefficient has a significance level that comes close to .05, or 5 percent. In light of the other results, and since one would expect those in smaller classes to perform better, researchers might question this result; however, that judgment is left to the reader.
28. Recent publications continue to advance the argument that schools, through accident or design, limit the success of girls. See, for example, American Association of University Women, ed., Gender Gaps: Where Schools Still Fail Our Children (New York: Marlowe & Co., 1998).
29. See Samuel Casey Carter, No Excuses: Lessons from 21 High-Performing, High-Poverty Schools (Washington, D.C.: The Heritage Foundation, 2000), pp. 74-77.
30. A variable that is not statistically significant shows no statistically discernible difference between its coefficient value and zero, so no effect can be inferred. Statistical significance is usually pegged at a 5 percent or 10 percent level. See Lewis-Beck, Applied Regression: An Introduction.
31. From a multivariate regression perspective, the model below must be replicated five times using each of the plausible values individually and then averaging the resulting coefficients to yield the final model results. In technical terms, this process corrects for measurement error in the reading score variable, since the test administrators do not actually observe the test score from taking the exam in its entirety.
33. A procedure called a jackknife must be employed to correctly assess the variance of each variable's coefficient, and the NAEP database has a series of 62 "replicate weights" to aid in this task. These 62 jackknife replications must be applied and the variances of each coefficient averaged for each of the five plausible test score models above (yielding a total of 315 models compiled for the purpose of this research). The WesVar Complex Samples software (produced by SPSS, Inc.) did much of this replication work. Using the jackknife results with the five plausible test score models allows for a variance correction mechanism. The purpose of the jackknife is to estimate a true sampling error. Correcting for the two types of error (measurement and sampling) allows for the most accurate estimates possible. See Bradley Efron, The Jackknife, the Bootstrap, and Other Resampling Plans (Philadelphia: Society for Industrial and Applied Mathematics, 1982), and Jun Shao and Dongsheng Tu, The Jackknife and Bootstrap (New York: Springer-Verlag, 1995), for a more complete discussion of how this jackknife technique works.