In 1998, University of Washington psychologist Anthony Greenwald and his colleagues developed a test that purports to uncover unconscious racism. Supposedly tapping into the unconscious, the Implicit Association Test (IAT) measures disparities in millisecond response times on a computer. Based on this, Greenwald and others claim that three out of four Americans suffer from unconscious racism.
Over the course of the past 20 years, the test has received significant media coverage in the Washington Post, New York Times, NPR, CNN, and PBS. By 2013, Greenwald and Harvard psychologist Mahzarin Banaji claimed that the “automatic White preferences expressed on the Race IAT is now established as a signaling discriminatory behavior.”
But there are many scientific critics of this test, and it is far from settled science. A growing body of research suggests that the test cannot predict real-world behavior.
To start, it is not clear that there are significant and reliable differences in response time, as has been asserted. When individuals take the IAT more than once, the results from the first, second, and subsequent sittings often show very low correlations. Perhaps this is to be expected from a test measuring differences in milliseconds: One-tenth of a second can lead to highly charged accusations of racism.
Next, the difference in milliseconds can be explained by a wide variety of factors other than unconscious bias. Rather than unconscious racism, the test could measure the test taker’s familiarity with pairings of words and pictures. Scientists who substituted familiar versus nonsense words for white versus black photos or names produced the same effect as the race IAT. Some behavioral scientists suggest the race IAT measures a “figure/ground” effect, where white faces and names are the familiar and fall into the background, while black faces and names are more distinctive, thus becoming more prominent.
Some critics note that the IAT does not distinguish between cultural stereotyping, knowledge of these stereotypes, and prejudice. In a similar vein, the IAT could measure knowledge of racial disparities, which in turn could generate anger, disapproval, or dismay—not necessarily endorsement or prejudice. In some test takers, the IAT could tap into a fear of being called a racist instead of being an unconscious racist.
There are many other factors that bias the test results, including knowing the purpose of the test, faking the test results, repeatedly taking the test, being in the presence of African Americans, cognitive quickness and flexibility, physical speed, and manual dexterity.
Other social scientists have raised serious concerns about the test’s low predictive power. The test has not been shown to significantly predict discriminatory behavior. Test results are not closely related to any other measures of discrimination: Correlations are modest at best. Even a meta-analysis by its inventors found this to be the case.
Not surprisingly, the proportion of false positives may be substantial. Estimates of false positives range from 60 percent to 90 percent.
This high probability of error has led its original proponents to conclude that it should be used with caution: “Taken together, there is substantial risk for both falsely identifying people as eventual discriminators and failing to identify people who will discriminate.” In 2015, Greenwald, Banaji, and Brian Nosek, a University of Virginia psychologist, concluded that the IAT “risk[s] undesirably high rates of false classification.”
The claimed “proof” of unconscious but widespread racism can and will be used to justify any number of dubious policies. If the Implicit Association Test is used to support claims that decision makers in hiring and university admissions, housing, bank loans, and government contracting, among others, are unconsciously biased, then proponents will argue that this justifies the use of racial preferences (and even goals and quotas) to counterbalance this purported prejudice. Conversely, where such “affirmative action” is not used, or where there is any sort of racial disparity, these implicit-racism studies can be used to challenge selection decisions as discriminatory in lawsuits. These studies could be used as evidence of discrimination by law enforcement and to require minority “representation” among judges and on juries. “Unconscious bias” by teachers could be used to challenge their grading and discipline. The possibilities are endless.
Since the 1950s, public opinion on race has shown a decline in racial prejudice over time, with a momentous shift in white public opinion toward the principle of racial equality. Yet often, racial disparities in outcomes persist in income, education, home ownership, hiring, promotion, arrests and convictions, business ownership and contracting, and social mobility generally. This has led some social scientists, media commentators, and government officials to argue that there is widespread racism in our country, but it is unconscious.
Central to this movement has been an innovation in psychology that has garnered a great deal of publicity recently. Anthony Greenwald and his colleagues developed the Implicit Association Test and designed a series of experiments that purports to uncover the racism that still exists but in unconscious form. According to Greenwald and Banaji, 75 percent of Americans who take the IAT are found to be unconscious racists.
The IAT is an association test based on millisecond reaction time. It measures the speed with which a subject associates pleasant or unpleasant words such as “joy,” “crime,” or “work” with categories, for example, “black” or “white,” “male” or “female.” To start, Greenwald and his colleagues use the IAT to assess implicit attitudes toward socially neutral categories, such as flower versus insect, and pair these pictures with pleasant versus unpleasant words.
How the test works: Researchers instruct test takers to first hit the “positive” key when a flower appears on the computer screen and hit the “negative” key with insects, and to hit the “positive” key when pleasant words appear and hit the “negative” key with unpleasant words.
Researchers then switch the flowers/insects categories and instruct test takers to create “incompatible” pairings. Subjects are instructed to select the “positive” key when insects or pleasant words appear but select the “negative” key when flowers or unpleasant words appear.
The IAT found stronger associations, as measured by reaction speed in milliseconds, between combinations that were compatible versus those that were not. Pictures of flowers (e.g., a rose or tulip) combined with pleasant words and pictures of insects (e.g., a wasp or horsefly) combined with unpleasant words produced faster reaction times than the incompatible pairing of flowers and unpleasant words or insects and pleasant words.
From this assessment of socially neutral categories, Greenwald and his colleagues moved on to race. With this schema, they instructed test takers to hit the “positive” and “negative” keys, creating patterns of associations as follows: For the “compatible” set of pairings, test takers were instructed to pick the “positive” key when white pictures and pleasant words appeared, and to hit the “negative” key when black pictures and unpleasant words appeared. For the “incompatible” set of pairings, test takers were told to hit the “positive” key when black pictures and pleasant words appeared and to hit the “negative” key when white pictures and unpleasant words were flashed on the screen.
These combinations produced differential response times. On average, the “compatible” pairings generated faster reaction times than the “incompatible” combinations. This difference in millisecond reaction time is what led researchers to posit the existence of unconscious racism: Test takers responded faster when white was paired with pleasant words and black with unpleasant words, and more slowly when the pairings were reversed.
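The comparison of response times can be made concrete with a minimal sketch. It assumes a hypothetical list of per-trial latencies, in milliseconds, for a single test taker and computes a plain difference of block means. The function name, the sample numbers, and the use of a raw mean difference are illustrative assumptions, not the scoring procedure used by Greenwald and his colleagues.

```python
from statistics import mean


def iat_effect_ms(compatible_ms, incompatible_ms):
    """Raw effect: mean 'incompatible' latency minus mean 'compatible' latency."""
    return mean(incompatible_ms) - mean(compatible_ms)


# Hypothetical latencies (ms) for one test taker; these are not real data.
compatible = [650, 700, 720, 680]
incompatible = [800, 850, 790, 860]

effect = iat_effect_ms(compatible, incompatible)
print(f"Raw IAT effect: {effect:.1f} ms")  # prints "Raw IAT effect: 137.5 ms"
```

A positive value means the “incompatible” block was slower on average. In this made-up example the whole effect is roughly one-seventh of a second, which illustrates how small the measured quantity is.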
After several years of IAT research, Greenwald, Banaji, and Nosek founded Project Implicit, a website for IAT researchers, consultants, and organizations interested in using the test and for individuals who want to take the test. Millions have accessed the test online.
Major media outlets such as the Washington Post, New York Times, and CNN have profiled the IAT, with such eye-catching titles as, “Across America, Whites are Biased and Don’t Even Know It” and “What? Me, Biased?” It was the focus of a popular book by Banaji and Greenwald, featured prominently in Malcolm Gladwell’s 2005 bestseller, Blink, and in a 2015 film on PBS described thus: “American Denial sheds light on the unconscious political and moral world of modern Americans,” including “research footage, websites, and YouTube films showing psychological testing of racial attitudes.”
The concept of implicit or unconscious bias and the use of the IAT to root it out have worked their way into public policy and our legal system. There have been suggestions to incorporate IAT technology into judicial nominations and jury selection. One author proposes looking for implicit bias in legislative action, advocating the use of IAT to “‘smoke out’ illegitimate purposes” and hidden racists among legislators, showing that race-neutral classifications, for example, tap into unconscious race bias. In a 2012 class action suit against the State of Iowa, African American state employees claimed class-wide bias in hiring and promotion based on disparate impact statistics and implicit racial bias. Expert witness testimony on unconscious racism was central to their claims. The case became the first site for dueling experts on the scientific status of implicit racism. Anthony Greenwald was the expert witness for the plaintiffs, and Philip Tetlock, a psychologist at the University of Pennsylvania, was the expert for the state of Iowa. Ultimately, the judge rejected the implicit bias theory and ruled for the State of Iowa. The state supreme court unanimously upheld the trial judge’s ruling.
In the field of criminology, the U.S. Department of Justice has had a program of implicit bias and community policing since 2009. In light of recent events concerning race and police behavior, police departments around the country have held conferences, training sessions, and exercises to deal with the issue of unconscious racial bias among law enforcement. There is little scientific evaluation with proper design, sampling, comparison groups, controls, and statistical analysis showing that these programs work. Short-term effects have been shown, but results seem to dissipate over time, and, according to one critic, the training may actually make unconscious bias worse. In addition, such training could endanger police officers by causing them to misread real threats and significantly delay reactions for fear of unconscious racism.
The IAT could also be used to analyze how college and university admission committees evaluate applicants, how faculty and teaching assistants grade students, and how faculty hire and promote their own, and to assess who studies, teaches, and practices law. Medical schools are actively moving in that direction. Prompted by the Association of American Medical Colleges’ concern with diversity and unconscious bias, medical schools such as Stanford, Ohio State, and Johns Hopkins encourage faculty and students to take the IAT, declaring the test to be both reliable and valid and ignoring the controversy surrounding it in psychology and related social sciences. Duke University has gone one step further and incorporated it into a second-year medical school course on unconscious bias.
In short, the search for unconscious racism has the potential for widespread educational, media, and judicial “bias training.” Moreover, UCLA Law School professor Jerry Kang and Mahzarin Banaji advocate for permanent affirmative action. Given the extent of unconscious racism, they argue, affirmative action should be disbanded only when unconscious racism disappears nationwide: “Fair measures that are race- or gender-conscious will become presumptively unnecessary when the nation’s implicit bias against those social categories goes to zero or its negligible behavioral equivalent.”
In the psychological and social sciences, however, there is consensus on neither unconscious racism nor the IAT. Many of the controversies focus on technical issues—measurement, validity, and reliability, to name a few. But it is precisely this technical debate that makes the study of unconscious bias and the IAT far from settled science. The strategy of its proponents is to ignore the critics or accuse them of being narrow-minded. Banaji claims that the IAT is to psychology what Galileo’s telescope was to the Copernican Revolution, also drawing an analogy of IAT research to the Copernican and Darwinian scientific revolutions. Banaji acknowledges that this would-be scientific revolution “is going to be the hardest [to accept] of all.”
The IAT findings are threatening, for the studies move us away from the familiar and comfortable. The findings undercut how we see ourselves as thoughtful beings with the free will to be moral and good. Banaji explains:
[It] will challenge our beliefs about the very nature of our own minds…. [I]t is not merely about the place of our planet amongst other planets, [sic] it’s not merely about our place in the larger set of other species, [sic] it’s about the core issue of our competence, [sic] it’s about our goodness, our ability to be moral, and to have control over our thoughts and feelings, about the most important object in our universe, other humans.
But the tide is turning for the IAT. As recently as 2012, other scholars, including University of Virginia law professors Allan King and Gregory Mitchell, pointed out that social science findings related to unconscious racism and the IAT are “contested research.… This research is the subject of vigorous debate within psychology.… [E]xperts citing IAT research often mischaracterize the findings from this body of work and omit important limitations on the research” (emphasis added).
Before the IAT becomes entrenched in public policy and the law, its proponents should address questions about the reliability and validity of the test. The test should be shown to predict other behaviors, and there should be a broader discussion of the social and political implications of this research.
Flaw Number One: The IAT Is Unreliable
One serious criticism of the IAT has to do with its unreliability. “Reliability” refers to the consistency of a measure—that is, the extent to which repeated applications of a measuring instrument result in roughly the same outcomes.
No measuring instrument is perfectly reliable (i.e., guaranteeing absolutely identical results time after time), but some measures are better than others. Established measuring instruments of physical traits, such as a ruler for height or a thermometer for temperature, are generally less prone to reliability issues than instruments in the social sciences.
The American Educational Research Association, the American Psychological Association, and the National Council on Measurement in Education have jointly published standards regarding testing. The associations state that the reliability of a test centers on the notion that an individual’s performance is somewhat the same from one test-taking time to another. Estimating reliability should involve calculating a correlation between a test and its retest for many test takers. The test/retest correlation should be large (e.g., over 0.70) if the test is reliable.
The professional associations recognize that an individual’s scores on the same test may vary from one time to another. In the aggregate, group scores reflect some degree of measurement error—the degree to which the scores vary from the true score. In the view of these associations, however, “if a test score leads to a decision that is not easily reversed, such as rejection or admission of a candidate to a professional school or the decision by a jury that a serious injury was sustained, the need for a high degree of precision is much greater” (emphasis added). In other words, the test/retest reliability should yield in the aggregate a coefficient of 0.90 or higher.
Because IAT proponents argue that the IAT taps into racism on the unconscious level, and since racism is such a highly charged accusation, the IAT should be subjected to a high degree of precision. But it is not. As Texas A&M and Florida International University psychologists Hart Blanton and James Jaccard, respectively, observe, the IAT has serious problems of test/retest reliability. The IAT measures reaction time to specific stimuli in milliseconds. Using a micro-measure of reaction time with regard to differences in unconscious racial attitudes is relatively new, according to Blanton and Jaccard, and consequently has significant problems associated with it: “[A] tenth of a second can have a consequential effect on a person’s score, and such measurement sensitivity can lead to test unreliability.”
According to Blanton and Jaccard, the conventionally acceptable correlation for test/retest reliability is a correlation coefficient of 0.70, rising to 0.90 when the test is used for individual assessment. They find that Greenwald’s test/retest reliability is 0.56, while another group of researchers found that a test plus three retests over a two-week period caused correlation coefficients to plummet to 0.27. Clearly, the reliability of the IAT is problematic, and the test itself has not changed since the 1990s.
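To illustrate how a test/retest coefficient is obtained and compared with the 0.70 and 0.90 benchmarks, here is a minimal sketch. The paired scores are hypothetical, and the helper names `pearson_r` and `meets_threshold` are assumptions introduced for illustration. The only figures taken from the text are the two benchmarks and the reported coefficients of 0.56 and 0.27.

```python
from math import sqrt


def pearson_r(first, second):
    """Pearson correlation between paired scores from two sittings."""
    n = len(first)
    mean_1 = sum(first) / n
    mean_2 = sum(second) / n
    cov = sum((a - mean_1) * (b - mean_2) for a, b in zip(first, second))
    spread_1 = sqrt(sum((a - mean_1) ** 2 for a in first))
    spread_2 = sqrt(sum((b - mean_2) ** 2 for b in second))
    return cov / (spread_1 * spread_2)


def meets_threshold(r, individual_assessment=False):
    """Apply the 0.70 benchmark, or 0.90 when scores are used to judge individuals."""
    threshold = 0.90 if individual_assessment else 0.70
    return r >= threshold


# Hypothetical scores for five people at test and retest; not real data.
test_scores = [0.2, 0.5, 0.1, 0.8, 0.4]
retest_scores = [0.4, 0.3, 0.2, 0.5, 0.6]
r = pearson_r(test_scores, retest_scores)
print(f"Hypothetical test/retest r = {r:.2f}")

# Coefficients reported in the text, checked against both benchmarks.
for reported in (0.56, 0.27):
    print(reported, meets_threshold(reported), meets_threshold(reported, True))
```

Both reported coefficients fall below the general benchmark of 0.70, and therefore also below the stricter 0.90 standard for individual assessment.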
Flaw Number Two: Validity—What Does the IAT Actually Measure?
Aside from unreliability, the IAT and the concept of unconscious racism have other problems. Assume, for the sake of argument, that over time improvements have led to significant IAT test/retest reliability. Reliability is still not the same as validity. Something can be “reliable” in the technical sense of yielding similar results over time yet still not be a valid measure. Validity is a fundamental concern in science: To what extent does the object we study in fact represent the object we want to study? Do our empirical comparisons truly reflect our theoretical concepts? Are we measuring what we think we are measuring?
The number on the oven thermometer is a valid measure of the hotness of the oven, and the number on a pH-scale is a valid measure of the acidity–alkalinity of the soil. Astrological signs, however, are not valid measures of individuals’ personality traits.
Proponents of the IAT brag that they have millions of scores generated from their website, Project Implicit. The number of individuals taking the IAT does not address the issue of a flawed test and flawed results. There are several types of validity, and psychologists do not agree on the categories, but critics broadly agree that the IAT’s validity is in question.
The first concern centers on construct validity, which deals with whether the measure used in fact measures what it claims to measure. That is, is the IAT a valid measure of the concept, of “unconscious racism”? IAT proponents claim that it is. In order to show that it is a valid measure of the concept, alternative explanations of the differential reaction time when faced with “white” cues or “black” cues must be ruled out.
The key question of construct validity is whether the IAT scores measure unconscious racism or something else. While alternative explanations regarding the meaning of IAT scores are either not considered at all by IAT proponents or are casually dismissed by them, published research shows that, in fact, other social-psychological processes can explain IAT scores. There are several factors that contribute to the results, including:
- Comparing familiar versus unfamiliar words, pictures, and associations;
- Knowledge of stereotypes, instead of prejudice or cultural stereotyping;
- Knowledge of racial disparities and sympathy toward African Americans for this reason;
- The fear of being labeled racist; and
- Physiological and physical factors, e.g., intelligence, physical speed, and manual dexterity.
Familiar Versus Unfamiliar Words, Pictures, and Associations
Raising the criticism of construct validity, Miguel Brendl, Arthur Markman, and Claude Messner designed IAT experiments that suggest alternative explanations to unconscious racism. While Greenwald and his colleagues argued that the longer response times of the “incompatible” pairings of black pictures and pleasant words versus white pictures and unpleasant words tap into unconscious prejudice, Brendl, Markman, and Messner proposed that the IAT registers “familiar” versus “unfamiliar” sets of associations. The more common the association, the faster the reaction time; the more distinctive or less common the association, the slower the reaction time.
Brendl and his colleagues paired insects and nonsense syllables with pleasant and unpleasant words. In this experiment, people were quicker to associate insects with pleasant words and nonsense syllables with unpleasant ones. Yet can one say that insects such as cockroaches and wasps have pleasant associations?
By using nonsense syllables and insects, they got the same kind of results as did Greenwald and his colleagues with the white–black IAT. They suggest that IAT responses are a function of familiarity and unfamiliarity, especially since the differences are measured in milliseconds. The “compatible” sets of associations in Greenwald’s model are basically more familiar sets of associations compared to the “incompatible” or unfamiliar sets. Yet proponents ignore these alternative interpretations.
Brendl and his colleagues’ familiar–unfamiliar explanation is similar to alternative hypotheses proposed by psychologists Klaus Rothermund of the University of Trier in Trier, Germany, and Dirk Wentura of the University of Jena in Jena, Germany. They base their alternative explanations on what are often called figure–ground issues in the psychology of perception (or “salience asymmetries”).
The design of the IAT is such that the test taker is instructed to quickly sort the photos (e.g., white faces, black faces) with highly charged pleasant and unpleasant adjectives (e.g., happy, evil, good, poor), and at the same time, must sort the faces and words with the positive key and the negative key. Then, the test taker is instructed to switch, so that the picture of the black person’s face is associated with the positive attributes and the picture of the white person’s face is paired with the negative. To now undertake this second set of instructions, the brain must first sort the prompts (i.e., is it a white face, a black face, a pleasant word, an unpleasant word) and then re-orient the manual task of pushing the correct button.
The mental tasks of re-focusing and re-orienting may very well account for the disparities in black–white IAT results—disparities that are measured in milliseconds.
The strength of Rothermund and Wentura’s explanation lies in its placement within the scientific paradigm of perceptual psychology. They criticize IAT proponents for ignoring how people process salience asymmetry. The IAT, as “a new experimental paradigm” of social cognition, should confront this alternative asymmetry explanation.
In short, the familiar/unfamiliar and figure–ground explanations provide a powerful alternative paradigm to that of the IAT proponents. At best, these alternative explanations would dampen the magnitude of race-IAT disparities when properly taken into account. Just as likely, however, they highlight a more fundamental challenge to the dominant unconscious racism interpretation of the race IAT.
Racial Stereotyping, Racial Prejudice, or Knowledge of Stereotypes
The IAT is a test about race—and no one, after all, wants to be labeled a racist. Test takers know exactly what the correct response is supposed to be when photos of black and white faces are systematically associated with heavily charged positive or negative words. University of Colorado Boulder psychologists Charles Judd, Irene Blair, and Kristine Chapleau argue that the IAT taps into cultural stereotypes to which test takers had been exposed, rather than unconscious racism. The IAT proponents conflate automatic stereotyping with automatic prejudice, particularly in highly controversial areas such as police behavior. The researchers state, “If it is the implicit activation of these stereotypes that is responsible for racially biased policing, then it seems to us that rather different interventions are called for than those that would be most appropriate if the problem was due to…highly prejudiced officers.” Others say the IAT does not differentiate between holding cultural stereotypes and being aware of these stereotypes—and holding a stereotype would lead to treating people differently than merely being aware of stereotypes.
Knowledge of Racial Disparities, Not Unconscious Racism
In a similar vein, Gregory Mitchell and Philip Tetlock criticize proponents of the IAT for conflating knowledge of racial disparities with approval of racial disparities. According to Mitchell and Tetlock, if a test taker is aware of the following three propositions, then, by IAT proponents’ definition, he or she is an unconscious racist: (1) There are racial disparities in America; (2) A majority of Americans notice these disparities; and (3) Negative feelings have come to be associated with these widely noticed disparities. Based on these premises, Mitchell and Tetlock argue that “the more people know about the past and present history of American race relations, and about current patterns of inequality, the worse they should score on the IAT.”
Fear of Being Called a Racist, Not Being Unconsciously Racist
The fear of being called racist is yet another possible alternative explanation. Some white test takers would see the results as “confirming” their personal fears that they were racists, deep down.
In one experiment, a group of researchers led by Oberlin College psychologist Cynthia Frantz looked at whites threatened by the possibility of appearing racist and those who were not. Test takers who were more worried about appearing racist had worse IAT results. The psychologists conclude, “Ironically the IAT appears to be the most threatening to people who most want to appear nonracist”—which is the opposite of what the test is supposed to pick up.
These findings led Mitchell and Tetlock to conclude that, for some, the IAT measures “sympathy, not antipathy” for the condition of blacks (emphasis added). Tetlock and Ohio State psychologist Hal Arkes observe that the race IAT is unable to distinguish between test takers who feel a sense of injustice regarding race in America and test takers who feel prejudice and hatred towards blacks.
The studies described above are critical to the issue of whether the IAT is a measure of unconscious racism or something else. There are, however, other validity issues that should also be addressed before concluding that the science supports the idea that unconscious racism causes the disparities in IAT results.
Physical and Physiological Factors Affecting the Validity of the IAT
There are many extraneous variables that affect an IAT score. For one, the results of the IAT can be faked. In one study, test takers who deliberately slowed their reaction time to the “compatible” set of pairings (white-positive/black-negative) obtained a significantly smaller response-time difference between the “compatible” and the “incompatible” associations (black-positive/white-negative). In another study, test takers were trained to increase their speed in the experimental pairings, also leading to a less valid score. Whether test takers slow the expected pairings or speed up the experimental ones, this is evidence that results on the IAT can be faked.
Repeatedly taking the IAT also reduces the disparities between the “compatible” and “incompatible” pairings. The creators of the IAT do point out that repeated trials result in better scores and note that the scores of those taking the test for the first time cannot be compared to those who have taken the test more than once. The problem, as Greenwald and his colleagues note, is that prior experience automatically raises post-test scores: “The effect of prior experience means that scores of IAT novices cannot be compared directly with those of non-novices and, for the same reason, posttests cannot be compared directly with pretests (when the pretest is the first IAT taken).… [N]umerically less extreme IAT scores will be observed for those with prior IAT experience” (emphasis added).
Another external factor that improves scores is being in the presence of African Americans. When test takers are shown images of prominent African Americans such as Denzel Washington and Michael Jordan, race IAT scores improve immediately after. In another study, race IAT disparities were reduced after the test taker was first assigned to a group composed of whites and blacks.
Yet another condition affecting scores has to do with taking the race IAT in one’s native tongue. Bilingual individuals taking the race IAT get significantly different scores, depending on the language. One study showed that when bilingual test takers took the race IAT in their native language (e.g., Spanish), scores were significantly worse than when taking it in their non-native language (English).
Cognitive speed and cognitive flexibility have also been shown to affect IAT response time. Intelligence research finds that intelligence speeds performance on simple tasks. The more intelligent subjects had significantly faster reaction times than the less intelligent. IAT experiments that do not control for intelligence likely overestimate prejudice.
Finally, physical speed and flexibility affect IAT response time. The IAT requires task-switching, as when the “compatible” sets of pairings (white and positive words/black and negative words) are switched to the “incompatible” sets of black and positive words/white and negative words. Those who have greater physical speed and manual flexibility, e.g., younger adults, do better on the test.
Flaw Number Three: Do IAT Scores Predict Racist Behavior?
The discussion on validity has covered some of the alternative explanations for IAT results and the many factors compromising scores. All these explanations and variables raise serious questions about the overall validity of the IAT as tapping into unconscious racism. Some have become more cautious and willing to acknowledge the malleability of IAT scores. A study by University of Massachusetts at Amherst psychologist Nilanjana Dasgupta allows for the manipulability of IAT scores but notes that other race-linked attitudes, postures, and interactions “are also quite malleable depending on the extent to which awareness, control, and motivation are at play.”
Given that IAT results may be affected by some or all of the many factors described above, how much does the race IAT correlate with other measures of prejudice or discrimination? The critics say, “Not much.”
Scientific theories have to address the issue of “predictive validity,” which deals with the association between the independent variable (e.g., SAT score, hiring-exam score, IAT results) and the dependent variable (e.g., college admission, college GPA, job-performance evaluation). What little existing research there is has found less-than-robust correlations between the IAT and other measures, including other controversial measures of racial prejudice and discrimination—“symbolic racism” and race-based “microaggressions.”
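One common way to gauge what a predictive correlation means in practice is to square it, which gives the share of variance in the outcome that is linearly associated with the predictor. The sketch below is illustrative only: the values of r are placeholders chosen to span weak to moderate correlations, not findings from any IAT study, and the function name is an assumption.

```python
def variance_explained(r):
    """Share of variance in the outcome linearly associated with the predictor."""
    return r ** 2


# Placeholder correlations; these are not results from any study.
for r in (0.1, 0.3, 0.5):
    print(f"r = {r:.2f} -> r^2 = {variance_explained(r):.2f}")
```

Because the coefficient is squared, a modest correlation leaves most of the variation in the outcome unaccounted for.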
IAT Correlations with Symbolic Racism Attitudes
One way IAT researchers conduct validity studies is to correlate IAT scores and scores on a “symbolic racism attitude” scale. Symbolic racism attitudes are the positions on policy issues such as affirmative action (on which the anti–affirmative action response is considered an indicator of symbolic racism). This means that holding conservative beliefs, by definition, makes you racist. Symbolic racism proponents argue that certain attitudes toward affirmative action, work, unemployment, and welfare, among other issues, are actually racist attitudes masked as policy positions.
But in their review of such studies, Mitchell and Tetlock find that the “median result of studies of the implicit-explicit linkage yield estimates of low positive correlations between measures.” In other words, the linkage between unconscious racism and symbolic racism is weak, and results correlating IAT scores and other measures are mixed.
Of course, another problem is that symbolic racism attitudes are methodologically suspect. Regarding symbolic racism and other related studies, or what Stony Brook University political scientists Leonie Huddy and Stanley Feldman call “the new racism,” there is significant controversy within political science regarding the new racism’s construct validity, measurement validity, and predictive validity—the same scientific controversy surrounding the race IAT. Most noted is the criticism and research by Paul Sniderman and his colleagues, who argue that the “new racism” is confounded by conservative ideology, insofar as many issues used as indicators of “new racism” use the language of ideological individualism. Huddy and Feldman, in their own study, found that conservatives opposed race-conscious scholarship programs, whether the programs favored blacks or whites. Huddy and Feldman conclude, “Racial resentment, therefore, is not a clear-cut measure of racial prejudice for all Americans and may convey ideological principles for conservatives.”
In sum, both IAT and symbolic racism measures are problematic, and the two do not even correlate well with one another.
Correlating IAT Scores with Micro-Behaviors
The results relating race IAT scores with “micro-behaviors” are mixed. One study examined differences in subjects’ actions when interacting with a white experimenter or a black experimenter. Subjects were coded according to their degree of friendliness, comfort level, eye contact, and body posture, among other things. McConnell and Leibold videotaped subjects’ responses when interacting with black experimenters and then with white experimenters and examined corresponding IAT scores. Significant correlations were found between the IAT score and the nature of interaction with white-versus-black experimenters.
Another study, however, produced confounding results. After face-to-face contact, black subjects awarded more positive interaction scores to whites with “more racist” IAT scores. The black subjects awarded more negative interaction scores to whites with better IAT scores (i.e., whites who were less “racist”).
Other studies found that higher race IAT scores were only slightly correlated with greater social discomfort and anxiety in contact with blacks and other groups. In her review of the controversies in IAT research, University of Georgia sociologist Justine Tinkler observes that implicit measures are basically associated with behaviors “that are spontaneous and difficult to control.” These include various test taker micro-behaviors—including eye contact, eye blinks, posture, smiling, speaking time, and speech errors—as part of the interaction between test taker and white/black experimenter, or white/black persons who were just in the room.
Mitchell and Tetlock criticize the interpretation of the micro-behaviors such as looking down or away, halting speech, not speaking, and posture as indicators of racism. They note that these behaviors are also indicators of shame—which, in turn, can be a function of unfamiliarity, uncertainty, fear of being labeled racist, or shame of societal treatment of blacks, among other factors—not micro-behavioral manifestations of racism.
Even a major meta-analysis of the predictive validity of the IAT, conducted by Greenwald and his colleagues on 122 studies with 184 independent samples and 14,900 subjects, could not find large correlations. Their meta-analysis involved not just the black–white IAT but the IAT used for many other groups. They found an average correlation of 0.27 between different types of the IAT and various performance, judgment, and physical measures. For black–white race IAT studies, correlation coefficients averaged around 0.24. In a later meta-analysis, Rice University psychologist Frederick L. Oswald and his colleagues found even lower average correlations of roughly 0.15.
In 2015, Greenwald, Banaji, and Nosek argued that differences in methodology between the studies account for the difference in correlations, but they conceded that the correlations between IAT results and other behaviors are small. They nonetheless state in their refutation, “Statistically small effects…can have societally large effects.” They offer no proof of this claim, however.
The problem of weak correlations was found as far back as 2003. Psychologists Russell Fazio of Ohio State University and Michael Olson of the University of Tennessee concluded in their 2003 review of implicit measures, “One of the most disturbing trends to emerge in the literature on implicit measures is the many reports of disappointingly low correlations among the measures…. Unquestionably, part of the problem with these disappointing correlations among various implicit measures is their rather low reliability.” Earlier they state, “In contrast to the numerous investigations concerning known-group differences, less work has been conducted concerning the prediction of behavior from IAT scores. The evidence that does exist is mixed.”
Not much has changed. In the 2009 Annual Review of Political Science, noted public opinion researchers and political scientists Leonie Huddy and Stanley Feldman examined the work on the IAT, and concluded that, while interesting, “the results of implicit racial attitudes can be confusing.” They, too, note the often-contradictory results between implicit and explicit attitudes.
Huddy and Feldman argue that explicit racial attitude questions, not the results of an unconscious racism test, should suffice to uncover racism, especially when there is time for a respondent to think about policies or a politician (e.g., feelings toward President Obama). Huddy and Feldman state: “[C]ontinuing disputes in psychology over the meaning of implicit attitudes serve as a cautionary note to political scientists interested in incorporating such measures into their research.”
Along similar lines, the lack of consistent and robust correlation between IAT results and other measures led social scientist Justine Tinkler in 2012 to conclude as well that it is wrong to discount explicit attitudes and focus only on implicit attitudes:
[I]t is not accurate to interpret explicit attitudes as politically correct and dishonest and implicit attitudes as true attitudes…. [W]ith evidence that implicit and explicit attitudes affect race-related behavior in different ways, it would be a mistake to dismiss egalitarian, non-racist explicit attitudes as dishonest because they reflect social desirability.
The weakness of the race IAT in terms of predictive validity brings us to the last methodological issue—the problem of the false positive.
Flaw Number Four: False Positives Are False Accusations of Racism
The rate of “false positives” and “false negatives” is critical in evaluating the truth-value and utility of any test. A false positive is a result in which the condition is detected but is not really there. Everyday examples include the false alarm for home security systems, where the system indicates an intruder but the “intruder” is in reality the family cat. Another example, from medicine, is the presence or absence of cancer. One test (e.g., a PSA test) indicates possible prostate cancer, but further testing (e.g., a biopsy) turns up negative. This is an instance of a false positive on the first test, where a different test was subsequently used for assessment. Less commonly discussed are false negatives. This is where the test indicates the absence of a condition (e.g., cancer), but the condition is really there.
The first indicator that the IAT may produce many false positives is that the correlations of IAT scores with other indicators of racism are low. Low correlations indicate an insensitive test, and therefore a high level of false positives. In other words, many who are labeled racists by the IAT are in fact not. Mitchell and Tetlock’s analyses find the false-positive rate for the IAT falls between 60 percent and 90 percent, depending on the study. Since we do not know the true state of one’s character (whether a person is truly racist or not), we have to use other measures (e.g., behavioral or attitudinal responses) as substitutes to discover false positives and false negatives. Mitchell and Tetlock use survey attitudes, where at most 30 percent appear to give a “racist” response on an attitude question. The table below shows the more widespread “racist response” finding that this author took from Huddy and Feldman’s “The American Racial Opinion Survey” (AROS).
Answers in the affirmative (“great deal,” “some,” or “little”) for items five and six were used as indicators by Huddy and Feldman of overt racism. Roughly 40 percent thought that differences in standardized tests were due to racial differences in intelligence; 35 percent thought the black–white difference was a “fundamental genetic difference between the races.”
For the analysis below, this report will use the 40 percent of overtly racist responses, for the sake of argument, and will assume that 75 percent have a high enough IAT score to be classified as unconscious racists. Using 40 percent gives us the largest percentage of true positives and the smallest percentage of false positives for the IAT. If we have 1,000 individuals taking both the AROS survey and the IAT, let us assume the following: 400 would give the racist survey response; 750 of those taking the IAT would get a racist IAT score; and the IAT would pick up 90 percent of those truly racist (i.e., racist IAT score and racist survey response).
The false-positive rate used here is the share of people with a racist IAT score who gave a non-racist survey response. Under the assumptions above, the IAT picks up 360 of the 400 people with a racist survey response (90 percent), so the remaining 390 of the 750 people with a racist IAT score gave a non-racist response on the survey question. Dividing 390 by 750 gives a false-positive rate of 52 percent. This means that 52 percent of those with a racist IAT score are not racist.
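Conversely, the false negatives are those persons who are truly racist according to the survey item but are not detected by the test. There are 40 of these persons. As a share of the 250 people with a non-racist IAT score, that gives a false-negative rate of 16 percent. In other words, roughly 16 percent of those the IAT clears as non-racist gave the racist survey response. Measured against the 400 people classified as truly racist, the test misses 10 percent, as the 90 percent detection assumption implies.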
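The arithmetic behind these figures can be reproduced with a short sketch. It is only an illustration of the hypothetical 1,000-person example: the function name and parameter names are invented here, and the default values (40 percent racist survey responses, 75 percent racist IAT scores, 90 percent detection) are the assumptions stated in the text, not measured quantities.

```python
def iat_error_rates(n=1000, survey_share=0.40, iat_share=0.75, detected_share=0.90):
    """Cross-tabulate survey responses against IAT scores for n test takers.

    Every default is an illustrative assumption from the worked example,
    not an empirical estimate.
    """
    survey_racist = round(n * survey_share)            # racist survey response
    iat_racist = round(n * iat_share)                  # racist IAT score
    true_pos = round(survey_racist * detected_share)   # racist on both measures
    false_pos = iat_racist - true_pos                  # racist IAT, non-racist survey
    false_neg = survey_racist - true_pos               # non-racist IAT, racist survey
    true_neg = n - iat_racist - false_neg              # non-racist on both measures
    return {
        "true_positives": true_pos,
        "false_positives": false_pos,
        "false_negatives": false_neg,
        "true_negatives": true_neg,
        # Share of people with a racist IAT score who gave a non-racist response
        "false_pos_share_of_flagged": false_pos / iat_racist,
        # Share of people with a non-racist IAT score who gave a racist response
        "false_neg_share_of_cleared": false_neg / (n - iat_racist),
        # Share of people with a racist survey response whom the IAT missed
        "miss_share_of_survey_racist": false_neg / survey_racist,
    }


if __name__ == "__main__":
    for name, value in iat_error_rates().items():
        print(f"{name}: {value}")
```

With the default assumptions, the sketch yields 390 false positives out of 750 racist IAT scores (52 percent) and 40 false negatives out of 250 non-racist IAT scores (16 percent). Changing the assumed shares shows how sensitive both rates are to the choice of survey item.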
Are a 52 percent false-positive rate and a 16 percent false-negative rate acceptable? This is not a scientific question, but an ethical and policy one. In the real world, the IAT has been suggested for screening students and faculty, job hiring and promotion, and jury selection, to name a few instances. Over half of those flagged by the race IAT will be falsely accused, while 16 percent of those it clears will be racists who slipped by.
The error rate for the IAT is sufficiently high that even one of the founders of Project Implicit admitted, “Taken together, there is substantial risk for both falsely identifying people as eventual discriminators and failing to identify people who will discriminate.”
Flaw Number Five: IAT Results Cannot Be Generalized to the Real World
To what extent can the race IAT findings generated from a sample of undergraduates in a college laboratory be generalized to the real world? IAT proponents occasionally acknowledge the problems of moving from scholastic research to the public policy and legal arenas. The meta-analysis conducted by Greenwald and his colleagues excluded the Internet-based IAT findings on the Harvard-sponsored Project Implicit website because the website’s test conditions are unreliable and not valid. Yet proponents still cite the Web-based numbers to lend the test the appearance of scientific authority.
Even the creators of the IAT, Greenwald and Banaji, acknowledge in their book, Blindspot, that the website-based findings of Project Implicit cannot be generalized to the American public. In 2015, in a technical paper, Greenwald, Banaji, and Nosek concede that the scientific issues associated with the IAT mean that the test should not be used for individual assessment.
Elsewhere, Brian Nosek and Rachel Riskind, assistant professor of psychology at Guilford College in Greensboro, North Carolina, agree with critics that the IAT generates high false-positive rates and therefore should not be used for individual diagnostic purposes.
Although the reviewed evidence shows that this stereotype is true in the aggregate, in our view, applying this to individual cases disregards too much uncertainty in measurement and predictive validity.… [T]he present sensitivity of implicit measures do not justify this application.
As a concrete example, Mitchell and Tetlock present a list of workplace best-practice conditions that make generalizing from IAT research to the American workplace dubious. Significant differences between the lab setting and workplace include the following:
- Only race is considered in the lab, while supervisors look at employees’ backgrounds, work histories, and past performance.
- The test taker has no knowledge of the experimenter’s views, while organizations typically publish an official view on discrimination, prejudice, affirmative action, and diversity.
- There is no close future contact between experimenter and test taker, and negative responses have no future consequences, while there is often future interaction among employees and between employees and supervisors. Negative interactions between supervisors and employees have consequences.
- Test takers do not expect to work with the experimenter in a team, and negative IAT lab results have no future teamwork consequences, while organizations expect teamwork between supervisors and employees.
- Test takers are usually college students and lack experience in supervising others. Workplace managers and human resource personnel have experience in hiring and managing others.
For these many differences, Mitchell and Tetlock conclude that the IAT should not be used for workplace evaluations. In the workplace, there would be hiring and promotion consequences as well as legal and financial ramifications if an employee or manager is said to be an unconscious racist based on the IAT. The same kind of analysis applies to such fields as teaching, criminal justice, and health care.
Conclusion and Implications
Proponents of the IAT have not shown that the test unequivocally measures unconscious racism and have failed to rule out alternative explanations. Likewise, the IAT has not been shown to correlate with other established measures of prejudice and discrimination, and little research shows it predicting discriminatory behavior. There are high rates of false positives and false negatives associated with the test. The IAT has not been shown to apply to real-world settings.
Notwithstanding these problems, IAT proponents seek its widespread adoption in public policy and legal arenas. Ignoring the differences between scholarly research and real-world implementation, big institutions have hired private consultants and instituted proprietary programs to correct, train, and generally root out unconscious racism and other forms of bias.
The Association of American Medical Colleges (AAMC) makes unconscious bias a focus of medical school curriculum and medical practice reform. Despite the current scholarly consensus that the IAT should not be used for individual diagnostics, the AAMC encourages its use to discover the extent of an individual’s unconscious biases, and various medical schools do the same. Prominent medical schools feature the IAT on their websites and encourage people to take the test.
In the AAMC’s forum on diversity and inclusion, the IAT is referenced throughout, starting with the AAMC making the highly disputed claim, “The IAT has been rigorously tested for reliability, validity, and predictive validity and has been shown to be a methodologically sound instrument for measuring unconscious associations.”
At Ohio State University, members of the medical school admission committee took the IAT in order to screen for their “implicit white preference,” followed by a presentation on unconscious bias and strategies for its reduction, before screening applicants.
A medical school, of all places, should know all about the proper administering of tests. While citing the work of Greenwald, Banaji, and others, these elite medical schools and the AAMC seem to have missed the part about the limitations of the IAT: that it should not be used for individual diagnostics, such as assessing the unconscious bias of individual committee members before actually screening candidates.
At UCLA, administration officials encouraged all to test themselves for unconscious racism, speaking of “a series of troubling racial climate incidents.” The university has a Vice Chancellor for Equity, Diversity, and Inclusion. The office is currently held by Jerry Kang, Professor of Law and Asian American Studies, who has written numerous articles and guides on implicit-bias studies and the law, including a recent lecture on the implications of implicit-bias research and the Equal Protection Clause. He is assisted by “discrimination prevention officers,” who are responsible for investigating bias among faculty and for ameliorating the situation by holding re-education and training sessions, pointing to the IAT as a resource. In 2017, UCLA required all members of faculty search committees to undergo training in spotting implicit bias, including mandatory viewing of UCLA’s full Implicit Bias Video Series before starting the search.
In the corporate world, according to the Wall Street Journal, roughly 20 percent of large corporations currently provide some sort of unconscious bias training—a figure estimated to grow to 50 percent by 2020, which has given rise to consultants and technology to stamp out unconscious bias.
Policy experience has shown, however, that even the best-designed and executed studies and programs (e.g., FDA studies, randomized reading research in education) can have non-generalizable results and lead to unintended consequences. Public policy application of social science findings must be significantly more demanding, analogous to FDA drug studies—and even more so when there are legal consequences. There are no scientifically based evaluation studies showing that the programs implemented to correct unconscious bias are reliable, valid, and effective and that subjecting people to these programs yields the desired real-world outcomes.
Yet many are not shy about proposing how the IAT could be used. After all, what are the societal consequences for promulgating a less-than-rigorous scientific theory of prejudice? Should the arguments of reliability and validity be limited to the concerns of methodologists and psychologists?
Researchers inserting themselves into the policy and legal arenas on behalf of social intervention is a problem, and, as Nosek and Riskind observe, many experimental scientists fail to appreciate the complexity of the policy process. Unlike the university science lab, the real world is full of intervening variables. In the eyes of the public, such advocacy further politicizes the social sciences and further erodes the disciplines’ credibility.
Moreover, by emphasizing the finding that unconscious racism is spread throughout society, even among its most “enlightened” elite, the race IAT research may sadly serve to make race relations worse. False accusations of racism are highly likely, and true instances of racism lose their salience. The real difficulty is the public cynicism and indifference that result when accusations are made, new policies are implemented, and millions of dollars are spent on the problem, with little perceived progress. Ultimately, unconscious racism, cultural stereotyping, stereotype threat, or whatever else the IAT actually measures is regularly overcome in everyday life.
Given the high probability of errors associated with the IAT, it should not be incorporated into public policies on hiring, university admissions, housing, banking, and government contracting, nor should it be used by law enforcement, in lawsuits, or in jury selection. Although it has been hailed by the media as uncovering a dark, secret side of the American psyche, numerous critics of the IAT have demonstrated that it simply cannot predict how test takers will act in the real world. The test fails to prove that we are a nation of unconscious racists.
—Althea Nagai, PhD, is Research Fellow at the Center for Equal Opportunity. The author would like to thank the Center for Equal Opportunity for its editorial support and advice throughout the publication process.