What SAT Scores Say About Teacher Effectiveness

A little-known design feature suggests the connection may be tenuous.

The SAT has been in the news again, this time because of the claim that test-optional policies are a way for colleges to covertly impose affirmative action. It’s true that such policies have created a two-tier system that allows colleges to accept more black and Hispanic students than would otherwise qualify for admission. But the argument ignores a far more fundamental factor that is poorly understood.

This factor has to do with the way the SAT is designed and marketed. The College Board, which owns the test, says its design permits admissions officers to compare test takers. But, as W. James Popham conveys in Testing! Testing!, if the SAT were loaded up with items that measured the most important material taught effectively by teachers—the reason for its existence in the eyes of many—test scores would be bunched together, making comparisons impossible. To avoid that inevitability, designers are forced to engineer what is called “score spread.”

There’s nothing sinister about this; it’s strictly a matter of business. To understand why, it’s necessary to go back to the start of World War I, when the country faced the urgent need to identify officer candidates quickly. The military found itself ill-equipped for the challenge and turned to the American Psychological Association for help. Working out of the Vineland Training School in New Jersey, a special committee came up with the Army Alpha. This standardized aptitude test presented recruits with verbal and mathematical questions that allowed them to be ranked according to their intellectual abilities.

Recruits who scored high were sent to officer-training school, while low-scoring recruits were given other military duties. In all, some 1,750,000 men took the Army Alpha, which proved to be enormously successful for the task at hand. Encouraged by the efficiency with which large numbers of subjects could be sorted, designers of standardized achievement tests decided to apply the Alpha approach to whatever was being measured.

It was at this point that standardized testing entered the morass in which it remains mired today. The cause is the widespread misunderstanding that conflates aptitude tests and achievement tests. The former are designed to predict how well a test taker is likely to perform in a future setting. In contrast, the latter are designed to measure the knowledge and skills that a test taker possesses in a given subject.

Scores on the two types of test may be related, but they do not necessarily correlate. That’s because, although all mental tests are “g-loaded to some degree,” as Joseph Soares explains in SAT Wars, the smartest students don’t always work as hard in school as their innate intelligence would indicate. The College Board makes this clear in its study finding that 25-30 percent of high-school students have what are known as discrepant scores. That means that their SAT score and their GPA are at variance with each other. Furthermore, while the College Board attempted several years ago to align the SAT with Common Core curriculum standards, the nonpartisan education-reform organization Achieve has suggested that that alignment is incomplete at best. Therefore, the test bears little relation to the actual work of students in class.

To maintain credibility, SAT designers take great pains to ensure that items on the test pass muster. New items are placed in an unscored section of an actual test to determine how well they discriminate among test takers. If an item performs well, it is selected for future placement in a scored section. The goal is always to engineer score spread so that test takers can be compared. This item analysis also provides feedback in the form of Differential Item Functioning, which flags items that behave differently for different groups of test takers.

Items answered correctly by too many test takers—typically, by more than 80 percent, as Testing! Testing! relates—are almost always deleted when the SAT is revised. What develops is an educational Catch-22: The more effective teachers are in teaching the most important material in their subjects, the more likely they are to be denied the recognition they rightly deserve, particularly to the extent that students’ SAT scores are used to gauge teacher performance. This “editing” process will not be affected by the College Board’s January decision to make the test shorter (two hours rather than three), simpler (reading passages will be easier and followed by a single question), and digital (scores returned in days, not weeks).
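The editing rule described above can be sketched as a toy simulation. Everything here is invented for illustration—the item pool, the ability model, the number of test takers—except the one detail taken from the text: items answered correctly by more than 80 percent of takers get dropped, no matter how well the underlying material was taught.

```python
import random

random.seed(0)

N_TAKERS = 1000
# Hypothetical item pool: each number is an item's "easiness" — the base
# chance that a test taker answers it correctly. Material taught
# effectively would cluster near the high end.
easiness = [0.95, 0.88, 0.70, 0.55, 0.40, 0.25]

# Each simulated taker's ability nudges every answer probability up or down.
abilities = [random.gauss(0, 0.15) for _ in range(N_TAKERS)]
responses = [
    [1 if random.random() < min(1.0, max(0.0, e + a)) else 0 for e in easiness]
    for a in abilities
]

# Item difficulty ("p-value"): the share of takers who got the item right.
p_values = [sum(r[i] for r in responses) / N_TAKERS
            for i in range(len(easiness))]

# The editing rule from the text: items answered correctly by more than
# 80 percent of takers are deleted when the test is revised.
kept = [i for i, p in enumerate(p_values) if p <= 0.80]

print("proportion correct per item:", [round(p, 2) for p in p_values])
print("items surviving revision:", kept)
```

The easiest items—the ones a well-taught cohort reliably answers—are exactly the ones the rule removes, which is the Catch-22 in miniature.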

The public’s confusion about the SAT hasn’t been helped by changes to the test’s name over the years. In 1926, when it was first conceived by Carl C. Brigham as an instrument for promoting greater meritocracy, it was called the Scholastic Aptitude Test in the belief that it assessed innate ability. But by the 1990s, the College Board had had second thoughts. It renamed its premier instrument the Scholastic Assessment Test because of concerns that the original designation was too often associated with eugenics. Eventually, the board did some serious soul-searching and altered the name to simply “the SAT,” which stands for nothing.

Part of the College Board’s historical dilemma has had to do with its claim that the SAT was not coachable. But Stanley H. Kaplan, who went on to establish the test-preparation company bearing his name, proved that this was not the case by helping students in his Brooklyn neighborhood dramatically boost their scores. His secret was constant practice with immediate feedback.

Stung by Kaplan’s success, the College Board was finally reduced to playing its trump card and insisting that the SAT had predictive value. But Bates College called that assertion into question when it made test scores optional. In 2004, it announced that its 20-year study had found almost no difference in the four-year academic performance and on-time graduation rates of 7,000 submitters and nonsubmitters. Although other colleges adopted test-optional admissions policies, it was the University of California that packed the greatest punch by announcing, in December 2021, that it would not use any entrance exam whatsoever, even though a committee of faculty leaders had concluded that the SAT and ACT were worth keeping.

In the face of such changes, the College Board is fighting to stay relevant. According to FairTest, a nonprofit that advocates for the limited use of standardized tests, more than 73 percent of all four-year colleges and universities no longer require an entrance exam for admission. As a result, the College Board’s revenue has declined to $760 million from $1.05 billion the prior year. Moreover, the company laid off about 14 percent of its employees.

Despite confusion surrounding the nature and design of the SAT, standardized test scores may continue to be prominently featured in the media as ostensible evidence of teacher quality. Daniel Koretz, in Measuring Up, calls this widespread public misunderstanding “the dirty secret of high stakes testing.” He explains how the goal of standardized tests is to end up with items that some, but not all, test takers will answer correctly. The shift from using such tests diagnostically to using them for teacher accountability is, according to Koretz, “the single most important change in testing in the past half century.” Unfortunately, it will likely only accelerate in the years ahead.

Walt Gardner is a former lecturer in the UCLA Graduate School of Education. He blogs about education at theedhed.com.