The US Test Mess

Standardized educational tests do not perfectly measure student aptitude or achievement, and no one argues that they do. But they can differ from all other available measures in two respects: their standardization and their independence of education insider control.

To be truly standardized, the same content must be administered in the same manner to all students. To be independent of educator influence, tests must be “externally” administered—that is, test materials must be managed, and tests administered, by non-school personnel.

External administration of a test systemwide to just one grade level of students requires both intensive and extensive logistical management. That is one reason why most countries administer large-scale, consequential tests at only a few grade levels, typically at key transition points. Administering those tests securely requires even more logistical oversight.

Moreover, most relatively wealthy countries maintain educational testing systems that strongly resemble those they had twenty years ago. I propose that they maintain such system structures because they judge them optimal.

Now, consider what has transpired over the past twenty years in the USA. We were headed in the direction of other countries’ testing system structures at the turn of the millennium, with state-led consequential achievement tests for students administered only every few grade levels.[1] Plus, we benefitted from two competing college admission tests, whose scores could be submitted for consideration simultaneously to thousands of universities worldwide. [2]

Then came three disruptions, each of which, I would argue, served to undermine the utility of US educational testing.

First came passage of the No Child Left Behind (NCLB) Act (2001–2002), which imposed a federal mandate on all public schools (including charters). The NCLB insistence on annual test administrations across seven grade levels virtually guaranteed lax security: teachers administer tests in their own classrooms to their own students, and principals manage the distribution and collection of test materials in their own schools. We then judge schools and teachers by scores on the very NCLB tests they themselves proctor.


By 2001, most US states had developed their own tests and testing programs that made sense for them, along with a cadre of testing expertise in their state education departments. Most of those state testing programs have since been dismantled, with resources refocused on the testing legally required by the federal government—the NCLB tests. Replacing the former state systems are tests that are less reliable, more costly, and less useful to teachers. Fewer tests now carry stakes for students, who may or may not try to perform well on tests that seem pointless to them.

The second disruption occurred more gradually and less noticeably. Starting with the ACT college admission test in the mid-2000s, and continuing later with its SAT competitor, some states began administering the tests to all high school juniors—proctored not by state or test-company employees but by school personnel themselves. Previously, the ACT and SAT had been administered only by company personnel in secured venues. With statewide administration, the testing companies relieved themselves of responsibility for test security, leaving its enforcement ambiguous at best.

Adoption of the Common Core Standards brought the third disruption undermining US educational testing. Common Core Standards adoption was achieved by surreptitious means: deceitful promises and bullying by kidnapped revenue (i.e., threatening to withhold federal funds).

Those who still argue that states control US educational testing must ignore some inconvenient facts: every state must annually seek federal approval for its testing program, and two private, unelected organizations—the Council of Chief State School Officers (CCSSO) and the National Governors Association (NGA) Center for Best Practices—own the copyright to the Common Core Standards, which 45 states still use, even where they have changed the name. Common Core tests typically incorporate various progressive education delights: confusing but “innovative” formats; open-ended items slowly, tediously, and subjectively scored by fallible humans; and “deeper” learning’s insistence on explaining an answer and solving problems multiple ways.

Degradation of test purposes

Tests are not all the same. One wouldn’t administer medical or law board exams to a class of kindergartners. One wouldn’t administer an in-the-cockpit pilot’s exam for entry into the accounting profession. One wouldn’t administer an IQ test for certification from an HVAC repair program.

A test is most valid and efficient when optimized for a particular purpose. The content of a high school exit exam, for example, should be determined directly by the content covered in students’ previous coursework. Exit exams retrospectively measure how well a student mastered that material. They are standards-based achievement tests.

By contrast, the best college admission exams are those most predictive of desired college outcomes, such as good college grades, persistence in college, or college program completion. Two outstanding features have encouraged colleges to include them among the array of factors they consider in admission decisions. The first is a consistent nationwide standard of comparison, in contrast to highly variable high school measures such as GPA, class rank, recommendations, essays, and extracurricular activities. The second is the aptitude component, which isolates abilities of individual applicants that are highly correlated with college success but are not measured by the other admission factors considered. Especially in the US, college differs from high school, and different skills, aptitudes, habits, and preferences emerge as more important.

One may recall that an original purpose of the Scholastic Aptitude Test was to find the “diamonds in the rough,” the bright students with college potential who may have been stuck in poor schools, or in too-slow-moving academic tracks, but had acquired some relevant knowledge and skills on their own. The aptitude test was designed to help capable but poor students compete in the admissions process against wealthy students advantaged by better-resourced schools and home environments.

College Board, the organization responsible for the SAT, had been chipping away at the test’s aptitude features for decades, as if they were shameful. Then, a decade ago, the “architect” of the Common Core Standards, David Coleman, ascended to the top spot at College Board, despite a lack of training or experience in testing and measurement or in managing large organizations. He promised to align the SAT directly with the K–10 Common Core standards, thereby making the SAT a retrospective achievement test, highly correlated with other admission factors such as class rank and grade-point average. Some of the several percentage points of “incremental predictive validity” the old SAT had provided would thus disappear.

Meanwhile, ACT Inc. declined Common Core alignment in order to preserve its predictive validity. While preserving its test’s usefulness for its higher education customers, however, ACT may have lost some state contracts as a result. [3] Journalists reported that College Board’s promise of a tighter Common Core alignment may have tipped the scales in its favor for statewide college admission testing contracts in Michigan and Illinois.

K–12 educators in those two states may have assumed that their students would perform better on a Common Core-aligned test. But the two purposes trade off against each other: the more valid a test’s scores are retrospectively, the less valid they can be predictively. What may have seemed advantageous to high school educators should have seemed less so to college admission officers—but the latter were less central to the contract decisions.

Most US states now administer a single test in high school to both retrospectively measure achievement and prospectively predict college success, thus reducing its validity for one or both purposes. Those tests are the SAT or ACT, or variations on the original Common Core PARCC and SBAC tests. Indeed, promoters of the Common Core Standards, paid by the Bill and Melinda Gates Foundation and its allies, encouraged states to drop their own achievement testing programs and rely instead on Common Core tests, which were designed for different purposes.

What we lost with federal intervention

In 2001, the US had an array of testing firms serving state clients, with long histories, deep expertise, large item banks, and time-series data. Soon after, that industry was upended, and it remains in flux. Some firms left the industry entirely. Psychometric personnel move with each contract turnover.

Whether “laboratories of democracy” is an apt description of the US states or not, their sheer number assures a diversity of action by comparison with a single, national monopoly. When rules are made at the federal level, the most powerful interest groups tend to get their way over everyone else. By contrast, ordinary citizens and smaller interest groups have more say with state governments, which are literally closer to them. This may explain why the wise writers of the US and state constitutions designated the states responsible for education.

In the late 1990s and early 2000s, three states tried out testing programs designed by some of the Common Core’s chief designers. In California, Maryland, and Kentucky, state leaders tolerated the dysfunction for a while but ultimately canned the unreliable tests tied to fuzzy standards.[4] Then, with the help of Bill Gates and both Democratic and Republican leaders, the same test designers came in through the back door at the federal level. It remains to be seen if federal leaders will face reality and muster the common sense and fortitude that state leaders did.

[1] And no-stakes diagnostic or monitoring tests administered either nationally (the National Assessment of Educational Progress, NAEP) or locally (with “off-the-shelf” commercial tests).

[2] Many other countries still require university applicants to take a separate entrance exam at each university.

[3] The University of California System’s recent faculty analysis of admission test utility—whose recommendation to continue using admission tests was overruled by the UC board—found the ACT to be more predictive of college grades, in what was historically an “SAT state.”

[4] Go to  for lists of contemporary references from those three states.

Richard P. Phelps is founder of the Nonpartisan Education Group, editor of the Nonpartisan Education Review, a Fulbright Scholar, and a fellow of the Psychophysics Laboratory. He has authored, edited, and co-authored books on standardized testing, learning, and psychology.