To Address the Irreproducibility Crisis, Invest in Digital Archiving

An obscure scholarly practice may decide how free—and accurate—information remains in the future.

[Editor’s note: This article belongs to a series on the intersection of science, technology, and higher-ed reform. Please click here to read J. Scott Turner on federal research oversight and here for Nathan Schachtman on the need for STEM education for lawyers.]

“Millions of Scholarly Articles Are Not Being Archived,” says the article title, and the initial impulse is to roll one’s eyes and think cynical thoughts. Given the quality of scholarship nowadays, isn’t that all for the best? And even if we are being generous about the quality, there is such a deluge of scholarship, isn’t salutary pruning a necessity? If all 92 of Euripides’ plays had survived, wouldn’t we think the worse of him? Archival failure surely will make posterity think better of 21st-century scholarship.

But archival issues do matter, not least because they concern the irreproducibility crisis of modern science—no, let that wait a little. First let’s talk about exactly what’s at issue.

Scholarship, like publication in general, is most of the way through a vast switch-over from paper format to digital. Most everything is published digitally nowadays; paper copies are ancillary. Millions and millions of works are now being published online. In the scholarly article that started this discussion, the authors state casually that, “of the 7,438,037 works examined, there were 5.9 million copies spread over the [digital] archives used in this work.” A staggering amount of intellectual content is now not only available digitally but produced digitally and intended to be read digitally.

But of course that cannot be done unless scholarship is made easily accessible to potential readers. Librarians and archivists for generations have been elaborating best practices to make sure that the reader can find what they want in a flood of information. What they do includes creating and standardizing metadata—Title, Author, Date of Publication, Subject Keywords—and then making sure that search engines and algorithms can search efficiently among what also has become a metadata deluge. Preserving, cataloging, and providing access to digital items represents a further level of headache-inducing complexity. How do you describe an electronic file? How do you refer to different editions? How do you guarantee that the file’s contents haven’t been changed?

Digital archiving is an essential part of this larger problem of preserving access to digital records. What happens when hardware or software becomes obsolete, so that you can no longer read a digital file? What happens when a file is preserved on just one server, and a squirrel chews a crucial wire or a lightning bolt strikes an essential circuit? How do you make sure there are up-to-date copies of millions of digital files? How do you ensure that there are multiple archives with responsibility for everything that has been published? How do you ensure that every single archive is able to use a standardized format, in common with every other archive? How do you pay for the necessary computer equipment? How do you get archival staff trained to do the necessary work?

All this must be mind-numbingly boring to the layman. It is a bleak irony that the amount of scholarship devoted to digital archiving is sufficient to augment appreciably the amount of scholarship that needs digital archiving. Yet it is an issue that matters, and we should be grateful that there is a large body of professionals devoted to this issue.

It matters because of the irreproducibility crisis of modern science that I mentioned before. That’s the combination of groupthink, publication bias, discarding of negative results, and culpably negligent use of statistical analysis that has left us with a body of modern scientific research in which a majority of published findings may be false. There’s a great deal that needs to be done to fix the irreproducibility crisis, but you can’t even begin if you haven’t properly archived every bit of research in a field or, indeed, if you lack the capacity to provide proper archiving for research.

Digital archiving particularly matters for meta-analyses, which analyze entire bodies of individual research findings and are an essential tool for gauging the evidentiary value of an individual piece of research. Meta-analyses work only if you can locate and include the entire body of relevant research. Without proper digital archiving, meta-analyses become Garbage In, Garbage Out (GIGO) analyses.

And indeed, one of the solutions to the irreproducibility crisis is to require born-open data, as well as generally to require that all research data be publicly accessible prior to and after publication. Not only academic articles but also all scientific research data should be stored in multiple repositories, in a standardized format accessible to every interested reader. The proper archiving of research data is not a fundamentally different challenge, but it does add significantly to the work digital archivists must undertake.

Then there are the national-security implications. The Department of Defense has adopted a formal applied ontology to organize and integrate its data descriptions. Effective digital archiving must be an essential component of that work. If the basic digital objects lack fixity and distributed preservation, how can the ontology function effectively? In this way, proper digital archiving contributes to national security.
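
For readers who want to see what “fixity” means in practice, here is a minimal sketch, not a description of any particular archive’s system. It assumes that a trusted SHA-256 digest of the original file has been published somewhere, and that you hold one or more copies to check against it. The function names `sha256_digest` and `check_copies` are invented for this illustration.

```python
import hashlib


def sha256_digest(path, chunk_size=65536):
    """Return the hex SHA-256 digest of the file at `path`.

    The file is read in chunks so that large files need not fit in memory.
    """
    hasher = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def check_copies(reference_digest, copies):
    """Compare each copy against a trusted reference digest.

    `reference_digest` is assumed to come from a source you trust
    (hypothetically, a digest published alongside the original).
    Returns a dict mapping each path to True (matches) or False (differs).
    """
    return {
        str(path): sha256_digest(path) == reference_digest
        for path in copies
    }
```

Calling `check_copies(published_digest, ["copy_a.pdf", "copy_b.pdf"])` with your own file names would report which copies still match the published digest. A mismatch shows only that a copy differs from the reference. It does not say who changed it or why, and the check is only as trustworthy as the source of the reference digest itself.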

Digital archiving also must face the challenge posed by high-tech censorship. Amazon can now delete a “problematic” word or argument from your Kindle book without alerting you. A book company can alter the latest edition of a book; a scientific journal can change or withdraw an article—any digital object can be manipulated or removed by high-tech gatekeepers without alerting the public. Perhaps the greatest challenge in the field of digital archiving is to distribute fixed digital objects so broadly that they cannot be censored by private or governmental actors and to provide every individual a quick and secure means of assessing whether a given digital object remains unaltered—and a means of locating the unaltered original. This is a task necessary for the preservation of liberty, and it is, fundamentally, a task of digital archiving.

It may not be a task for the institutional profession of digital archivists. They are institutional, necessarily servants of Belial and of Mammon, and are not professionally oriented toward individual liberty and individual archiving. Too many, indeed, are woke and censoring. This is a task for a Peter Thiel or an Elon Musk—to establish a means of digital archiving that will serve the individual and his liberty and that will provide the means to keep scientists honest.

Digital archiving matters too much to be left to the digital archivists. But having said that, let us also praise them for the good, hard, and (oh, poor archivists!) terribly dull work they are doing.

David Randall is the research director of the National Association of Scholars.