A Gold Mine of Student Data

What high school curriculum best prepares a student for college? Which majors yield the highest-paying jobs? Does being held back in kindergarten ultimately help or hurt lifelong educational performance?

A new data collection program may provide the long-term information necessary to answer such questions. But access to the data should not be made so difficult that only government agencies can use it. And although answering policy questions is an important goal, it must not be accomplished by violating students’ right to privacy.

North Carolina is one participant in the Statewide Longitudinal Data Systems (SLDS) Grant Program, a project of the U.S. Department of Education. The program provides grants and services to help states design, develop, implement, and expand K-12 and P-20W (early learning through the workforce) longitudinal data systems—that is, data that follow individual students from pre-school to the workforce.

Since 2005, 47 states, the District of Columbia, Puerto Rico, and the Virgin Islands have received SLDS grants. Only Wyoming, New Mexico, and Alabama do not participate in the program. Since its inception, the program has disbursed $362 million in grants.

North Carolina received $6 million in 2007 for the North Carolina Common Education Data Analysis and Reporting System (NC CEDARS) and $3,639,543 in 2012 for “The NC P-20W SLDS Project: Creating a Preschool to Workforce Statewide Longitudinal Data System in North Carolina.”

For North Carolina’s universities, the SLDS grants won’t change which data are collected, just how they are used. Data are already collected from many offices at each campus, including the offices of admissions, the registrar, financial aid, enrollment management, and institutional research; that won’t change. UNC’s interface with the cross-sector longitudinal data system will chiefly be through a new UNC system for collecting and storing data called Student Data Mart, which will streamline data from the 16 campuses into one easy-to-access repository.

Dr. Daniel Cohen-Vogel, senior director of institutional research for the General Administration, explained Data Mart’s main objective. “Right now, we spend about 80 percent of our time on data collection and quality control and 20 percent on analysis. We want to reverse those numbers.” Many smaller campuses, he said, don’t currently have the manpower or technology to adequately warehouse or analyze data. Student Data Mart will be a shared resource that provides that capacity.

If the new system is transparent, it will be a gold mine for social science research. Data sets with such a large number of variables will give researchers a powerful tool to investigate questions about long-term educational outcomes that were previously impossible to study.

Before this project was initiated, researchers relied mainly on small data sets, proxies for actual information on certain variables, institution-level data (which can yield only ballpark estimates), or the federal beginning postsecondary data set (which is released at most once a decade).

For example, one leading piece of research on the efficacy of Pell grants uses data from the Wisconsin Scholars Longitudinal Study, which follows only 6,000 students. To create even that limited data set, its director amassed grants from six different sources. The Data Mart will offer more data than the most dedicated scholars, acting alone, could hope to collect.

There are serious privacy concerns that must be addressed. In order to follow students from year to year, from pre-school to grade school to university and into the workforce, students must be identifiable. This means that each student will have an ID number linked to his or her name. At some level, that information is recorded and warehoused. It must be protected.

The Family Educational Rights and Privacy Act (FERPA), which protects the privacy of student education records, provides the legal framework for preventing improper disclosure of such data. But it’s also imperative that those personal data are not illegally used and are not leaked, sold, or hacked.

Even when precautions are in place, data breaches happen. At UNC-Chapel Hill, such a breach was discovered just last month. According to the university’s website, “on July 30, 2013, during maintenance involving one computer, the safeguards that protected the files against public access were accidentally disabled.” During that time, personal information about employees and students was accessible. UNC General Administration has had data problems of its own.

In December 2013, the North Carolina state auditor’s office issued a critical report on the university’s financial reporting system, saying, “UNC-GA has also not clearly defined responsibilities related to long-term preservation of data, creating an elevated risk of losing historical student and financial data.” Any longitudinal data must be guarded against such breaches—too much personal information is at stake.

But those privacy concerns must be balanced with access. Data do no good if they are simply warehoused and ignored—or accessible to only a few government insiders.

Current barriers to access may already be too great. The most widely used source for data on North Carolina’s elementary and secondary students, teachers, and schools is the North Carolina Education Research Data Center (NCERDC) at Duke University. The Data Center provides what it calls “ready access” to the data needed for policy-oriented research. But that access requires a seven- to nine-step process and access to an Institutional Review Board (which individuals and small non-profit institutions do not have). The Department of Public Instruction’s Statistical Profile provides access to aggregated information, but not the kind of data researchers need to address most policy questions.

Cohen-Vogel agreed that access to the UNC data should be broad, saying “We should err on the side of making [the data] accessible to the broadest population of folks within the confines of the law (FERPA). The more parameters we put on [access], the more subjective the process gets.”

In this case, the federal government provides a model of how to ensure the data are available to a wide audience. The National Center for Education Statistics’ (NCES) DataLab provides several different types of data interface for both novice and experienced researchers, from preset tables to regression tools.

DataLab allows users to quickly find answers to simple and complex questions. For example, one researcher might want to determine the relationship between SAT score, GPA, and graduation rates. Another might simply want to know the most common reason students give for transferring between institutions. The answers to both can be found using DataLab.

The public portal to DataLab never shows students’ names or unique identifiers.

In sum, North Carolina’s longitudinal data present an opportunity, but also a risk. If the risks are addressed, and the data are accessible, the new system can help answer important questions about higher education. And North Carolina could be a model for other states.