CHAMPAIGN, Ill. — The University of Illinois has developed a repository that stores the data of Illinois researchers and provides access to it for other researchers who want to use the data in their own analyses.
The Research Data Service – a new program based at the University Library and designed to help Illinois researchers manage their research data – developed the Illinois Data Bank, a file-based repository that preserves research data. The service is available to U. of I. faculty, staff and graduate students.
Heidi Imker, the director of the Research Data Service, said grant-funding agencies and publishers are increasingly asking researchers to make their research data publicly available to others to review or reuse.
“Although at first pass it can seem like a burden, making data available has many benefits for our campus researchers,” Imker said. “Not only does sharing give others the ability to really verify or extend your findings, the transparency shows you have confidence in your research. It can also extend the impact of your work by serving as another research product that can be discovered and cited.”
Vice Chancellor for Research Peter Schiffer said the Illinois Data Bank helps ensure the U. of I. is a good steward of its intellectual resources.
“We’ve undertaken this effort to facilitate open access to federally funded research data, but it also is the right thing to do at an institution like ours. Data and research results drive innovation and enable discovery, and as an international research leader, it only makes sense that we should be out in front on this effort,” Schiffer said.
To ensure scalability, the data bank uses the University Library’s digital preservation platform and the storage infrastructure of the National Center for Supercomputing Applications.
The data bank had a soft launch this summer, with a handful of faculty members testing the service by depositing data. One of them is Tandy Warnow, a professor of computer science and bioengineering. Her research is in the area of phylogenomics – the estimation of evolutionary histories. She has studied, for instance, how different bird species are linked on the avian family tree and how various languages developed in relation to each other.
“We do a huge amount of simulation and testing to understand methods and how methods work on data sets,” Warnow said. “What this amounts to is a lot of data that needs to be stored and made available to the research community.”
The real challenge with Warnow’s avian phylogenomics data set, though, was that it was organized in tens of thousands of files showing simulations of the evolution of various DNA sequences, estimations of how the evolutionary trees might look, and how multiple evolutionary trees might fit together.
“I’ve never had to store this amount of data,” Warnow said. “Having data-sharing resources available is an unbelievably helpful and necessary thing for us in terms of transparency and the data being usable in the research community.”
Researchers often generate new data to explore various methods of analysis, she said, and it’s difficult for other researchers to reproduce exactly the same data sets to test their own analyses. Having access to another researcher’s data set also helps identify any problems in the way that researcher generated or analyzed the data, she said.
Some researchers store their data on their own servers, but how data is organized, where it is stored and whether it is publicly available vary from person to person, Imker said.
For Daniel Work, a professor of civil and environmental engineering, the issue with making his research data available was not just storing it but maintaining it.
Work looks at how to build smarter transportation systems by using large data sets to estimate traffic conditions. He recently looked at data for all the taxi trips in New York City from 2010 to 2013 – a total of 700 million taxi trips – and he used NCSA’s supercomputer to analyze the data hour by hour to produce estimates of traffic patterns.
Most of his colleagues don’t have access to a supercomputer, so they would like to have access to his data. Even researchers outside his field would like to use such a large data set to help answer their own research questions, Work said.
Work said the staff at the Research Data Service helped him figure out the best way to store the data using the Illinois Data Bank and make it available to other researchers.
“It’s been super-helpful to have a team of smart people trying to figure out what the implications are of the choices we have of storing the data,” Work said. “It reduces the degree of difficulty of sharing to the point where the technical issues are largely addressed.”
The Research Data Service staff has been fine-tuning its procedures for handling data sets. Imker said the staff curates deposits by reviewing data sets for potential issues, adding keywords to improve discoverability, suggesting additional documentation, or linking the data sets to publications, software programs or associated data in another repository.
Illinois is working with the University of Minnesota, Washington University, Cornell University, the University of Michigan and Penn State University to develop a shared data-curation network through an Alfred P. Sloan Foundation planning grant. Imker said each campus will provide expertise in a particular subject area.
“We’ve taken a very pragmatic but forward-thinking and scalable approach with the Illinois Data Bank,” Imker said. “We feel confident that we’re providing a resource that is both sustainable and of benefit to the Illinois research community and beyond.”