Strategic Communications and Marketing News Bureau

Project will help researchers explore big data in HathiTrust digitized library

CHAMPAIGN, Ill. — Illinois English professor Ted Underwood wants to know how the language describing male and female characters in works of fiction has changed since the late 18th century. He’s using data-mining tools to gather information from thousands of books to answer that question.

The problem, though, is that books published after 1922 are still under copyright protection and their content can’t be shared freely online.

“There are hundreds of thousands of books out there, and we don’t talk about them,” Underwood said. “That is a dark landscape after the wall of copyright comes down. We can read the books one by one, but we can’t make generalizing claims at all.”

A project of the HathiTrust Research Center – a collaboration between the University of Illinois and Indiana University – aims to get around that problem and allow scholars to analyze large numbers of books while still respecting copyright laws. The project is being funded by a two-year, $1.17 million grant from the Mellon Foundation.

J. Stephen Downie, the HTRC co-director and a professor in the Graduate School of Library and Information Science, is the Illinois project lead. “Researchers at GSLIS and HTRC are interested in the intersection between the humanities and big data, and in finding ways to advance our use of computational tools to make large strides in humanities research. In this way, we can help scholars like Ted add to our cultural understanding,” Downie said.

A consortium of more than 100 university and public libraries, HathiTrust has amassed a collection of digitized texts containing nearly 14 million volumes and 5 billion pages. About two-thirds of the works in the HathiTrust collection are still under copyright protection.

Until recently, scholars were limited in their research by how many volumes they could read. Now they can use big data to answer research questions through computational analysis that gathers information from huge numbers of books, a concept called “distant reading.”

For example, a scholar may want to find the pages in a book that contain a poem or an image of interest, or find connections between two authors or between an author and a place.

New tools under development at the HathiTrust Research Center will create metadata to better describe the works and search individual pieces to find required information. They will also allow scholars to visualize the data they get in response to a query – for example, how much information comes from a certain time period or geographic region.

But even when scholars know what they want to analyze, the copyrighted material can’t be freely distributed to them online. The Mellon grant will make possible further work on an initiative, the “HTRC Data Capsule,” that will allow researchers to analyze data without violating copyrights. When an analysis is complete, the resulting data is released to the researcher, who never has access to the text itself, thus respecting the copyright protections – a model called “nonconsumptive research.”

“The only thing that is really accessing the data is the algorithm,” Downie said. “It’s very important we remain good copyright citizens.”

HathiTrust Research Center has already developed a prototype that can search smaller sets of data – thousands of volumes. Now researchers want to scale that up so the system will be able to run a search of hundreds of thousands or millions of volumes.

“The HathiTrust collection is big data in size. To step through all the nearly 14 million digitized books in 24 hours would require 14,000 computers running simultaneously,” said Beth Plale, a co-principal investigator for the project, a co-director of the HathiTrust Research Center and a professor of computing and informatics at Indiana University. “The funding received by the Mellon Foundation will allow us to extend the prototype to larger machines.”
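The figures in Plale's estimate imply a per-machine workload that can be worked out directly. The short Python calculation below uses only the numbers quoted above and assumes, for illustration, that the volumes are split evenly and every machine runs for the full 24 hours.

```python
# Figures taken from the quote; the even split across machines is an assumption.
volumes = 14_000_000
machines = 14_000
seconds_per_day = 24 * 60 * 60

volumes_per_machine = volumes // machines
seconds_per_volume = seconds_per_day / volumes_per_machine

print(volumes_per_machine)  # volumes each machine must process in a day
print(seconds_per_volume)   # average time budget per volume, in seconds
```

Under those assumptions each machine handles 1,000 volumes a day, or about 86 seconds per volume.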

It will also help build the secure computer environment necessary for dealing with copyrighted content, she said.

Being able to analyze works in the vast HathiTrust collection means Underwood and other humanities researchers can ask broader questions and have a much larger, more diverse dataset to work with. The research results Underwood gets “could end up being a significantly different picture,” he said. “I’ll be much more confident getting results that reflect a diversity of literary traditions.”

University Librarian John Wilkin said, “By bringing together computation, tools and this remarkable body of text, we can facilitate new and more innovative approaches to solving big problems in a wide array of disciplines. That’s a remarkable difference-maker.”

HathiTrust Research Center was established with the help of seed money from both the University of Illinois and Indiana University, as well as funding from the Eli Lilly Endowment. HathiTrust is providing $1 million over four years with equal collaborative funding from Illinois and Indiana. In addition to support from Mellon, HTRC activities have received funding from the Sloan Foundation, the Institute of Museum and Library Services, the Social Sciences and Humanities Research Council of Canada, and the National Endowment for the Humanities.

Editor’s notes: To reach J. Stephen Downie, email jdownie@illinois.edu. To reach Beth Plale, email plale@indiana.edu.

