Strategic Communications and Marketing News Bureau

Project will help researchers explore big data in HathiTrust digitized library

CHAMPAIGN, Ill. — Illinois English professor Ted Underwood wants to know how the language describing male and female characters in works of fiction has changed since the late 18th century. He’s using data-mining tools to gather information from thousands of books to answer that question.

The problem, though, is that books published after 1922 are still under copyright protection and their content can’t be shared freely online.

“There are hundreds of thousands of books out there, and we don’t talk about them,” Underwood said. “That is a dark landscape after the wall of copyright comes down. We can read the books one by one, but we can’t make generalizing claims at all.”

A project of the HathiTrust Research Center – a collaboration between the University of Illinois and Indiana University – aims to get around that problem and allow scholars to analyze large numbers of books while still respecting copyright laws. The project is being funded by a two-year, $1.17 million grant from the Mellon Foundation.

J. Stephen Downie, the HTRC co-director and a professor in the Graduate School of Library and Information Science, is the Illinois project lead. “Researchers at GSLIS and HTRC are interested in the intersection between the humanities and big data, and in finding ways to advance our use of computational tools to make large strides in humanities research. In this way, we can help scholars like Ted add to our cultural understanding,” Downie said.

A consortium of more than 100 university and public libraries, HathiTrust has amassed a collection of digitized texts containing nearly 14 million volumes and 5 billion pages. About two-thirds of the works in the HathiTrust collection are still under copyright protection.

Until recently, scholars were limited in their research by how many volumes they could read. Now they can use big data to answer research questions through computational analysis that gathers information from huge numbers of books, a concept called “distant reading.”

For example, a scholar may want to find the pages in a book that contain a poem or an image of interest, or find connections between two authors or between an author and a place.

New tools under development at the HathiTrust Research Center will create metadata to better describe the works and search individual pieces to find required information. They will also allow scholars to visualize the data they get in response to a query – for example, how much information comes from a certain time period or geographic region.

But even when scholars know what they want to analyze, the copyrighted material can’t be freely distributed to them online. The Mellon grant will make possible further work on an initiative, the “HTRC Data Capsule,” that will allow researchers to analyze data without violating copyrights. When the data is retrieved, it is released to the researcher without that person ever having access to the online text, thus respecting the copyright protections – a model called “nonconsumptive research.”

“The only thing that is really accessing the data is the algorithm,” Downie said. “It’s very important we remain good copyright citizens.”

The HathiTrust Research Center has already developed a prototype that can search smaller sets of data, on the order of thousands of volumes. Now researchers want to scale that up so the system can search hundreds of thousands or millions of volumes.

“The HathiTrust collection is big data in size. To step through all the nearly 14 million digitized books in 24 hours would require 14,000 computers running simultaneously,” said Beth Plale, a co-principal investigator for the project, a co-director of the HathiTrust Research Center and a professor of computing and informatics at Indiana University. “The funding received by the Mellon Foundation will allow us to extend the prototype to larger machines.”

It will also help build the secure computer environment necessary for dealing with copyrighted content, she said.

Being able to analyze works in the vast HathiTrust collection means Underwood and other humanities researchers can ask broader questions and have a much larger, more diverse dataset to work with. The research results Underwood gets “could end up being a significantly different picture,” he said. “I’ll be much more confident getting results that reflect a diversity of literary traditions.”

University Librarian John Wilkin said, “By bringing together computation, tools and this remarkable body of text, we can facilitate new and more innovative approaches to solving big problems in a wide array of disciplines. That’s a remarkable difference-maker.”

The HathiTrust Research Center was established with the help of seed money from both the University of Illinois and Indiana University, as well as funding from the Lilly Endowment. HathiTrust is providing $1 million over four years with equal collaborative funding from Illinois and Indiana. In addition to support from Mellon, HTRC activities have received funding from the Sloan Foundation, the Institute of Museum and Library Services, the Social Sciences and Humanities Research Council of Canada, and the National Endowment for the Humanities.

Editor’s notes: To reach J. Stephen Downie, email jdownie@illinois.edu. To reach Beth Plale, email plale@indiana.edu.

