Hathi PD 2025

Intro

Towards the end of each year HathiTrust quietly posts a collection of its resources that are set to enter the public domain in the new year. It is a great resource, and for the past few years I’ve made some visualizations or lists for 1925, 1926, 1927 and 1928. This year, for 1929, I spent more time than in the past enriching the dataset and built three interfaces to explore the data. This post walks through the three tools I developed and talks a little about the enrichment behind them.

TL;DR: If you don’t want to go on my metadata journey and just want to use the tools, check them out here.

Stats

First let’s look at the scope of what we are working with. The HathiTrust collection has over 70,000 volumes, meaning physical objects that were published in 1929 and entered the public domain this year, 2025. But objects and intellectual works are of course not one-to-one. In this case the 70K objects reduce down to about 50,000 individual works. The 20,000 difference comes from multiple volumes belonging to a single work: a book may have 10 volumes, which are counted ten times in the 70K but only once in the 50K.
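As a rough illustration of the volume-to-work reduction (not the actual scripts from the repo), here is a minimal sketch. The `volume_id` and `work_id` field names and the sample records are assumptions made up for the example.

```python
from collections import defaultdict

# Hypothetical volume records: each one names the work it belongs to.
volumes = [
    {"volume_id": "v1", "work_id": "w1"},
    {"volume_id": "v2", "work_id": "w1"},
    {"volume_id": "v3", "work_id": "w1"},
    {"volume_id": "v4", "work_id": "w2"},
]

def group_by_work(records):
    """Map each work identifier to the list of its volume identifiers."""
    works = defaultdict(list)
    for record in records:
        works[record["work_id"]].append(record["volume_id"])
    return dict(works)

works = group_by_work(volumes)
print(len(volumes), "volumes ->", len(works), "works")
```

A three-volume work counts three times in the volume total but only once in the work total, which is the same effect that turns 70K volumes into about 50K works.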

So out of the 50K we can break it down by format. About 34K (68%) of them are books while 16K (31%) are journals. I spent a lot of time enriching the data because these metadata records are commonly barebones: there is not a lot of description available in the MARC record provided by HathiTrust. So we need to look around and ask for more data. We can do this by reconciling the works against data repositories like the Library of Congress and OCLC. The big things I want in each record are the Library of Congress Classification number (LCC) and subject headings. Only about 38% of the records had an LCC originally, so using a lot of scripts I went through and got that number up to 40K, or 81%.
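The enrichment step boils down to filling in a missing LCC from an outside source and then measuring coverage. The sketch below shows that idea only. The record fields (`oclc`, `lcc`) and the in-memory `external` lookup are assumptions standing in for whatever the real reconciliation against the Library of Congress and OCLC returns.

```python
def fill_missing_lcc(records, lookup):
    """Return copies of records, adding an LCC from `lookup` where one is missing."""
    enriched = []
    for record in records:
        updated = dict(record)
        if not updated.get("lcc"):
            found = lookup.get(updated.get("oclc"))
            if found:
                updated["lcc"] = found
        enriched.append(updated)
    return enriched

def lcc_coverage(records):
    """Fraction of records that carry a non-empty LCC."""
    if not records:
        return 0.0
    return sum(1 for r in records if r.get("lcc")) / len(records)

# Made-up sample records; only the first one starts with an LCC.
records = [
    {"title": "A", "oclc": "1", "lcc": "PS3511"},
    {"title": "B", "oclc": "2", "lcc": None},
    {"title": "C", "oclc": "3", "lcc": None},
    {"title": "D", "oclc": None, "lcc": None},
]
external = {"2": "QA76"}  # stand-in for data reconciled from an outside source

before = lcc_coverage(records)
after = lcc_coverage(fill_missing_lcc(records, external))
print(f"{before:.0%} -> {after:.0%}")
```

Working on copies keeps the original records untouched, so the before and after coverage can be compared on the same input.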

Literature Browser

I think a big obstacle to exploring these titles is that you are faced with a wall of books and no real way to dig into them. It’s very opaque. I wanted to make an interface that focused only on literature or fiction. This ended up being a tiny fraction of the collection, only 3,200 titles, but that is still a lot of books. My goal was to be able to say something like “I want to see horror books from 1929.” To do that you need metadata that is not going to be in the original minimal record. Even after enriching the titles from other sources, for a lot of them it is really hard to say what exactly the title is about. LCC can place them roughly in a category like “Fiction and juvenile belles lettres” but that is as deep as it goes. LCSH subject headings are almost non-existent for this corpus as well.

The nice thing, though, is that these titles are now in the public domain, meaning you can access their full text. So let’s use that to build application-level descriptions and genre terms. I passed the first 50 pages of each book through the Llama-3.3-70B-Instruct model to add some new metadata. This model is nice because, while it is not open source, it is open weight, meaning you can download it and run it locally if you have the hardware. I asked the LLM to build genre tags, write a description blurb and check for overtly racist words or sentiments. Old books are gonna old book, and of course you can place things into context, but sometimes you just don’t want to read pages of racial slurs, right? So I thought it would be nice to flag those up front. It works okay; it took a bit of testing to get the model to actually do this, but I got it eventually. The genre and description are fine. I am not advocating LLM use on metadata of record, of course, but for these thousands of books with no context about what they are and no hooks to browse them, I think this works well enough for an interface. I also included the LCSH when available.
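To make the shape of that step concrete, here is a minimal sketch of building a prompt from the first 50 pages and parsing a structured reply. This is not the prompt or code used for the project. The prompt wording, the JSON keys and the `call_model` function are all assumptions, and `fake_model` is a stand-in so the example runs without any LLM.

```python
import json

def build_prompt(pages, max_pages=50):
    """Join the first `max_pages` pages into one prompt asking for JSON output."""
    excerpt = "\n".join(pages[:max_pages])
    return (
        "Read this excerpt from a book published in 1929. "
        "Reply with JSON only, using the keys "
        '"genres" (list of strings), "description" (string), '
        'and "flag_racist_content" (true or false).\n\n' + excerpt
    )

def parse_response(raw):
    """Parse the model's reply, returning None if it is not a JSON object."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    return {
        "genres": [str(g).strip().lower() for g in (data.get("genres") or [])],
        "description": str(data.get("description", "")).strip(),
        "flag_racist_content": bool(data.get("flag_racist_content", False)),
    }

def describe_book(pages, call_model):
    """`call_model` is any function that takes a prompt string and returns text."""
    return parse_response(call_model(build_prompt(pages)))

# A fake model (hypothetical) so the sketch runs without an LLM installed.
def fake_model(prompt):
    return ('{"genres": ["Horror", "Gothic"], '
            '"description": "A haunted house tale.", '
            '"flag_racist_content": false}')

result = describe_book(["page one text", "page two text"], fake_model)
print(result)
```

Returning `None` for replies that are not valid JSON makes it easy to spot and retry the books where the model did not follow the requested format.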

Take a look at the interface: https://thisismattmiller.github.io/hathi-pd-2025/lit-browse/

Canned search, for example public domain horror books from 1929:

Here is a video:

LCC Browse

While the last system was only for literature, this tool lets you browse anything that has an LCC, around 40,000 titles. This includes the journals and other non-fiction. The interface takes full advantage of the LCC hierarchy: you can dig down through the various levels of the system and sort by things like holdings count, which tells you the most popular title in that range.
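The drill-down idea can be sketched in a few lines: split each call number into a path (main class letter, subclass letters, class number), keep the records under a chosen path, and sort by holdings. The regular expression is a simplification that ignores cutter numbers, and the field names and sample records are assumptions, not the interface's actual data model.

```python
import re

# Simplified pattern: one to three class letters, then an optional class number.
LCC_PATTERN = re.compile(r"^\s*([A-Z]{1,3})\s*(\d+(?:\.\d+)?)?")

def lcc_path(call_number):
    """Split an LCC call number into a path: main class, subclass, class number."""
    match = LCC_PATTERN.match(call_number.upper())
    if not match:
        return None
    letters, number = match.groups()
    path = [letters[0]]
    if len(letters) > 1:
        path.append(letters)
    if number:
        path.append(f"{letters}{number}")
    return path

def titles_under(records, prefix_path):
    """Return records whose LCC path starts with `prefix_path`, most-held first."""
    depth = len(prefix_path)
    matches = [
        r for r in records
        if (lcc_path(r["lcc"]) or [])[:depth] == prefix_path
    ]
    return sorted(matches, key=lambda r: r["holdings"], reverse=True)

# Made-up sample records.
records = [
    {"title": "Novel A", "lcc": "PS3511.A86", "holdings": 120},
    {"title": "Novel B", "lcc": "PS3545 .I5", "holdings": 340},
    {"title": "Physics text", "lcc": "QC21", "holdings": 75},
]

for record in titles_under(records, ["P", "PS"]):
    print(record["holdings"], record["title"])
```

Passing a shorter path such as `["P"]` widens the range to the whole main class, which is the same move as stepping back up a level in the browser.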

Take a look at the interface: https://thisismattmiller.github.io/hathi-pd-2025/lcc-browse/

Here is a video of the interface:

LCC Visualization

The final tool is similar to the last, just visual. It lets you see the 40,000 titles by LCC in an icicle graph. You can click and visually drill down and explore titles.
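For readers unfamiliar with icicle graphs, the layout is simple: each node gets a horizontal span proportional to the number of titles beneath it, and each level of the hierarchy sits one row lower. The sketch below computes those spans for a tiny made-up tree. The node structure (`name`, `count`, `children`) is an assumption, and this is not the code behind the visualization.

```python
def subtree_count(node):
    """Total number of titles under a node."""
    if "children" in node:
        return sum(subtree_count(child) for child in node["children"])
    return node.get("count", 0)

def icicle_layout(node, x0=0.0, x1=1.0, depth=0, out=None):
    """Return (name, depth, x0, x1) rectangles, widths proportional to counts."""
    if out is None:
        out = []
    out.append((node["name"], depth, x0, x1))
    children = node.get("children", [])
    total = sum(subtree_count(child) for child in children)
    cursor = x0
    for child in children:
        width = (x1 - x0) * subtree_count(child) / total if total else 0.0
        icicle_layout(child, cursor, cursor + width, depth + 1, out)
        cursor += width
    return out

# Made-up hierarchy with made-up counts.
tree = {
    "name": "all",
    "children": [
        {"name": "P", "children": [
            {"name": "PS", "count": 3},
            {"name": "PR", "count": 1},
        ]},
        {"name": "Q", "count": 4},
    ],
}

for name, depth, left, right in icicle_layout(tree):
    print("  " * depth + f"{name}: {left:.3f} to {right:.3f}")
```

Clicking to drill down then amounts to re-running the layout with the chosen node as the new root, so it fills the full width.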

Take a look at the interface: https://thisismattmiller.github.io/hathi-pd-2025/lcc-viz/

Here is a video:

Code/Data

You can find the scripts and data here: https://github.com/thisismattmiller/hathi-pd-2025/

Matt Miller

HathiTrust 1929 Public Domain

Data and tools to explore 50,000 1929/2025 public domain titles in HathiTrust

Feb 5 2025