Non-Renewed Copyright Analysis

Digging into NYPL's digitized CCE Corpus

Jul 22 2020

Intro

Due to the bizarre nature of US copyright, many works published in the mid-20th century are actually in the public domain. If a resource was published in a specific range of years and did not have its copyright renewed, that material is likely open. The problem is determining this status, because the renewal records are locked up in the Library of Congress Copyright Office’s Catalog of Copyright Entries (CCE). These records are analog lists, of course, but the New York Public Library has an initiative to transform them into data. This summary is an oversimplification of the situation and circumstances (see more details, and watch Greg Cram’s presentation on the project), but it is enough to get us to my specific research question: if there are potentially thousands of published works in the public domain, what TYPE of works are they? What subjects do they cover? Are they mostly already available to the public? Are they widely held by libraries? This analysis uses the NYPL data to try to answer these questions.

Scope and Data

The NYPL is producing this data; the work is ongoing and there is a lot of it. To get started I used Leonard Richardson’s 2019 data export of materials likely not renewed. This dataset is just a small fraction of the total CCE, and it is messy. Here is a small example:

The first problem is that if we want to understand more about these resources, we need to connect them to structured data. The CCE data has only a limited set of fields; we don’t have things like subject headings or other bibliographic fields. This data was created for copyright, not for libraries. The dataset has occasional identifiers like Library of Congress Control Numbers (LCCNs), but there are very few and most are invalid. There is also wide variation in how the fields were populated; the author field, for example, appears in many different formats.
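To make the author-field problem concrete, here is a minimal sketch of the kind of cleanup a name lookup needs. It is not the code used for this analysis. The function name `normalize_author` is hypothetical, and the two patterns it handles ("Surname (Forenames)", which does appear in the CCE data, and "Surname, Forenames", which is an assumption) are only a small subset of the real variation.

```python
import re


def normalize_author(raw: str) -> str:
    """Rewrite 'Surname (Forenames)' or 'Surname, Forenames' as 'Forenames Surname'.

    Anything that matches neither pattern is returned with whitespace collapsed.
    """
    text = " ".join(raw.split())  # collapse runs of whitespace and trim the ends

    # Pattern seen in the CCE data: Bowker (William Rushton)
    match = re.fullmatch(r"(?P<last>[^(),]+?)\s*\((?P<first>[^()]+)\)", text)
    if match is None:
        # Assumed second pattern: Bowker, William Rushton
        match = re.fullmatch(r"(?P<last>[^(),]+?)\s*,\s*(?P<first>[^(),]+)", text)
    if match is None:
        return text

    return f"{match.group('first').strip()} {match.group('last').strip()}"


if __name__ == "__main__":
    print(normalize_author("Bowker (William Rushton)"))
```

A real cleanup pass would need many more cases (corporate authors, pseudonyms, multiple authors), which is part of why the reconciliation is lossy.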

With no identifiers, we are left to try name and title lookups. Plopping “Great American Novel by Jane Smith” into a bibliographic database is not going to get you very far, but we do have one advantage based on the flow of materials at the Library of Congress. Because of copyright law, publishers needed to deposit two copies of their work with the Copyright Office. The office would do its processing and then pass the materials over to the Library, which would decide whether the deposited material should be added to its collection. As is still the case today, not everything sent to the Copyright Office was added to the Library. A sudoku puzzle book might be copyrighted, but it is not likely to be accessioned into LC’s collection. Because of this relationship there is overlap, so a name-title lookup in LC’s catalog is much more likely to find the correct resource than one in a larger aggregator database like WorldCat.

Fortunately, as of 2019 LC’s catalog has been loaded into id.loc.gov as BIBFRAME, providing API search access to the catalog. Using this I can try lookups to match each CCE record to an LC catalog record. Once I have a likely LC record, I have access to all of its data, including structured fields like names and modern identifiers, which I can then use in other bibliographic databases to get more info.

This initial reconciliation drastically reduces the record count. The non-renewed CCE file had 520,000 records in it, and I was only able to reconcile 54,000 of them (roughly 10%) to LC records with a very high degree of confidence. My feeling is that a lot of the non-reconciled records are things like grey literature and other copyrighted but not accessioned materials. So we are using 54,000 records as a sample for this analysis. This number could be increased with human intervention in the process, but it is a good size to get a sense of what is in the non-renewed corpus.
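One way to think about "a very high degree of confidence" is as a similarity threshold on both title and author. The sketch below uses Python's standard `difflib` for this; it is an illustration, not the matching logic actually used here. The helper names, the record shape (dicts with `title` and `author` keys) and the 0.9 threshold are all assumptions, and it expects author names to already be in the same word order on both sides.

```python
import re
from difflib import SequenceMatcher


def _key(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9 ]", " ", text.lower()).split())


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two normalized strings."""
    return SequenceMatcher(None, _key(a), _key(b)).ratio()


def best_match(cce_record: dict, candidates: list, threshold: float = 0.9):
    """Return the candidate whose title AND author both resemble the CCE record.

    Each candidate is scored by the weaker of its two similarities, so a
    perfect title cannot compensate for the wrong author. Returns None when
    no candidate reaches the threshold.
    """
    best, best_score = None, 0.0
    for candidate in candidates:
        score = min(
            similarity(cce_record["title"], candidate["title"]),
            similarity(cce_record["author"], candidate["author"]),
        )
        if score > best_score:
            best, best_score = candidate, score
    return best if best_score >= threshold else None
```

A strict threshold like this trades recall for precision, which is consistent with most of the file being left unmatched.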

Analysis

To get a sense of what types of materials there are I made a treemap of their Library of Congress Classification (LCC).

Treemap of 54K Non-renewed resources by Library of Congress Classification (LCC) hierarchy

This shows us that the largest category of materials is Language and Literature, which is composed mostly of American Literature and Fiction and Juvenile Belles Lettres. From this map we can see that the majority of the non-renewed resources are humanities based: history, philosophy, etc. But there is also a significant amount of social sciences and natural sciences.
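The top level of a treemap like this comes from the first letter of each LCC call number. As a rough sketch (not the code behind the chart), the grouping could look like the following. The function names and the sample call numbers are made up for illustration, and the class labels are shortened forms of the standard LCC outline.

```python
import re
from collections import Counter

# Top-level LCC classes. The letters I, O, W, X and Y are not used by LCC.
LCC_CLASSES = {
    "A": "General Works",
    "B": "Philosophy, Psychology, and Religion",
    "C": "Auxiliary Sciences of History",
    "D": "World History",
    "E": "History of the Americas",
    "F": "History of the Americas",
    "G": "Geography, Anthropology, Recreation",
    "H": "Social Sciences",
    "J": "Political Science",
    "K": "Law",
    "L": "Education",
    "M": "Music",
    "N": "Fine Arts",
    "P": "Language and Literature",
    "Q": "Science",
    "R": "Medicine",
    "S": "Agriculture",
    "T": "Technology",
    "U": "Military Science",
    "V": "Naval Science",
    "Z": "Bibliography and Library Science",
}


def top_level_class(call_number: str) -> str:
    """Map a call number such as 'PS3503' to its top-level LCC class name."""
    match = re.match(r"\s*([A-Za-z])", call_number)
    if match is None:
        return "Unknown"
    return LCC_CLASSES.get(match.group(1).upper(), "Unknown")


def count_by_class(call_numbers) -> Counter:
    """Count records per top-level class: the sizes of the outermost treemap boxes."""
    return Counter(top_level_class(number) for number in call_numbers)


if __name__ == "__main__":
    # Made-up call numbers, for illustration only.
    print(count_by_class(["PS3503", "PZ7", "QD31", "BF121"]))
```

The nested boxes inside each class would come from the same idea applied to the subclass letters (PS for American Literature, PZ for Fiction and Juvenile Belles Lettres).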

I also wanted to allow you to explore the corpus at a record level. This tool allows you to browse by subject heading. You can facet on two subject headings and see the resulting record details, including links to external sites like the Library of Congress, WorldCat, HathiTrust and OpenLibrary.

I also wanted to show how available these works currently are, both online and physically. This first graph shows the number of institutions that hold each work. The majority of the non-renewed materials are widely held by U.S. libraries. However, there is a long tail of resources held by only a small number of institutions.

Resources by number of holding institutions

HathiTrust has been doing this work of making non-renewed materials available to the public for a long time. I wanted to see what percentage of this corpus is on their platform and at what level of access.

Resources by HathiTrust status

This chart shows us that about 30% of the corpus is already available as Full View, public full text on HathiTrust. A smaller percentage is on the platform but not marked as public view. And about 42% of the 54K I was not able to find on the platform at all.
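The percentages are simple shares of a per-record status label. A minimal sketch of that calculation, assuming each record has already been tagged with one status string (the labels `full`, `limited` and `not_found` and the function name are hypothetical, and the sample list is toy data, not the real corpus):

```python
from collections import Counter


def status_shares(statuses) -> dict:
    """Return each status's share of all records as a percentage (one decimal place)."""
    counts = Counter(statuses)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {status: round(100 * n / total, 1) for status, n in counts.items()}


if __name__ == "__main__":
    # Toy data for illustration only.
    sample = ["full"] * 3 + ["limited"] * 3 + ["not_found"] * 4
    print(status_shares(sample))
```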

This last graph breaks this data down into top level LCC categories for the three types of access, full, limited and not found:

Non-Renewed resources with "Full View" HathiTrust status by top level LCC

This stream graph shows that the largest categories of material not found on Hathi at all are Language and Literature and Philosophy, Psychology, and Religion. There are other interesting trends. For example, in the Full View category there are more materials at the beginning and end of the range, with contractions in between that likely relate to publishing trends for those years.

These are just a few of the possible ways to slice into this data. I’ve made the 54K dataset available here to download if you would like to experiment with it as well. As the datafied version of the CCE grows from NYPL’s project, it would be interesting to run this analysis again, and of course there is opportunity to improve the process.

Download

Macro to Micro

That brings the large-scale analysis of this dataset to a close. But I wanted to try to go from this huge list of 54,000 books down to just one book and see how far I could get.

While the titles were flicking across my screen in the endless reconciliation and remediation processes, I picked one based on the strange name alone:

			Title: Bizzy-Quizzy the great 
			By: Bowker (William Rushton)

It was not on HathiTrust at all, and WorldCat says it exists in only two libraries in the world. It is also in the Library of Congress, meaning 3 copies of it exist in libraries. The metadata makes it look like it’s fiction, and about a science detective, or something. So I bought it online. A book dealer had one copy available, meaning there are at least 4 copies of it in the world.

When it finally arrived I dug in, and dear reader, it is an extraordinarily bad novel. It was self-published in 1942 by William Bowker, a retired UCLA chemistry professor. It’s dedicated to Mahatma Gandhi and World Peace and features a frontispiece image of the author and Babe Ruth, for some reason. It’s a series of short stories in which Bowker imagines himself as “Bizzy-Quizzy,” a great detective, mostly recovering gemstones and using chemistry for MacGyver-esque escapes. And really, I’m not trying to dunk on a nearly 80-year-old self-published novel, but it is mostly unreadable. So I put it on Internet Archive so you could enjoy it too:

My copy of the book also came with a number of documents, notes and flyers (available at the end on IA). I researched Bowker a little, and he was a very accomplished academic who published many books on chemistry. A newspaper clipping was taped inside the book, documenting a probate court case: Bowker had left $2,000 in his will to his cats. But there was some debate about whether the cats still living in his house were the same 22 cats that were alive when he wrote them into his will. And that’s where I left it, drilling down into one book as much as I could.

The world is a big place and gets bigger every day. Like all books, these non-renewed resources reflect some aspect of it. Even if it is the (bad) musings of a 1940s chemistry professor who loved cats, it’s nice to open it up and put it out there. These non-renewed books are an important resource, and I look forward to seeing the NYPL project develop.

Matt Miller