Library of Congress Web Archives

Big Picture

Apr 23 2020

It was great to see the web archiving team at the Library of Congress recently celebrate twenty years of archiving. They also had a great article in the New York Times last month. I really like the LC web archive for a number of reasons. One is that it is heavily curated, meaning a lot of metadata is created for each site that is archived. It also means it is thematic. The NYT article opens with the archived memes and web comics, but I really like looking at the election candidate sites. A lot of earlyish political campaign sites for Congress are archived there. Beyond being a valuable scholarly resource, it is always interesting to experience the early web, and viewing it through the lens of these candidates’ websites is unique.

I was thinking about how most library searches are known-item searches, meaning the patron knows what they want and is trying to locate the resource. I would imagine that is doubly true for web archives; it’s hard to “browse” a web archive. Looking for examples, I read this paper on improving the user experience of web archives. It made me think about visualizing a web archive and what that would look like. I’m really a neophyte when it comes to web archiving, though I have been adjacent to it my whole career. I know pretty much nothing, but even I know about things like the Archives Unleashed project and other visualizations of web archives. But I basically had the desire to literally see everything at once, all of the LC archived sites. So I did the most basic thing you could do: take all the websites in the archive, take a screenshot of each homepage, and display them together ordered chronologically. The result is over 21,000 web captures displayed in this zoomable interface:

view in new window

You can click on a site in the interface and it will tell you about it and provide a link to the catalog entry at LC. I like that you can see the earliest websites in the top left evolve into a more modern idea of websites in the bottom right. Is this a visualization? Probably? Is this a practical interface? Probably not? I like things that give you a sense of scale. These are just the first pages of most of the sites in the LC archive. It’s impressive not only for the number of things but also because such a large number of things have been so carefully collected.

Technical details: There are over 25K sites in the LC archive, and this is only 21,300 of them. Some of the sites are not available outside of LC’s campus, so these are only the public sites. A small number of pages also could not be captured because of other problems, such as errors with the page on the Wayback instance. So this is most of the sites, but not every last one.

The process was:
-Scrape all the sites from the Project 1 API
-Ask the Wayback site for the index page for all the captures and pull out the link to the first capture
-Use Python and Selenium to capture each page as a PNG.
-Chop up these images and reduce them slightly to make them more normalized.
-Tile them into a giant 100kx100k pixel image and cut it into tiles.
-Also save the metadata for where each page is placed, so that when you click, the system can figure out which site is at that location and give you its metadata.
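
The last two steps, placing the captures on one big canvas and mapping a click back to a site, could be sketched roughly like this. This is only an illustration and not the code behind the interface: it assumes every capture is resized to a square cell, that the grid is as close to square as possible, and that the sites are already sorted chronologically. The names `layout_grid` and `site_at` are made up for the example.

```python
import math


def layout_grid(site_ids, canvas_px=100_000):
    """Assign each site a cell on a near-square grid, in reading order.

    Assumption for this sketch: all cells are square and the same size.
    Returns (cols, cell_px, placements), where placements maps
    site id -> (col, row).
    """
    if not site_ids:
        return 0, 0, {}
    # Smallest number of columns that gives a grid with enough cells.
    cols = math.ceil(math.sqrt(len(site_ids)))
    # Largest whole-pixel cell size that still fits across the canvas.
    cell_px = canvas_px // cols
    placements = {
        site_id: (i % cols, i // cols) for i, site_id in enumerate(site_ids)
    }
    return cols, cell_px, placements


def site_at(site_ids, cols, cell_px, x, y):
    """Return the site id under canvas pixel (x, y), or None if empty."""
    if cols == 0 or x < 0 or y < 0:
        return None
    col, row = x // cell_px, y // cell_px
    if col >= cols:
        return None  # click landed in the unused strip on the right edge
    index = row * cols + col
    if index >= len(site_ids):
        return None  # click landed past the last capture
    return site_ids[index]


if __name__ == "__main__":
    # Placeholder ids standing in for the chronologically sorted sites.
    sites = [f"site-{i}" for i in range(21_300)]
    cols, cell_px, placements = layout_grid(sites)
    print(cols, cell_px)
    print(site_at(sites, cols, cell_px, 700, 0))
```

Because the order of the list is the order on the canvas, the click lookup is just integer division. The saved placement metadata only needs the sorted list of sites plus the column count and cell size.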

Matt Miller
