Flickr Commons is a program to bring the visual collections of cultural heritage organizations to new audiences, getting these resources in front of people where they already are online instead of leaving them siloed on an institution's own website or not online at all. It was a pretty groundbreaking project. The Library of Congress was the first participant and now has over 40,000 photos on Flickr. The program continues today under the Flickr Foundation. Since it started in 2008 a lot has been published about the project, including this webcast, a project report, and a 2024 impact report. While the project predates my time at the library by a decade and my job at LC has nothing to do with these collections, I was really compelled by having potentially 17 years of data about interactions between the public and these materials. This post analyzes and visualizes that data.
I’ll be using data from the public Flickr API that I harvested back in 2024 (I unreliably work on too many personal projects for years, and then eventually something causes me to finish one, like being furloughed in a US government shutdown). So this is all public data, and the code I use to do everything on this page can be found in this GitHub repo.
The comments are the big thing with this project. They are the largest interaction surface between the public and the photos. With over 95,000 comments made on the photos over the 17 years, I had a lot of questions about what people are saying. To organize them I built embeddings for all 95K comments using the Google Gemini gemini-embedding-001 model. This produces a 3072-dimensional vector for each comment, which I then reduced to two-dimensional space before running some clustering over the result to build communities of comments. I then sent a random sample from each community off to an LLM to classify it into a group based on the actual text of the comments. Here are the communities:
Category ID  Count  Category Label
================================================================================
0            4464   Historical Context and Identification with Links
1            2782   Aesthetic Praise and Admiration
2            1973   Biographical & Historical Wikimedia Contributions
3            3765   Factual Historical Annotation
4            5553   Historical Detail Identification and Contextualization
5            678    Explore Congratulations
6            595    See Also
7            1402   Historical Subject Identification and Details
8            2529   Location and Status Verification
9            202    Aesthetic Feedback
10           4754   LC Staff Thanks for Metadata Improvement
11           1204   Non-English Compliments
12           629    Flickr Group Invitations
13           1770   Crowdsourced Historical Data Refinement
14           4230   Historical Performing Artist Biographical Documentation
15           3469   Sourced Historical Details and Context
16           335    Flickr Group Invitations
17           1687   Flickr Group Invitations
18           3315   Location Verification and Contemporary Comparison
19           1054   Cross-referencing and Linked Information
20           466    External Content Feature Notification
21           6199   Observations on Period Appearance
22           2806   LC Staff Thanks for Contributions
23           954    Wikidata Zone 💪
24           2571   Factual Correction and Archival Enhancement
25           1620   Historical Factual Identification and Context
26           2130   Historical Baseball Identification and Contextualization
27           296    Flickr Group Invitations
28           3402   Factual contributions and corrections
29           4184   Factual Identification and Historical Context
30           661    Group Invitations
31           270    Identification and Biographical Information of Historical Figures
32           7373   Historical Photo Annotation
33           5265   Biographical and Genealogical Identification
34           372    Flickr Group Invitations
35           5487   Praise
36           4659   Historical Annotation
37           223    Compliments
================================================================================
Total: 95328
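For anyone curious what the "reduce to two dimensions, then cluster" step looks like in practice, here is a minimal sketch. It is not the code from the repo: I don't name the reduction or clustering methods above, so PCA and plain k-means are used here purely as stand-ins, and the "embeddings" are random vectors rather than real comment embeddings. The function names `reduce_to_2d` and `kmeans` are made up for illustration.

```python
import numpy as np

def reduce_to_2d(vectors):
    """Project vectors onto their top two principal components (PCA via SVD)."""
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

def kmeans(points, k, iterations=50, seed=0):
    """Plain k-means; returns one integer cluster label per point."""
    rng = np.random.default_rng(seed)
    # Start from k distinct points picked at random (fancy indexing copies them).
    centers = points[rng.choice(len(points), size=k, replace=False)]
    labels = np.zeros(len(points), dtype=int)
    for _ in range(iterations):
        # Distance from every point to every center, then pick the nearest.
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members):  # keep the old center if a cluster empties out
                centers[j] = members.mean(axis=0)
    return labels

# Stand-in data (assumption): 300 fake 3072-dimensional vectors around 3 centers.
rng = np.random.default_rng(42)
fake_centers = rng.normal(size=(3, 3072))
embeddings = np.vstack([c + 0.1 * rng.normal(size=(100, 3072)) for c in fake_centers])

coords = reduce_to_2d(embeddings)   # one (x, y) pair per comment
labels = kmeans(coords, k=3)        # one community id per comment
```

With real data, `coords` is what gets plotted and `labels` is what decides which comments are sampled together for labeling.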
I then created a visualization to display all of these comments. Once it has loaded you can hover over a comment to see its text, and clicking will take you to that comment on Flickr. It probably does not work well on mobile. Explore the communities below or open in a new tab.
The comments aggregate into interesting communities, the majority of them related to contributing some information about the photo. That can take the form of context, facts, “See also” links, links out to the Wikimedia ecosystem, and others. There are also large communities of comments praising the photo or making related aesthetic remarks, plus more meta Flickr comments such as group invitations.
I really love that there are some hyper-specialized communities within the broader group, like the red cluster in the lower left of the graph with roughly 2,000 comments about “Historical Baseball Identification and Contextualization.” It is just a community of comments adding info about baseball-related images.
Another one that caught my attention is the “Location and Status Verification” cluster. These are comments, typically on a John Margolies photograph, noting that what is depicted is still there. Or better yet, they include a set of GPS coordinates or a link to a Google Maps page geolocating the image in the current world.
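Pulling a location out of comments like these can be sketched with a couple of regular expressions. This is a rough illustration rather than the code in the repo: the two patterns (a bare decimal latitude/longitude pair, and the `@lat,lng` segment that Google Maps URLs commonly carry) are my assumptions about what such comments look like, and `extract_location` is a made-up name.

```python
import re

# A bare decimal pair such as "36.1699, -115.1398" (assumed comment format).
COORD_PAIR = re.compile(r"(?<![\d.])(-?\d{1,2}\.\d+)\s*,\s*(-?\d{1,3}\.\d+)")
# Google Maps URLs commonly carry the map center as "@lat,lng".
MAPS_AT = re.compile(r"google\.[a-z.]+/maps\S*?@(-?\d{1,2}\.\d+),(-?\d{1,3}\.\d+)")

def extract_location(comment_text):
    """Return (lat, lng) found in a comment, or None if nothing plausible is there."""
    for pattern in (MAPS_AT, COORD_PAIR):
        match = pattern.search(comment_text)
        if match:
            lat, lng = float(match.group(1)), float(match.group(2))
            # Reject number pairs that cannot be coordinates.
            if -90 <= lat <= 90 and -180 <= lng <= 180:
                return lat, lng
    return None

# Made-up comments for illustration.
print(extract_location("Still standing! 36.1699, -115.1398"))
print(extract_location("https://www.google.com/maps/@34.0522,-118.2437,17z"))
print(extract_location("Beautiful photo"))
```

Real comments are messier than this (degrees-minutes-seconds notation, shortened map links), so a pass like this would only catch the easy cases.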
I took all of these comments, extracted the geo information from them, and made a little site to browse all of the photos that had location comments added, with the ability to view each one in Google Maps Street View:
Here is a demo video of the site:
Overall there were a lot of comments made on these photos on the Flickr site. Here is a histogram for the whole 17 years:
Here is a list of the top 50 photos with the most comments:
There are other types of interactions, like Tagging, where users build up a folksonomy, and Notes, which are comments attached to a region of the image. I was curious if the “80/20” rule applied to all types of user interaction. The idea is basically that 80 percent of the work/effort/whatever of something comes from 20 percent of the people involved. I have always associated this metric with crowdsourcing, since I learned about it during my NYPL Labs days with all of our crowdsourcing projects. It is formally called the Pareto principle, and before I ran the numbers I knew for certain it would be true for this data, because it always is for crowdsourcing. But would it be 20% or smaller? Here is an interface to explore the data:
The data shows that for comments, only 11% of users made 80% of all comments. For Notes it is a little larger, with 14% making 80% of them, but for Tags it is extreme, with 1% of users responsible for 80% of all tags. The Tags figure is skewed by one single user adding over 70,000 tags, and I’m not sure if this was an automated process. Like all crowdsourcing projects, these interactions follow the trend of a small core group of individuals who are really passionate about the material.
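The calculation behind numbers like these is simple enough to sketch: count actions per user, sort users from most to least active, and walk down the list until 80% of all actions are covered. This is a minimal version under the assumption that each interaction is available as a user id; `share_of_users_for` is a made-up name and the sample data is invented.

```python
from collections import Counter

def share_of_users_for(user_ids, target=0.8):
    """Fraction of users who, taken from most to least active,
    account for at least `target` of all actions."""
    counts = sorted(Counter(user_ids).values(), reverse=True)
    total = sum(counts)
    running = 0
    for rank, count in enumerate(counts, start=1):
        running += count
        if running >= target * total:
            return rank / len(counts)
    return 1.0  # only reached when there are no actions at all

# Invented example: one very active user and four occasional ones.
actions = ["a"] * 85 + ["b"] * 5 + ["c"] * 5 + ["d"] * 3 + ["e"] * 2
print(share_of_users_for(actions))  # user "a" alone covers 80%, so 1 of 5 users
```

Run once per interaction type (comments, Notes, Tags), this yields the kind of percentage quoted above.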
Each LC photo on Flickr actually has a MARC record created for it in the LC catalog, which is pretty amazing. Of course there are a lot of useful things that can be in a MARC record, including Subject Headings. I wanted to explore two areas using subjects (LCSH): one being the popularity of subjects and the other the relationship between Flickr Tags and the subject headings.
The first is the popularity of particular subjects, popularity meaning which LCSH headings had the most interaction. To do this I just pulled the topical subfields ($a) for each photo and added up the interactions. This interface lets you explore that data, and it also includes popularity based on subcollection:
I did not see any obvious patterns. I think it mostly comes down to collection strength: there are a lot of World War photographs, so that is going to be a popular topic. Likewise for collection popularity, the majority of the photos come from the Bain Collection, so it will have the most interactions, but it is still interesting to see them listed out and sorted by interactions.
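The subject tally described above amounts to a small aggregation. Here is a sketch under stated assumptions: each photo is a dict with a `subjects` list (its $a values) and per-type interaction counts, and "interactions" means comments plus tags plus notes. Those field names, that definition, and the sample records are all hypothetical, not taken from the repo.

```python
from collections import Counter

def subject_popularity(photos):
    """Sum each photo's interactions into every topical heading it carries."""
    totals = Counter()
    for photo in photos:
        # Assumption: an interaction is any comment, tag, or note on the photo.
        interactions = photo["comments"] + photo["tags"] + photo["notes"]
        # set() so a heading repeated within one record is only counted once.
        for heading in set(photo["subjects"]):
            totals[heading] += interactions
    return totals.most_common()

# Invented records for illustration.
photos = [
    {"subjects": ["Miniature golf", "Windmills"], "comments": 3, "tags": 10, "notes": 1},
    {"subjects": ["Miniature golf"], "comments": 2, "tags": 4, "notes": 0},
    {"subjects": ["Baseball"], "comments": 7, "tags": 1, "notes": 2},
]
for heading, count in subject_popularity(photos):
    print(heading, count)
```

Because every heading on a photo receives that photo's full interaction count, heavily represented topics rise to the top, which matches the collection-strength effect noted above.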
The other area with subject headings I thought would be interesting is Folksonomy vs Vocabulary. The LCSH headings come from the photo record and the Tags come from Flickr users. I was curious if there are any patterns between what LCSH headings would be assigned vs what the Flickr users are adding to the images.
Let’s look at one example, photos with the LCSH heading “Miniature golf”:
You can see the authorized heading “Miniature golf” and its one variant, “Golf, Miniature,” compared to the Tags Flickr users have added. There are a couple of useful variants among them, such as “mini golf” and “put-put,” that would probably make good additions to the LCSH variant labels. But most of the Tags are image-specific, identifying mini golf locations or features. Some of the tags, like Dinosaurs, do have LCSH equivalents and were added to the MARC record. Here is a list of all LCSH headings added to the miniature golf images:
These are all the unique headings. Each photo would have at least an $a Miniature golf and a $y, and some of them would also have a descriptive feature $a like Frogs or Windmills. LCSH does provide the ability to do this more fine-grained topical assignment, and the cataloger has utilized it, so in that case I don’t see an advantage to the Folksonomy. But the name-based Tags identifying specific golf courses are nice, because a Name Authority would not be created for those within the library ecosystem, and tagging lets users identify entities ad hoc without the overhead of authority control. Again, nothing super surprising here: Folksonomies work well for some things, but it really comes down to applying the correct system and how consistently it is applied.
The last important point I wanted to explore is how these images fit into the internet’s knowledge system, specifically Wikimedia projects. If you explore the comment graph above you will notice a lot of links to Wikipedia and Wikidata. Those projects allow users to immediately act upon the information they are creating while interacting with the photos. A lot of comments are things like “I just created a Wikipedia article for them,” resulting in the collection seeding information into the internet’s knowledge graph. It also helps that these images are public domain and are being ingested into Wikimedia Commons, allowing them to be included in any Wikipedia or Wikidata resources created. In my mind, tying the resources to Wiki* projects as much as possible provides a huge boost to the user’s ability to contribute back.
I wanted to try to explore how these images fit into the Wiki ecosystem so I took all the comments that had a Wikipedia or Wikidata link in them and built a large navigable graph interface to display the photos and the Wiki* resources they are attached to.
You can watch this video of me interacting with the graph or explore it yourself:
The images are populated and then the Wiki* links are resolved and those entities are also added to the graph. So it is building a large interconnected network based on what the Flickr users thought was important to tag in the image. I don’t really think of this as a good interface, but I wanted to show the magnitude of information and how empowered users are able to place resources into that context. This graph was built only from Flickr users adding Wiki* links to their comments, and it represents only around a quarter of the total images.
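The first half of that process, finding Wiki* links in comments and connecting them to photos, can be sketched like this. It is an illustration, not the repo code: the URL pattern is my guess at what the links look like, `build_graph` is a made-up name, the input is assumed to be `(photo_id, comment_text)` pairs, and the step of resolving each link to its entity is left out.

```python
import re
from collections import defaultdict

# Wikipedia article links and Wikidata item links (assumed URL shapes).
WIKI_LINK = re.compile(
    r"https?://(?:[a-z0-9-]+\.)*"
    r"(?:wikipedia\.org/wiki/[^\s<>]+|wikidata\.org/(?:wiki|entity)/Q\d+)",
    re.IGNORECASE,
)

def build_graph(comments):
    """Return (photo -> links, link -> photos) for every Wiki* link in the comments."""
    photo_to_links = defaultdict(set)
    link_to_photos = defaultdict(set)
    for photo_id, text in comments:
        for match in WIKI_LINK.finditer(text):
            url = match.group(0).rstrip(".,;")  # drop sentence punctuation
            photo_to_links[photo_id].add(url)
            link_to_photos[url].add(photo_id)
    return photo_to_links, link_to_photos

# Invented comments for illustration.
comments = [
    ("p1", "See https://en.wikipedia.org/wiki/Douglas_Adams."),
    ("p2", "Same person: https://en.wikipedia.org/wiki/Douglas_Adams, also https://www.wikidata.org/wiki/Q42"),
    ("p3", "Great shot!"),
]
photo_to_links, link_to_photos = build_graph(comments)
print(sorted(link_to_photos["https://en.wikipedia.org/wiki/Douglas_Adams"]))
```

Any link shared by more than one photo becomes a bridge between them, which is where the interconnected network comes from.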
That brings us to the end of what I looked at with this data. I think this is a really interesting project dataset because of its longevity and the range of user activities it encompasses.