I came across people sharing the Plantin-Moretus Museum collection site and I really liked it. The 14,000 woodcuts are obviously all very different, but the monochromatic cohesion of the collection ties it together. I thought it would be a great collection to try some of the image-based machine learning techniques I’ve seen recently. Specifically, I wanted to try the Meta Segment Anything 2 model on it, thinking it would be very cool to break some of the very dense but simple prints down into component pieces and do something fun with them.
Having been a laptop loser for the last 15-ish years (still am, tbh) I built a new computer this summer. I haven’t put together a computer since my early 20s, but it went pretty well. I specifically put a nice Nvidia card in it to try some local machine learning stuff. The card is more of a gamer card, but with 24GB of VRAM it can still run a lot of models in memory. To get the models working I ran Windows, which makes installing the Nvidia drivers and CUDA toolkit easy, then used Windows Subsystem for Linux, which gives you a Linux environment within Windows and makes the whole setup tolerable.
Getting the corpus was pretty straightforward; the site has a nice JSON search index. I downloaded all the images, using the “thumbnail” version for my project. They have very large hi-res TIFFs available, but I didn’t use them. Most of the almost 14K images are a few hundred kilobytes, and none is larger than a megabyte.
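The download step is basically a loop over that index. Here is a minimal sketch of what that looks like; the field names (“records”, “thumbnail”, “id”) are assumptions about the index structure, not the museum’s actual schema:

```python
# Sketch of the thumbnail download step. Field names are assumptions --
# adjust to match the real structure of the museum's JSON search index.
import json
import pathlib
import time

import requests

OUT_DIR = pathlib.Path("thumbnails")
OUT_DIR.mkdir(exist_ok=True)

with open("search_index.json") as f:
    records = json.load(f)["records"]

for record in records:
    url = record["thumbnail"]          # thumbnail-size image, not the hi-res TIFF
    dest = OUT_DIR / f"{record['id']}.jpg"
    if dest.exists():
        continue                       # makes re-runs resumable
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    dest.write_bytes(resp.content)
    time.sleep(0.5)                    # be polite to the museum's server
```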
You have probably seen the Meta Segment Anything demo site; it is super impressive. Specifically, the ability to do automatic masking (as opposed to defining a specific target region) and have it pick out prominent features and build masks, which lets you extract the sub-images. The demo site does a great job, so I started running my woodcut images through the SAM 2.1 Hiera Large model, asking its automatic mask generator to pull out features. It worked terribly compared to the demo site. I spent a couple hours tweaking the parameters and it got better, but nowhere near as good as the demo on the Meta site. Here are some comparison images:
So what gives? I started searching and found a lot of other people with the same question: why does the demo site do automatic masking so much better than the local model? The project’s GitHub issues are full of tickets like this, and the general consensus is that the released model is not the same model used on the site, at least for the automatic masking functionality. I would love to hear otherwise, but it appears the quality of the auto masking in the demo is not attainable with the released model at this point.
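For reference, this is roughly the shape of the local pipeline I was tuning. The checkpoint and config names follow the sam2 repo’s conventions, and the parameter values shown are just examples of the knobs the automatic mask generator exposes, not a recipe that gets you demo-quality output:

```python
import os

import cv2
import numpy as np
from sam2.build_sam import build_sam2
from sam2.automatic_mask_generator import SAM2AutomaticMaskGenerator

# Checkpoint and config names follow the sam2 repo's conventions;
# adjust the paths to wherever you downloaded them.
checkpoint = "checkpoints/sam2.1_hiera_large.pt"
model_cfg = "configs/sam2.1/sam2.1_hiera_l.yaml"

model = build_sam2(model_cfg, checkpoint, device="cuda")
mask_generator = SAM2AutomaticMaskGenerator(
    model,
    points_per_side=64,             # denser sampling grid than the default
    pred_iou_thresh=0.7,            # loosen the quality filters to keep more masks
    stability_score_thresh=0.85,
    min_mask_region_area=500,       # drop tiny speckle regions
)

image = cv2.cvtColor(cv2.imread("woodcut.jpg"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(image)  # list of dicts: "segmentation", "bbox", "area", ...

os.makedirs("crops", exist_ok=True)
for i, m in enumerate(sorted(masks, key=lambda m: m["area"], reverse=True)):
    x, y, w, h = (int(v) for v in m["bbox"])
    seg = m["segmentation"][y:y + h, x:x + w, None]
    cutout = np.where(seg, image[y:y + h, x:x + w], 255).astype(np.uint8)
    cv2.imwrite(f"crops/crop_{i:03d}.png", cv2.cvtColor(cutout, cv2.COLOR_RGB2BGR))
```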
This kinda derailed my goal, but I still wanted to try some other approaches, so I used rembg, which can use a number of models to do background removal. I wasn’t interested in strictly removing the background, though; I wanted to pull out elements of the woodcuts. The tool performed okay: a lot of messy transparencies, but also a lot of clean crops. If I were to do this again I would spend more time here, maybe try creating my own model based on the corpus.
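The rembg pass is only a few lines. A rough sketch; the model choice (“isnet-general-use”) is an assumption, and picking or training a better-suited model is exactly the part I’d revisit:

```python
from pathlib import Path

from rembg import new_session, remove

# "isnet-general-use" is one of rembg's bundled models; which model works
# best on high-contrast woodcuts is something to experiment with.
session = new_session("isnet-general-use")

out_dir = Path("cutouts")
out_dir.mkdir(exist_ok=True)

for src in Path("thumbnails").glob("*.jpg"):
    cut = remove(src.read_bytes(), session=session)  # PNG bytes with an alpha channel
    (out_dir / f"{src.stem}.png").write_bytes(cut)
```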
Next I wanted to try the LLaVA multimodal model. You can pass it an image and a prompt telling it what to do with the image. I got it running with Ollama pretty easily and started sending the cropped images with a prompt asking for search terms describing the image. I basically wanted to use the model to build a browsable index of the images. The challenge, of course, is getting the model to describe the content and not the medium, and it also produces a very long tail of single terms that apply to only one image, which is not useful for building buckets. From the list of a few thousand terms it produced for all of the images, I cut it off at four uses per term, meaning a term had to be used on at least four images to make it into the list. I will use these tags to build a browsable list.
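Roughly, the tagging pass looked like this. This is a sketch using the ollama Python client; the exact model tag and prompt wording are stand-ins, but the four-image cutoff is the one described above:

```python
from collections import Counter
from pathlib import Path

import ollama

PROMPT = (
    "List 5-10 short search terms describing what is depicted in this woodcut. "
    "Describe the subject matter, not the medium or style. "
    "Return a comma-separated list only."
)

term_counts = Counter()
image_terms = {}

for img in Path("cutouts").glob("*.png"):
    # Model tag is an assumption; any local LLaVA build pulled into Ollama works.
    resp = ollama.generate(model="llava:13b", prompt=PROMPT, images=[str(img)])
    terms = [t.strip().lower() for t in resp["response"].split(",") if t.strip()]
    image_terms[img.name] = terms
    term_counts.update(set(terms))

# Keep only terms used on at least four images -- everything else is the
# long tail of one-off descriptions that can't form a browsable bucket.
kept = {term for term, n in term_counts.items() if n >= 4}
index = {name: [t for t in terms if t in kept] for name, terms in image_terms.items()}
```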
If you go to the original Plantin-Moretus Museum site you will find nice broad term tags created for the images, for example “BOTANICAL” or “ANIMAL”, which are great and I’m sure were machine generated. The terms coming out of the LLaVA model are more descriptive and narrow, but lack a controlled vocabulary and are all over the place. I think you could spend a lot of time here getting better quality terms and groupings out of the model, but this was a good start. And of course, regardless of how they are produced, machine-generated terms are going to be lower quality and probably not suitable for official metadata records. But for something like this, where there is no existing descriptive metadata and the terms are only used at the application layer, it seems like a good way to at least attempt to generate some more description for 14,000 item-level resources.
We’ve got the cropped resources and some descriptions, now let’s spend way too much of our free time building an interface! I wanted to make something that lets you creatively explore the images and do something fun with them. The result is an interface that lets you overlay elements from the woodcuts to composite an image from the parts. I also added a social aspect: you can share the final creation and it will be posted to a new Bluesky account called WoodBlockShop along with links to the original sources of the images used in the composition.
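The Bluesky piece can be handled with the atproto Python SDK. A rough sketch, with the handle, app password, and file path as placeholders:

```python
from atproto import Client

# Handle and app password are placeholders -- use an app password, not
# the account password.
client = Client()
client.login("woodblockshop.example", "app-password-here")

with open("composition.png", "rb") as f:
    image_bytes = f.read()

# Links back to the source records go in the post text.
text = "A new woodcut composition. Sources: (links to the original records here)"
client.send_image(text=text, image=image_bytes, image_alt="Composite of woodcut elements")
```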
The app lives here: https://woodblockshop.glitch.me/
You can also watch a video of how it works:
What I like about it is the exploration of the collection via creation. I also like the visual prompting of ideas through the random assortment of image elements it gives you; I found myself randomly combining elements that then prompted a visual narrative or memory. It’s also giving an “it’s 2015 and we’re at a cultural heritage hackathon” vibe, and I kind of miss that energy, to be honest.
I don’t have a repo for this project, but here are some of the scripts I used: