WoodBlockShop

Using Segment Anything, LLaVA and other methods on a 14K image corpus

Dec 16 2024

I came across people sharing the Plantin-Moretus Museum collection site and I really liked it. The 14,000 woodcuts are obviously all very different, but the monochromatic cohesion of the collection ties it together. I thought it would be a great collection to try some of the various image-based machine learning techniques I’ve seen recently. Specifically, I wanted to try the Meta Segment Anything 2 model on it, thinking it would be very cool to break down some of the very dense but simple prints into component pieces and do something fun with them.

Having been a laptop loser for the last 15-ish years (still am tbh) I built a new computer this summer. I haven’t put together a computer since my early 20s but it went pretty well. I specifically put a nice Nvidia card in it to try some local machine learning stuff. The card is more of a gamer card, but with 24GB of VRAM it can still run a lot of models in memory. To get the models working I ran Windows, which makes installing the Nvidia drivers and CUDA toolkit easy, and then used Windows Subsystem for Linux, which gives you a Linux environment within Windows and makes working in it tolerable.
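
The post doesn’t include any setup code, but as a quick sanity check after getting the drivers and CUDA toolkit working inside WSL, something like this confirms PyTorch can actually see the card (a minimal sketch, not from the original project):

```python
# Quick sanity check that PyTorch can see the GPU from inside WSL.
# Purely illustrative -- not part of the original project's scripts.
import torch

print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print("Device:", torch.cuda.get_device_name(0))
    print("VRAM (GB):", round(props.total_memory / 1e9, 1))
```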

Getting the corpus was pretty straightforward, as the site has a nice JSON search index. I downloaded all the images, using the “thumbnail” version for my project; they have very large hi-res TIFFs available but I didn’t use them. Most of the almost 14K images are a few hundred kilobytes, and none are larger than a megabyte.
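
The harvest loop was roughly this shape. This is a sketch only: the index URL isn’t shown here, and the JSON field names (`id`, `thumbnail`) are placeholders rather than the museum’s actual schema:

```python
# Rough shape of the thumbnail harvest. The field names below are
# placeholders -- the real search index has its own structure.
import json
import pathlib
import requests

out_dir = pathlib.Path("thumbnails")
out_dir.mkdir(exist_ok=True)

# assume the JSON search index has already been saved locally
records = json.loads(pathlib.Path("search_index.json").read_text())

for rec in records:
    image_id = rec["id"]            # placeholder field name
    thumb_url = rec["thumbnail"]    # placeholder field name
    dest = out_dir / f"{image_id}.jpg"
    if dest.exists():
        continue
    resp = requests.get(thumb_url, timeout=30)
    resp.raise_for_status()
    dest.write_bytes(resp.content)
```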

You probably saw the Meta Segment Anything demo site; it is super impressive. Specifically the ability to do automatic masking (as opposed to defining a specific target region) and basically have it pick out prominent features and build masks, which lets you extract the sub-images. The demo site does a great job, so I started running my woodcut images through the SAM 2.1 Hiera Large model, asking its automatic mask generator to pull out features. And it worked terribly compared to the demo site. So I spent a couple hours tweaking the parameters and it got better, but nowhere near as good as the demo on the Meta site. Here are some comparison images:

Good automatic masking results from the Demo website:
Local model auto masking results with default parameters:
After significant tweaking of the local model parameters:
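
For reference, the local runs looked roughly like the sketch below, using the sam2 package from facebookresearch. The checkpoint/config paths and the tweaked parameter values are illustrative, not the exact settings I ended up with:

```python
# Minimal sketch of the local SAM 2.1 auto-masking runs.
# Paths and parameter values are illustrative.
from pathlib import Path

import numpy as np
from PIL import Image
from sam2.build_sam import build_sam2
from sam2.automatic_mask_generator import SAM2AutomaticMaskGenerator

checkpoint = "checkpoints/sam2.1_hiera_large.pt"
model_cfg = "configs/sam2.1/sam2.1_hiera_l.yaml"
sam2_model = build_sam2(model_cfg, checkpoint, device="cuda")

# Default settings first, then lots of tweaking of knobs like these.
mask_generator = SAM2AutomaticMaskGenerator(
    sam2_model,
    points_per_side=64,        # denser point grid than the default
    pred_iou_thresh=0.7,       # accept lower-confidence masks
    stability_score_thresh=0.85,
    min_mask_region_area=500,  # drop tiny speckle masks
)

Path("crops").mkdir(exist_ok=True)
image = np.array(Image.open("thumbnails/example.jpg").convert("RGB"))
masks = mask_generator.generate(image)

# Each mask dict has a boolean "segmentation" array plus an XYWH bounding
# box, which is enough to crop the feature out of the print.
for i, m in enumerate(sorted(masks, key=lambda m: m["area"], reverse=True)):
    x, y, w, h = [int(v) for v in m["bbox"]]
    Image.fromarray(image[y:y + h, x:x + w]).save(f"crops/example_{i}.png")
```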

So what gives? I started searching and found a lot of other people asking the same question: why does the demo site do automatic masking so much better than the local model? The project’s GitHub issues have tons of tickets like this, and the general consensus is that the released model is not the same model used on the site, at least for the automatic masking functionality. I would love to hear otherwise, but it appears the quality of the auto masking in the demo is not attainable using the released model at this point.

This kinda derailed my goal, but I still wanted to try some other approaches, so I used rembg, which can use a number of models to do background removal. I wasn’t interested in strictly removing the background though; I wanted to pull out elements of the woodcuts. The tool performed okay, producing a lot of messy transparencies but also a lot of clean crops. If I were to do this again I would spend more time here, maybe try creating my own model based on the corpus.
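
The rembg pass is only a few lines. The sketch below assumes the rembg Python API; the model name is one of its built-in options, not necessarily the one that worked best for the woodcuts:

```python
# Sketch of the rembg pass over the thumbnails. The model name is one of
# rembg's built-in sessions; results vary a lot per image.
from pathlib import Path

from rembg import new_session, remove

session = new_session("isnet-general-use")  # or "u2net", etc.
out_dir = Path("cutouts")
out_dir.mkdir(exist_ok=True)

for path in Path("thumbnails").glob("*.jpg"):
    with open(path, "rb") as f:
        result = remove(f.read(), session=session)  # PNG bytes with alpha channel
    (out_dir / f"{path.stem}.png").write_bytes(result)
```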

Next I wanted to try the LLaVA multimodal model. You can pass it an image and a prompt telling it what to do with it. I got it running with Ollama pretty easily and started sending the cropped images with a prompt asking for search terms describing each image. I basically wanted to use the model to build a browsable index of the images. The challenge, of course, is getting the model to describe the content and not the medium, and it also produces a very long tail of single terms that apply to only one image, which is not useful for building buckets. From the few thousand terms it produced across all of the images I truncated the list at four uses per term, meaning a term had to be used on at least four images to make it in. I will use these tags to build a browsable list.
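
The tagging pass plus the frequency cutoff could look roughly like this, using the Ollama Python client. The prompt wording here is approximate and the directory names are carried over from the earlier sketches:

```python
# Sketch of the LLaVA tagging pass via Ollama, plus the "keep a term only
# if it appears on at least four images" filter. Prompt text is approximate.
import json
from collections import Counter
from pathlib import Path

import ollama

PROMPT = (
    "List short search terms describing what is depicted in this image. "
    "Describe the content, not the medium or style. "
    "Return a comma-separated list only."
)

tags_by_image = {}
for path in sorted(Path("cutouts").glob("*.png")):
    resp = ollama.generate(model="llava", prompt=PROMPT, images=[str(path)])
    terms = [t.strip().lower() for t in resp["response"].split(",") if t.strip()]
    tags_by_image[path.name] = terms

# Drop the long tail: a term has to show up on at least four images.
counts = Counter(t for terms in tags_by_image.values() for t in terms)
keep = {t for t, n in counts.items() if n >= 4}
filtered = {img: [t for t in terms if t in keep] for img, terms in tags_by_image.items()}

Path("tags.json").write_text(json.dumps(filtered, indent=2))
```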

If you go to the original Plantin-Moretus Museum site you will find nice broad tags created for the images, for example “BOTANICAL” or “ANIMAL”, which are great and I’m sure were machine generated. The terms coming out of the LLaVA model are more descriptive and narrow but lack a controlled vocabulary and are all over the place. I think you could spend a lot of time here getting better quality terms and groupings out of the model, but this was a good start. And of course, regardless of how they are produced, machine generated terms are going to be lower quality and probably not suitable for official metadata records. But for something like this, where there is no existing descriptive metadata and the terms are only used at the application layer, it seems like a good way to at least attempt to generate some more description for 14,000 item-level resources.

We’ve got the cropped resources and some descriptions, now let’s spend way too much of our free time building an interface! I wanted to make something that sort of lets you creatively explore the images and do something fun with them. The result is an interface that lets you overlay elements from the woodcuts to composite an image from the parts. I also added a social aspect: you can share the final creation and it will be posted to a new Bluesky account called WoodBlockShop, along with links to the original sources of the images used in the composition.
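
The app itself runs on Glitch, so the real posting code may look quite different, but pushing a finished composition to Bluesky from Python could be sketched with the atproto SDK like this (handle, app password, and file paths are all placeholders):

```python
# Rough sketch of the Bluesky posting step using the atproto Python SDK.
# The actual app may implement this differently; credentials and paths
# below are placeholders.
from atproto import Client

client = Client()
client.login("woodblockshop.bsky.social", "app-password-here")  # placeholder creds

with open("composition.png", "rb") as f:
    image_bytes = f.read()

post_text = "A new woodcut composition. Sources: https://example.org/source-links"  # placeholder
client.send_image(
    text=post_text,
    image=image_bytes,
    image_alt="Composite image assembled from Plantin-Moretus woodcut elements",
)
```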

The app lives here: https://woodblockshop.glitch.me/

Or you can watch a video of how it works:

What I like about it is the sort of exploration of the collection via creation. I also like the visual prompting of ideas through the random assortment of image elements it gives you; I found myself randomly combining elements that then prompted a visual narrative or memory. It’s also giving an “it’s 2015 and we’re at a cultural heritage hackathon” vibe, and I kind of miss that energy to be honest.

Code Links:

I don’t have a repo for this project, but here are some of the scripts used:


Related Posts:

Using GPT on Library Collections
Automatic transcription of Alan Lomax’s 1938 Midwest Folk Song Collection using Whisper.cpp