William J.B. Mattingly
@wjbmattingly.bsky.social
Digital Nomad · Historian · Data Scientist · NLP · Machine Learning

Cultural Heritage Data Scientist at Yale
Former Postdoc at the Smithsonian
Maintainer of Python Tutorials for Digital Humanities

https://linktr.ee/wjbmattingly
You can now process Hebrew archival documents with Qwen 3 VL =) I'll be using this to finetune further on handwritten Hebrew. Metrics are on a test set that is fairly close in style and structure to the training data; I also tested on out-of-training edge cases and it worked (link to model below)
October 30, 2025 at 1:16 PM
Does anyone have a dataset of 1,000+ pages of handwritten text on Transkribus that they want to use for finetuning a VLM? If so, please let me know. This would be for any language and any script.
October 27, 2025 at 5:56 PM
More coming soon, but I finetuned Qwen 3 VL-8B on 150k lines of synthetic Yiddish typed and handwritten data. Results are pretty amazing: even on the harder held-out set it gets a CER of 1% and a WER of 2%. Preparing a page-level dataset and finetunes now, thanks to the John Locke Jr.
October 24, 2025 at 8:14 PM
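For anyone unfamiliar with the CER/WER numbers above: both are edit-distance metrics over the reference text. A minimal sketch (not the actual evaluation code behind those figures, which isn't shown here):

```python
def levenshtein(a, b):
    # classic dynamic-programming edit distance; works on strings
    # (characters) or lists (words)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def cer(ref, hyp):
    # character error rate: edits / reference length
    return levenshtein(ref, hyp) / max(len(ref), 1)

def wer(ref, hyp):
    # word error rate: same idea over whitespace-split tokens
    return levenshtein(ref.split(), hyp.split()) / max(len(ref.split()), 1)
```

In practice a library like `jiwer` does the same thing with more normalization options.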
Over the last 24 hours, I have finetuned three Qwen3-VL models (2B, 4B, and 8B) on the CATmuS dataset on @hf.co . The first versions of the models are now available on the Small Models for GLAM organization with @danielvanstrien.bsky.social (links below). Working on improving them further.
October 24, 2025 at 2:59 PM
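For line-level HTR finetunes like these, the training examples are typically packed into a chat format the VLM's template understands. A rough sketch of that shape (field names follow the Qwen-VL-style message convention; the instruction text and image value are illustrative, not the actual training config):

```python
def to_chat_example(image, transcription):
    # one user turn carrying the line image plus an instruction,
    # one assistant turn carrying the ground-truth transcription
    return {
        "messages": [
            {"role": "user", "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": "Transcribe this line."},
            ]},
            {"role": "assistant", "content": [
                {"type": "text", "text": transcription},
            ]},
        ]
    }
```

Each dataset row then maps through this before being fed to the processor/trainer.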
Reposted by William J.B. Mattingly
DeepSeek-OCR just got vLLM support 🚀

Currently processing @natlibscot.bsky.social's 27,915-page handbook collection with one command.

Processing at ~350 images/sec on A100

Using @hf.co Jobs + uv: zero-setup batch OCR!

Will share final time + cost when done!
October 22, 2025 at 7:20 PM
Want an easy way to edit the output from Dots.OCR? Introducing the Dots.OCR Editor.

Features:
1) Edit bounding boxes
2) Edit OCR
3) Edit reading order
4) Group sections (good for newspapers)

Vibe coded with Claude 4.5

github.com/wjbmattingly...
GitHub - wjbmattingly/dots-ocr-editor
October 21, 2025 at 4:21 PM
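The four editor features above all operate on per-region records. A hypothetical record along those lines (this is an illustrative sketch, not the actual Dots.OCR output schema or the editor's internal model):

```python
from dataclasses import dataclass

@dataclass
class Region:
    bbox: tuple           # (x1, y1, x2, y2) in pixel coordinates
    text: str             # editable OCR transcription
    order: int            # editable reading-order index
    group: str = ""       # optional section/article id, e.g. for newspapers

def reading_order(regions):
    # reconstruct the page text by the (possibly hand-corrected) order index
    return [r.text for r in sorted(regions, key=lambda r: r.order)]
```

Editing then reduces to mutating `bbox`, `text`, `order`, or `group` and re-serializing.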
Reposted by William J.B. Mattingly
Small models work great for GLAM but there aren't enough examples!

With @wjbmattingly.bsky.social I'm launching small-models-for-glam on @hf.co to create/curate models that run on modest hardware and address GLAM use cases.

Follow the org to keep up-to-date!
huggingface.co/small-models...
October 16, 2025 at 1:22 PM
🚨Job ALERT🚨! My old postdoc is available!

I cannot emphasize enough how life-altering this position was for me. It gave me the experience that I needed for my current role. As a postdoc, I was able to define my own projects and acquire a lot of new skills, as well as refine some I already had.
September 24, 2025 at 1:50 PM
Reposted by William J.B. Mattingly
Excited to be co-editing a special issue of @dhquarterly.bsky.social on Artificial Intelligence for Digital Humanities: Research problems and critical approaches
dhq.digitalhumanities.org/news/news.html

We're inviting abstracts now - please feel free to reach out with any questions!
September 9, 2025 at 8:28 PM
Something I've realized over the last couple weeks of finetuning various VLMs is that we just need more data. Unfortunately, that takes a lot of time. That's why I'm returning to my synthetic HTR workflow. It will now be packaged and expanded to work with other low-resource languages. Stay tuned.
August 14, 2025 at 4:08 PM
I've been getting asked for training scripts whenever a new VLM drops. Instead of scripts, I'm going to start updating this new Python package. It's not fancy. It's for full finetunes. This was how I first trained Qwen 2 VL last year.
August 13, 2025 at 7:38 PM
Let's go! Training LFM2-VL 1.6B on the CATmuS dataset on @hf.co now. Will start posting some benchmarks on this model soon.
August 13, 2025 at 4:58 PM
Training on the full CATmuS dataset now, and the results after the first checkpoint are very promising: character-level and massive word-level improvements.
August 13, 2025 at 3:53 PM
LiquidAI cooked with LFM2-VL. At the risk of sounding like an X AI influencer, don't sleep on this model. I'm finetuning right now on CATmuS. A small overnight test on only 3k examples is showing remarkable improvement. Training now on 150k samples. I see this as potentially replacing TrOCR.
August 13, 2025 at 3:01 PM
New super-lightweight VLM just dropped from Liquid AI in two flavors: 450M and 1.6B. Both models can work out-of-the-box with medieval Latin at the line level. I'm fine-tuning on CATmuS/medieval right now on an H200.
August 12, 2025 at 7:18 PM
Reposted by William J.B. Mattingly
With #IMMARKUS, you can already use popular AI services for image transcription. Now, you can also use them for translation! Transcribe a historic source, select the annotation—and translate it with a click.
August 12, 2025 at 9:48 AM
GLM-4.5V with line-level transcription of medieval Latin in Caroline minuscule. Inference was run through @hf.co Inference via Novita.
August 12, 2025 at 1:50 AM
Qwen 3-4B Thinking finetune nearly ready to share. It can convert unstructured natural language, non-Linked Art JSON, and HTML into Linked Art JSON.
August 11, 2025 at 8:07 PM
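For readers unfamiliar with the target format: Linked Art is a JSON-LD profile for cultural heritage data. A minimal record of the kind such a converter would emit (the structure follows the public Linked Art model; the ids and labels are illustrative, and the finetune's exact output isn't shown):

```python
# minimal Linked Art JSON-LD record for a physical object
record = {
    "@context": "https://linked.art/ns/v1/linked-art.json",
    "id": "https://example.org/object/1",          # hypothetical URI
    "type": "HumanMadeObject",
    "_label": "Example Manuscript",
    "identified_by": [
        {"type": "Name", "content": "Example Manuscript"}
    ],
}
```

The conversion task is then: map free text, arbitrary JSON, or HTML onto this vocabulary of typed, nested entities.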
Reposted by William J.B. Mattingly
Discover Magazine did a nice feature on the Voynich Manuscript. I had a delightful conversation with Sam Waters, and here's the result. (there's also a print version in the current issue) #linguistics #Yale #YaleLibrary #BeineckeLibrary #conlangs

www.discovermagazine.com/was-the-worl...
Was the World’s Most Mysterious Manuscript from the Middle Ages A Hoax?
August 8, 2025 at 10:20 PM
I spent the last 24 hours finetuning Dots.OCR with different datasets. Here are some of the things I learned... TLDR Don't sleep on this model! If you do HTR or OCR try this.
August 7, 2025 at 1:36 PM
Vibe coding is great for quick visualizers for projects. A single Claude 4 Sonnet prompt and voilà.
August 6, 2025 at 8:45 PM
Finetune of Dots.OCR working! The output isn't quite there yet. Still a lot more training to go, this is just a checkpoint. This is far better than the default model, though! It learns to adjust the bboxes to your style very early in the training, so layout parsing is easier to work into the model.
August 6, 2025 at 3:16 PM
Woot! Just got the first finetune of Dots.OCR on @hf.co! Going to share the training script, hopefully today. This finetune is for Old Church Slavonic.
August 6, 2025 at 1:14 PM
Reposted by William J.B. Mattingly
Yeah, really impressive. There are some minor errors, but really much better than anything I've tried until now.
August 6, 2025 at 11:59 AM
@ponteineptique.bsky.social any interest in finetuning this with me on HTR-United data? I'm thinking the logical place to start is CATmuS segmented with the HTR data. The model can handle both layout parsing and HTR; is there a good way to get both kinds of data for all of CATmuS?
August 5, 2025 at 3:37 PM