William J.B. Mattingly
@wjbmattingly.bsky.social
Digital Nomad · Historian · Data Scientist · NLP · Machine Learning

Cultural Heritage Data Scientist at Yale
Former Postdoc at the Smithsonian
Maintainer of Python Tutorials for Digital Humanities

https://linktr.ee/wjbmattingly
You can now process Hebrew archival documents with Qwen 3 VL =) I'll be using this to finetune further on handwritten Hebrew. Metrics are on a test set that is fairly close in style and structure to the training data; I also tested on out-of-training edge cases and it worked (link to model below)
October 30, 2025 at 1:16 PM
Does anyone have a dataset of 1,000+ pages of handwritten text on Transkribus that they want to use for finetuning a VLM? If so, please let me know. This would be for any language and any script.
October 27, 2025 at 5:56 PM
More coming soon, but I finetuned Qwen 3 VL-8B on 150k lines of synthetic Yiddish typed and handwritten data. Results are pretty amazing: even on the harder held-out set it gets a CER of 1% and a WER of 2%. Preparing a page-level dataset and finetunes now, thanks to the John Locke Jr.
October 24, 2025 at 8:14 PM
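For anyone unfamiliar with the CER/WER numbers above: both are edit-distance metrics over the reference text. A minimal sketch (not the actual evaluation code behind those figures, which isn't shown here):

```python
def levenshtein(a, b):
    # classic dynamic-programming edit distance; works on strings
    # (characters) or lists (words)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def cer(ref, hyp):
    # character error rate: edits / reference length
    return levenshtein(ref, hyp) / max(len(ref), 1)

def wer(ref, hyp):
    # word error rate: same idea over whitespace-split tokens
    return levenshtein(ref.split(), hyp.split()) / max(len(ref.split()), 1)
```

In practice a library like `jiwer` does the same thing with more normalization options.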
Over the last 24 hours, I have finetuned three Qwen3-VL models (2B, 4B, and 8B) on the CATmuS dataset on @hf.co . The first versions of the models are now available on the Small Models for GLAM organization with @danielvanstrien.bsky.social (links below). Working on improving them further.
October 24, 2025 at 2:59 PM
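For line-level HTR finetunes like these, the training examples are typically packed into a chat format the VLM's template understands. A rough sketch of that shape (field names follow the Qwen-VL-style message convention; the instruction text and image value are illustrative, not the actual training config):

```python
def to_chat_example(image, transcription):
    # one user turn carrying the line image plus an instruction,
    # one assistant turn carrying the ground-truth transcription
    return {
        "messages": [
            {"role": "user", "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": "Transcribe this line."},
            ]},
            {"role": "assistant", "content": [
                {"type": "text", "text": transcription},
            ]},
        ]
    }
```

Each dataset row then maps through this before being fed to the processor/trainer.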
Reposted by William J.B. Mattingly
DeepSeek-OCR just got vLLM support 🚀

Currently processing @natlibscot.bsky.social's 27,915-page handbook collection with one command.

Processing at ~350 images/sec on A100

Using @hf.co Jobs + uv: zero-setup batch OCR!

Will share final time + cost when done!
October 22, 2025 at 7:20 PM
Want an easy way to edit the output from Dots.OCR? Introducing the Dots.OCR Editor.

Features:
1) Edit bounding boxes
2) Edit OCR
3) Edit reading order
4) Group sections (good for newspapers)

Vibe coded with Claude 4.5

github.com/wjbmattingly...
GitHub - wjbmattingly/dots-ocr-editor
October 21, 2025 at 4:21 PM
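The four editor features above all operate on per-region records. A hypothetical record along those lines (this is an illustrative sketch, not the actual Dots.OCR output schema or the editor's internal model):

```python
from dataclasses import dataclass

@dataclass
class Region:
    bbox: tuple           # (x1, y1, x2, y2) in pixel coordinates
    text: str             # editable OCR transcription
    order: int            # editable reading-order index
    group: str = ""       # optional section/article id, e.g. for newspapers

def reading_order(regions):
    # reconstruct the page text by the (possibly hand-corrected) order index
    return [r.text for r in sorted(regions, key=lambda r: r.order)]
```

Editing then reduces to mutating `bbox`, `text`, `order`, or `group` and re-serializing.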
Reposted by William J.B. Mattingly
Small models work great for GLAM but there aren't enough examples!

With @wjbmattingly.bsky.social I'm launching small-models-for-glam on @hf.co to create/curate models that run on modest hardware and address GLAM use cases.

Follow the org to keep up-to-date!
huggingface.co/small-models...
October 16, 2025 at 1:22 PM
🚨Job ALERT🚨! My old postdoc is available!

I cannot emphasize enough how life-altering this position was for me. It gave me the experience that I needed for my current role. As a postdoc, I was able to define my own projects and acquire a lot of new skills, as well as refine some I already had.
September 24, 2025 at 1:50 PM
Reposted by William J.B. Mattingly
Excited to be co-editing a special issue of @dhquarterly.bsky.social on Artificial Intelligence for Digital Humanities: Research problems and critical approaches
dhq.digitalhumanities.org/news/news.html

We're inviting abstracts now - please feel free to reach out with any questions!
September 9, 2025 at 8:28 PM
Something I've realized over the last couple weeks of finetuning various VLMs is that we just need more data. Unfortunately, that takes a lot of time. That's why I'm returning to my synthetic HTR workflow. It will now be packaged and expanded to work with other low-resource languages. Stay tuned.
August 14, 2025 at 4:08 PM
I've been getting asked for training scripts whenever a new VLM drops. Instead of scripts, I'm going to start updating this new Python package. It's not fancy. It's for full finetunes. This was how I first trained Qwen 2 VL last year.
August 13, 2025 at 7:38 PM
Let's go! Training LFM2-VL 1.6B on the CATmuS dataset on @hf.co now. Will start posting some benchmarks on this model soon.
August 13, 2025 at 4:58 PM
Training on the full CATmuS dataset now, and the results after the first checkpoint are very promising: character-level and massive word-level improvements.
August 13, 2025 at 3:53 PM
LiquidAI cooked with LFM2-VL. At the risk of sounding like an X AI influencer, don't sleep on this model. I'm finetuning right now on CATmuS. A small overnight test on only 3k examples is showing remarkable improvement. Training now on 150k samples. I see this as potentially replacing TrOCR.
August 13, 2025 at 3:01 PM
New super-lightweight VLM just dropped from Liquid AI in two flavors: 450M and 1.6B. Both models can work out-of-the-box with medieval Latin at the line level. I'm fine-tuning on CATmuS/medieval right now on an H200.
August 12, 2025 at 7:18 PM
Reposted by William J.B. Mattingly
With #IMMARKUS, you can already use popular AI services for image transcription. Now, you can also use them for translation! Transcribe a historic source, select the annotation—and translate it with a click.
August 12, 2025 at 9:48 AM
GLM-4.5V with line-level transcription of medieval Latin in Caroline minuscule. Inference was run through @hf.co Inference via Novita.
August 12, 2025 at 1:50 AM
Qwen 3-4B Thinking finetune nearly ready to share. It can convert unstructured natural language, non-Linked Art JSON, and HTML into Linked Art JSON.
August 11, 2025 at 8:07 PM
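For readers unfamiliar with the target format: Linked Art is a JSON-LD profile for cultural heritage data. A minimal record of the kind such a converter would emit (the structure follows the public Linked Art model; the ids and labels are illustrative, and the finetune's exact output isn't shown):

```python
# minimal Linked Art JSON-LD record for a physical object
record = {
    "@context": "https://linked.art/ns/v1/linked-art.json",
    "id": "https://example.org/object/1",          # hypothetical URI
    "type": "HumanMadeObject",
    "_label": "Example Manuscript",
    "identified_by": [
        {"type": "Name", "content": "Example Manuscript"}
    ],
}
```

The conversion task is then: map free text, arbitrary JSON, or HTML onto this vocabulary of typed, nested entities.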
Reposted by William J.B. Mattingly
Discover Magazine did a nice feature on the Voynich Manuscript. I had a delightful conversation with Sam Waters, and here's the result. (there's also a print version in the current issue) #linguistics #Yale #YaleLibrary #BeineckeLibrary #conlangs

www.discovermagazine.com/was-the-worl...
Was the World’s Most Mysterious Manuscript from the Middle Ages A Hoax?
August 8, 2025 at 10:20 PM
I spent the last 24 hours finetuning Dots.OCR with different datasets. Here are some of the things I learned... TLDR Don't sleep on this model! If you do HTR or OCR try this.
August 7, 2025 at 1:36 PM
Vibe coding is great for quick visualizers for projects. A single Claude 4 Sonnet prompt and voilà.
August 6, 2025 at 8:45 PM
Finetune of Dots.OCR working! The output isn't quite there yet. Still a lot more training to go, this is just a checkpoint. This is far better than the default model, though! It learns to adjust the bboxes to your style very early in the training, so layout parsing is easier to work into the model.
August 6, 2025 at 3:16 PM
Woot! Just got the first finetune of Dots.OCR on @hf.co! Going to share the training script, hopefully today. This finetune is for Old Church Slavonic.
August 6, 2025 at 1:14 PM
Reposted by William J.B. Mattingly
Yeah, really impressive. There are some minor errors, but really much better than anything I've tried until now.
August 6, 2025 at 11:59 AM
@ponteineptique.bsky.social any interest in finetuning this with me on HTR-United data? I'm thinking the logical place to start is CATmuS segmented with the HTR data. The model can handle both layout parsing and HTR; is there a good way to get both kinds of data for all of CATmuS?
August 5, 2025 at 3:37 PM