Lightnews — Scholar-powered news

Daniel van Strien

@danielvanstrien.bsky.social

Very much looking forward to presenting at this tomorrow. I will be making my usual pitch that datasets are the foundational infrastructure for cultural heritage to benefit from and create useful AI models and tools.

Be warned, I did fire up the meme generator for my slides...

Meme showing two versions of the doge dog.
Top: muscular buff doge labeled 'Public Domain circa 2000' saying 'Open access to knowledge and culture is a universal good'.
Bottom: small weak doge labeled 'Public Domain post LLMs?' saying 'Someone might use public domain content for training an LLM' with worried emoji.

November 5, 2025 at 5:40 PM

Reposted by Daniel van Strien

William J.B. Mattingly

@wjbmattingly.bsky.social

Over the last 24 hours, I have finetuned three Qwen3-VL models (2B, 4B, and 8B) on the CATmuS dataset on @hf.co . The first version of the models are now available on the Small Models for GLAM organization with @danielvanstrien.bsky.social (Links below) Working on improving them further.

October 24, 2025 at 2:59 PM

Daniel van Strien

@danielvanstrien.bsky.social

DeepSeek-OCR just got vLLM support 🚀

Currently processing @natlibscot.bsky.social's 27,915-page handbook collection with one command.

Processing at ~350 images/sec on A100

Using @hf.co Jobs + uv - zero setup batch OCR!

Will share final time + cost when done!

October 22, 2025 at 7:20 PM

Reposted by Daniel van Strien

Tom Aarsen

@tomaarsen.com

🤗 Sentence Transformers is joining @hf.co! 🤗

This formalizes the existing maintenance structure, as I've personally led the project for the past two years on behalf of Hugging Face. I'm super excited about the transfer!

Details in 🧵

October 22, 2025 at 1:04 PM

Daniel van Strien

@danielvanstrien.bsky.social

OCR is one of AI's oldest challenges (first systems: early 1900s!)

Modern vision-language models have transformed what's possible: handwriting, 100+ languages, math formulas, tables, signature extraction...

New @hf.co guide on OCR

huggingface.co/blog/ocr-ope...

Supercharge your OCR Pipelines with Open Models

We’re on a journey to advance and democratize artificial intelligence through open source and open science.

huggingface.co

October 22, 2025 at 8:58 AM

Reposted by Daniel van Strien

Daniel van Strien

@danielvanstrien.bsky.social

Small models work great for GLAM but there aren't enough examples!

With @wjbmattingly.bsky.social I'm launching small-models-for-glam on @hf.co to create/curate models that run on modest hardware and address GLAM use cases.

Follow the org to keep up-to-date!
huggingface.co/small-models...

October 16, 2025 at 1:22 PM

Daniel van Strien

@danielvanstrien.bsky.social

Small models work great for GLAM but there aren't enough examples!

With @wjbmattingly.bsky.social I'm launching small-models-for-glam on @hf.co to create/curate models that run on modest hardware and address GLAM use cases.

Follow the org to keep up-to-date!
huggingface.co/small-models...

October 16, 2025 at 1:22 PM

Daniel van Strien

@danielvanstrien.bsky.social

Very nice work! IMO, this is the kind of topic that more libraries/GLAM/DH people should be working on. The training of these models is *relatively* simple. As always, the missing ingredient is readily accessible data.

Thibault Clérice @ponteineptique.bsky.social · 27d

It's been brewing for months: @inriaparisnlp.bsky.social releases CoMMA (Corpus of Multilingual Medieval Archives) !

📚 2.5bn tokens of mostly Latin and French texts
🕰️ 800→1600 CE
📜 23k manuscripts
🖥️ 18k on the reading interface: comma.inria.fr
🔍 Paper: inria.hal.science/hal-05299220v1

(1/🧵)

CoMMA

comma.inria.fr

October 15, 2025 at 3:55 PM

Reposted by Daniel van Strien

Thibault Clérice

@ponteineptique.bsky.social

It's been brewing for months: @inriaparisnlp.bsky.social releases CoMMA (Corpus of Multilingual Medieval Archives) !

📚 2.5bn tokens of mostly Latin and French texts
🕰️ 800→1600 CE
📜 23k manuscripts
🖥️ 18k on the reading interface: comma.inria.fr
🔍 Paper: inria.hal.science/hal-05299220v1

(1/🧵)

CoMMA

comma.inria.fr

October 15, 2025 at 2:51 PM

Reposted by Daniel van Strien

Daniel van Strien

@danielvanstrien.bsky.social

Another week, another VLM-based OCR model!

Nanonets just released OCR2 - a 3B parameter vision-language model for document OCR 📄

You can run it with one command on @hf.co Jobs (no local GPU needed)

screenshot of the hf jobs command to run the model.

October 13, 2025 at 6:13 PM

Daniel van Strien

@danielvanstrien.bsky.social

Another week, another VLM-based OCR model!

Nanonets just released OCR2 - a 3B parameter vision-language model for document OCR 📄

You can run it with one command on @hf.co Jobs (no local GPU needed)

October 13, 2025 at 6:13 PM

Reposted by Daniel van Strien

Daniel van Strien

@danielvanstrien.bsky.social

DoTS.ocr just got native vLLM support!

I built a UV script so you can run SOTA multilingual OCR in seconds with zero setup using @hf.co Jobs

Tested on 1800s library cards - works great ✨

Screenshot of an index card with annotated bounding box predictions from the ocr model

October 7, 2025 at 3:45 PM

Daniel van Strien

@danielvanstrien.bsky.social

DoTS.ocr just got native vLLM support!

I built a UV script so you can run SOTA multilingual OCR in seconds with zero setup using @hf.co Jobs

Tested on 1800s library cards - works great ✨

October 7, 2025 at 3:45 PM

Daniel van Strien

@danielvanstrien.bsky.social

Card catalogues aren't just a relic of the past - many institutions still rely on them because full migration is too expensive. VLMs could help change that.

I uploaded two new @hf.co datasets (~470K cards) for training/evaluating models to extract structured metadata from catalogue cards.

October 6, 2025 at 9:30 AM

Reposted by Daniel van Strien

Jay 🦋

@jay.bsky.team

We’re hiring for two machine learning roles. A chance to do cutting edge things with ML to make this place a lot more personalized.

jobs.gem.com/bluesky/am9i...

Bluesky Jobs

jobs.gem.com

October 1, 2025 at 1:14 AM

Daniel van Strien

@danielvanstrien.bsky.social

New @hf.co BigLAM dataset: 9,363 OA books with page images + rich MARC metadata for evaluating (and training) VLMs on metadata extraction.

Libraries are starting to explore AI-assisted cataloguing, but we lack public evaluation data. Hoping this helps fill that gap.

huggingface.co/datasets/big...

Screenshot of the dataset viewer showing a column of marc data + the first few pages of an open access monograph

October 2, 2025 at 6:51 PM

Reposted by Daniel van Strien

Daniel van Strien

@danielvanstrien.bsky.social

Blogged: Fine-tuning a VLM for art history in hours, not weeks

iconclass-vlm generates museum catalog codes (fun fact: "71H7131" = "Bathsheba with David's letter"!)

@hf.co TRL + Jobs = magic ✨

Guide here: danielvanstrien.xyz/posts/2025/i...

danielvanstrien.xyz

September 4, 2025 at 8:19 PM

Daniel van Strien

@danielvanstrien.bsky.social

Blogged: Fine-tuning a VLM for art history in hours, not weeks

iconclass-vlm generates museum catalog codes (fun fact: "71H7131" = "Bathsheba with David's letter"!)

@hf.co TRL + Jobs = magic ✨

Guide here: danielvanstrien.xyz/posts/2025/i...

danielvanstrien.xyz

September 4, 2025 at 8:19 PM

Daniel van Strien

@danielvanstrien.bsky.social

I fine-tuned a smol VLM to generate specialized art history metadata!

iconclass-vlm: Qwen2.5-VL-3B trained using SFT to generate ICONCLASS codes (think Dewey Decimal for art!)

Trained with @hf.co TRL + Jobs - single UV script, no GPU needed!

Blog soon!

Screenshot of the iconclass-vlm model demo showing predictions for a 17th century portrait painting of a standing woman in black dress with white ruff collar. The interface displays the model's raw JSON prediction with ICONCLASS codes, then compares predictions against ground truth labels in two columns. Model correctly identifies "31A231 standing figure" and "61B(+55) historical persons (portraits and scenes from the life) (+ full length portrait)" among others, achieving 3 out of 6 matches. Some predictions marked as "Not a valid iconclass label" showing areas where the model needs improvement.

September 3, 2025 at 6:22 PM

Daniel van Strien

@danielvanstrien.bsky.social

What if OCR models could show you their thought process?

NuMarkdown-8B-Thinking from NuMind (YC S22) doesn't just extract text - it reasons through documents first.

Could be pretty valuable for weird historical documents?

Example here: davanstrien-ocr-time-capsule.static.hf.space/index.html?d...

Screenshot of an app showing an image from a page + model reasoning showing how the model is parsing the text and layout.

August 7, 2025 at 3:16 PM

Daniel van Strien

@danielvanstrien.bsky.social

You can now generate synthetic data using OpenAIs GPT OSS models on @hf.co Jobs!

One command, no setup:

hf jobs uv run --flavor l4x4 [script-url] \
--input-dataset your/dataset \
--output-dataset your/output

Works on L4 GPUs ⚡

huggingface.co/datasets/uv-...

uv-scripts/openai-oss · Datasets at Hugging Face

We’re on a journey to advance and democratize artificial intelligence through open source and open science.

huggingface.co

August 6, 2025 at 7:38 AM

Reposted by Daniel van Strien

Toby Hodges

@tobyhodges.carpentries.org

My latest post on @carpentries.carpentries.org blog is a call to action for the community to engage in curriculum development for workshops about genAI. How can we help our target audience make more informed choices about when and how to use it?

carpentries.org/blog/2025/08...

AI Carpentry? Helping learners make better choices with genAI.

In two recent community discussion sessions, we explored what mental model of machine learning/deep learning we could teach to learners already familiar with the basics of programming, to help them sa...

carpentries.org

August 5, 2025 at 9:24 AM

Daniel van Strien

@danielvanstrien.bsky.social

I’m continuing my experiments with VLM-based OCR…

How well do these models handle Victorian theatre playbills from @bldigischol.bsky.social?

RolmOCR vs traditional OCR on tricky playbills (ornate fonts, faded ink, DRAMATIC ALL CAPS!)

@hf.co Demo: huggingface.co/spaces/davan...

Screenshot of a plyabill with some OCR results on the right

August 5, 2025 at 9:17 AM

Reposted by Daniel van Strien

Daniel van Strien

@danielvanstrien.bsky.social

Many VLM-based OCR models have been released recently. Are they useful for libraries and archives?

I made a quick Space to compare VLM OCR with "traditional" OCR using 11k Scottish exam papers from @natlibscot.bsky.social

huggingface.co/spaces/davanstrien/ocr-time-capsule

Screenshot of the app showing a page from a book + different views of existing and new ocr.

August 1, 2025 at 3:09 PM

Daniel van Strien

@danielvanstrien.bsky.social

Many VLM-based OCR models have been released recently. Are they useful for libraries and archives?

I made a quick Space to compare VLM OCR with "traditional" OCR using 11k Scottish exam papers from @natlibscot.bsky.social

huggingface.co/spaces/davanstrien/ocr-time-capsule

August 1, 2025 at 3:09 PM

Light upyour news

Sign in to Lightnews

Sign up to start reading

Connect Bluesky

Connect with Bluesky

Light up
your news