Lightnews — Scholar-powered news

Reposted by Daniel van Strien

IIIF Consortium

@iiif.bsky.social

Join us tomorrow for a demo of IIIF Illustration Detector!

Zoom link: iiif.io/community

Join us February 11 for a demo of @danielvanstrien.bsky.social's IIIF Illustration Detector.

Zoom on the IIIF Community Calendar: iiif.io/community

IIIF Community Call: IIIF Illustration Detector, Wednesday, February 11 (9am PT / 12pm ET / 5pm UTC)

February 10, 2026 at 5:22 PM

Daniel van Strien

@danielvanstrien.bsky.social

Datasets and benchmarks drive AI progress, but finding papers that introduce new ones means digging through thousands of arXiv abstracts.

Updated the Dataset Papers on ArXiv app to surface them: 52K+ papers classified as introducing new datasets from 212K CS papers.

February 9, 2026 at 10:13 AM

Reposted by Daniel van Strien

IIIF Consortium

@iiif.bsky.social

Join us February 11 for a demo of @danielvanstrien.bsky.social's IIIF Illustration Detector.

Zoom on the IIIF Community Calendar: iiif.io/community

February 3, 2026 at 7:45 PM

Reposted by Daniel van Strien

Daniel van Strien

@danielvanstrien.bsky.social

Built an object detector from zero-labelled data in one afternoon with help from Claude Code (it can do more than vibe code, TODO apps...)

SAM3 on HF Jobs → correct the errors → train YOLO → repeat.

Three rounds: 31% → 99% accuracy on historical index cards from @natlibscot.bsky.social

Image of an index card with a green bounding box prediction around the card contents.

February 2, 2026 at 4:43 PM
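
The predict → correct → train → repeat loop described in the post above can be sketched roughly as follows. This is a minimal illustration, not the author's code: `bootstrap_detector` and its three callables (`zero_shot_predict`, `correct_labels`, `train_detector`) are hypothetical placeholders standing in for the SAM3 job, the manual review step, and YOLO training respectively.

```python
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]    # (x_min, y_min, x_max, y_max)
Labels = List[List[Box]]                   # one list of boxes per image
Predictor = Callable[[List[str]], Labels]  # image paths in, boxes out


def bootstrap_detector(
    images: List[str],
    zero_shot_predict: Predictor,
    correct_labels: Callable[[List[str], Labels], Labels],
    train_detector: Callable[[List[str], Labels], Predictor],
    rounds: int = 3,
) -> Predictor:
    """Predict, correct, retrain, repeat; return the last trained predictor."""
    # The first round starts from a promptable model (SAM3 in the post).
    predict = zero_shot_predict
    for _ in range(rounds):
        proposed = predict(images)                   # machine-generated boxes
        reviewed = correct_labels(images, proposed)  # a human fixes the errors
        predict = train_detector(images, reviewed)   # e.g. train a YOLO model
    return predict
```

Each round replaces the predictor with one trained on the corrected labels, so the amount of manual correction should shrink as the model improves, which matches the three-round progression reported in the post.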

Reposted by Daniel van Strien

Paul Fairie

@paulisci.bsky.social

We used to do real science

SCIENTISTS QUARREL OVER MARTIAN WOMEN

One Says Ladies Have Two Thumbs and X-Ray Eyes

ANOTHER SAYS BIG EARS

--The Commercial Appeal, 22 Oct 1928

January 12, 2026 at 1:59 AM

Reposted by Daniel van Strien

Daniel van Strien

@danielvanstrien.bsky.social

Built a 2.5MB image classifier that runs in the browser in an evening with Claude Code.

I used a dataset I labelled in 2022 and left on @hf.co for 3 years 😬.

It finds illustrated pages in historical books. No server. No GPU.

December 19, 2025 at 12:08 PM

Daniel van Strien

@danielvanstrien.bsky.social

Just posted my slides from the AI4LAM #FF2025 workshop on open source AI for GLAMs.

Slides on their own probably aren't that useful, but they do feature one of my growing collection of libraries-and-AI memes, so there's that: danielvanstrien.xyz/slides.html

Here's alt text for the meme:

Alt text: "Flex Tape meme format. Top panel: Phil Swift (labeled 'Library Systems Vendor') aggressively spraying water representing 'Outdated systems, metadata issues, disjointed search and complex user needs.' Bottom panel: A hand slapping Flex Tape underwater, with the tape labeled 'AI-powered chat interface.'"

This captures the joke that vendors are positioning AI chat as a quick fix for deep-seated library infrastructure problems—a bit like slapping tape on a leak rather than fixing the plumbing.

December 9, 2025 at 10:13 AM

Daniel van Strien

@danielvanstrien.bsky.social

At the AI4LAM Fantastic Futures conference this week

Happy to chat about @hf.co, open source AI for GLAMs, or why cultural heritage should bet on small, focused models over closed-source giants!

DM or find me at breaks! #AI4LAM #FF2025

December 1, 2025 at 11:13 AM

Daniel van Strien

@danielvanstrien.bsky.social

Building datasets to train smaller, task-focused models used to be incredibly time-consuming.

Very excited to see SAM3 massively lower that barrier. Describe the class you want to detect and get annotated datasets automatically!

Try it yourself: huggingface.co/datasets/uv-...!

Screenshot of a simple app showing bounding boxes for photographs detected in historic newspaper images.

hf jobs uv run \
  --flavor a100-large \
  -s HF_TOKEN=HF_TOKEN \
  https://huggingface.co/datasets/uv-scripts/sam3/raw/main/detect-objects.py \
  -- davanstrien/newspapers-with-images-after-photography-big \
  davanstrien/newspapers-photo-predictions \
  --class-name "photograph" \
  --confidence-threshold 0.4

November 21, 2025 at 1:30 PM
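
For readers unfamiliar with the `--confidence-threshold 0.4` flag in the command above, the general idea of confidence thresholding can be sketched as below. This is an illustration only: the `Detection` class and `keep_confident` function are hypothetical, and whether the actual script keeps scores equal to the threshold is an assumption.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Detection:
    label: str
    score: float                            # model confidence in [0, 1]
    box: Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def keep_confident(detections: List[Detection], threshold: float = 0.4) -> List[Detection]:
    """Drop detections whose score falls below the threshold."""
    # Assumption: scores exactly at the threshold are kept.
    return [d for d in detections if d.score >= threshold]


detections = [
    Detection("photograph", 0.91, (10.0, 20.0, 200.0, 180.0)),
    Detection("photograph", 0.35, (5.0, 5.0, 40.0, 40.0)),
    Detection("photograph", 0.40, (300.0, 50.0, 420.0, 160.0)),
]
kept = keep_confident(detections)
print([d.score for d in kept])  # [0.91, 0.4]
```

A lower threshold yields more candidate boxes (and more false positives to correct by hand); a higher one yields fewer, cleaner boxes at the cost of missed objects.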

Daniel van Strien

@danielvanstrien.bsky.social

Very much looking forward to presenting at this tomorrow. I will be making my usual pitch that datasets are the foundational infrastructure for cultural heritage to benefit from and create useful AI models and tools.

Be warned, I did fire up the meme generator for my slides...

Meme showing two versions of the doge dog.
Top: muscular buff doge labeled 'Public Domain circa 2000' saying 'Open access to knowledge and culture is a universal good'.
Bottom: small weak doge labeled 'Public Domain post LLMs?' saying 'Someone might use public domain content for training an LLM' with worried emoji.

November 5, 2025 at 5:40 PM

Reposted by Daniel van Strien

William J.B. Mattingly

@wjbmattingly.bsky.social

Over the last 24 hours, I have finetuned three Qwen3-VL models (2B, 4B, and 8B) on the CATmuS dataset on @hf.co. The first versions of the models are now available on the Small Models for GLAM organization with @danielvanstrien.bsky.social (links below). Working on improving them further.

October 24, 2025 at 2:59 PM

Daniel van Strien

@danielvanstrien.bsky.social

DeepSeek-OCR just got vLLM support 🚀

Currently processing @natlibscot.bsky.social's 27,915-page handbook collection with one command.

Processing at ~350 images/sec on A100

Using @hf.co Jobs + uv - zero setup batch OCR!

Will share final time + cost when done!

October 22, 2025 at 7:20 PM

Reposted by Daniel van Strien

Tom Aarsen

@tomaarsen.com

🤗 Sentence Transformers is joining @hf.co! 🤗

This formalizes the existing maintenance structure, as I've personally led the project for the past two years on behalf of Hugging Face. I'm super excited about the transfer!

Details in 🧵

October 22, 2025 at 1:04 PM

Daniel van Strien

@danielvanstrien.bsky.social

OCR is one of AI's oldest challenges (first systems: early 1900s!)

Modern vision-language models have transformed what's possible: handwriting, 100+ languages, math formulas, tables, signature extraction...

New @hf.co guide on OCR

huggingface.co/blog/ocr-ope...

Supercharge your OCR Pipelines with Open Models

October 22, 2025 at 8:58 AM

Daniel van Strien

@danielvanstrien.bsky.social

Small models work great for GLAM but there aren't enough examples!

With @wjbmattingly.bsky.social I'm launching small-models-for-glam on @hf.co to create/curate models that run on modest hardware and address GLAM use cases.

Follow the org to keep up-to-date!
huggingface.co/small-models...

October 16, 2025 at 1:22 PM

Daniel van Strien

@danielvanstrien.bsky.social

Very nice work! IMO, this is the kind of topic that more libraries/GLAM/DH people should be working on. The training of these models is *relatively* simple. As always, the missing ingredient is readily accessible data.

Thibault Clérice @ponteineptique.bsky.social · Oct 15

It's been brewing for months: @inriaparisnlp.bsky.social releases CoMMA (Corpus of Multilingual Medieval Archives)!

📚 2.5bn tokens of mostly Latin and French texts
🕰️ 800→1600 CE
📜 23k manuscripts
🖥️ 18k on the reading interface: comma.inria.fr
🔍 Paper: inria.hal.science/hal-05299220v1

(1/🧵)

October 15, 2025 at 3:55 PM

Reposted by Daniel van Strien

Thibault Clérice

@ponteineptique.bsky.social

It's been brewing for months: @inriaparisnlp.bsky.social releases CoMMA (Corpus of Multilingual Medieval Archives)!

📚 2.5bn tokens of mostly Latin and French texts
🕰️ 800→1600 CE
📜 23k manuscripts
🖥️ 18k on the reading interface: comma.inria.fr
🔍 Paper: inria.hal.science/hal-05299220v1

(1/🧵)

October 15, 2025 at 2:51 PM

Reposted by Daniel van Strien

Daniel van Strien

@danielvanstrien.bsky.social

Another week, another VLM-based OCR model!

Nanonets just released OCR2 - a 3B parameter vision-language model for document OCR 📄

You can run it with one command on @hf.co Jobs (no local GPU needed)

Screenshot of the hf jobs command to run the model.

October 13, 2025 at 6:13 PM

Reposted by Daniel van Strien

Daniel van Strien

@danielvanstrien.bsky.social

DoTS.ocr just got native vLLM support!

I built a uv script so you can run SOTA multilingual OCR in seconds with zero setup using @hf.co Jobs

Tested on 1800s library cards - works great ✨

Screenshot of an index card with annotated bounding box predictions from the OCR model.

October 7, 2025 at 3:45 PM

Daniel van Strien

@danielvanstrien.bsky.social

Card catalogues aren't just a relic of the past - many institutions still rely on them because full migration is too expensive. VLMs could help change that.

I uploaded two new @hf.co datasets (~470K cards) for training/evaluating models to extract structured metadata from catalogue cards.

October 6, 2025 at 9:30 AM
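
To make "structured metadata from catalogue cards" concrete, here is a minimal sketch of what a target record and a tolerant parser for a model's JSON output might look like. The field names (`title`, `author`, `shelfmark`) and the `CardRecord` and `parse_card_json` names are hypothetical; the post does not describe the datasets' actual schema.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class CardRecord:
    # Hypothetical fields; the real datasets may use a different schema.
    title: Optional[str] = None
    author: Optional[str] = None
    shelfmark: Optional[str] = None


def parse_card_json(raw: str) -> CardRecord:
    """Parse a model's JSON output, ignoring unknown keys and tolerating bad JSON."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return CardRecord()  # malformed output becomes an empty record
    if not isinstance(data, dict):
        return CardRecord()  # valid JSON, but not an object
    known = {key: data.get(key) for key in ("title", "author", "shelfmark")}
    return CardRecord(**known)
```

Returning an empty record rather than raising keeps a batch run going when a model occasionally emits invalid JSON; the empty records can then be counted as extraction failures during evaluation.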
