Lightnews — Scholar-powered news

Daniel van Strien

@danielvanstrien.bsky.social

Very much looking forward to presenting at this tomorrow. I will be making my usual pitch that datasets are the foundational infrastructure for cultural heritage to benefit from and create useful AI models and tools.

Be warned, I did fire up the meme generator for my slides...

Meme showing two versions of the doge dog.
Top: muscular buff doge labeled 'Public Domain circa 2000' saying 'Open access to knowledge and culture is a universal good'.
Bottom: small weak doge labeled 'Public Domain post LLMs?' saying 'Someone might use public domain content for training an LLM' with worried emoji.

November 5, 2025 at 5:40 PM

Daniel van Strien

@danielvanstrien.bsky.social

huggingface.co/nanonets/Nan... might be worth a try for this. Can extract formulas into LaTeX

October 23, 2025 at 2:01 PM

Daniel van Strien

@danielvanstrien.bsky.social

The command (using @hf.co Jobs - serverless GPU compute)

Full script at huggingface.co/datasets/uv-...

hf jobs uv run --flavor a100-large --timeout 2h \
  -s HF_TOKEN \
  https://huggingface.co/datasets/uv-scripts/ocr/raw/main/deepseek-ocr-vllm.py \
  NationalLibraryOfScotland/Britain-and-UK-Handbooks-Dataset \
  davanstrien/handbooks-deep-ocr \
  --resolution-mode base \
  --batch-size 2048 \
  --prompt-mode free

October 22, 2025 at 7:20 PM

Daniel van Strien

@danielvanstrien.bsky.social

DeepSeek-OCR just got vLLM support 🚀

Currently processing @natlibscot.bsky.social's 27,915-page handbook collection with one command.

Processing at ~350 images/sec on A100

Using @hf.co Jobs + uv - zero setup batch OCR!

Will share final time + cost when done!

October 22, 2025 at 7:20 PM

Daniel van Strien

@danielvanstrien.bsky.social

Small models work great for GLAM but there aren't enough examples!

With @wjbmattingly.bsky.social I'm launching small-models-for-glam on @hf.co to create/curate models that run on modest hardware and address GLAM use cases.

Follow the org to keep up-to-date!
huggingface.co/small-models...

October 16, 2025 at 1:22 PM

Daniel van Strien

@danielvanstrien.bsky.social

Another week, another VLM-based OCR model!

Nanonets just released OCR2 - a 3B parameter vision-language model for document OCR 📄

You can run it with one command on @hf.co Jobs (no local GPU needed)

screenshot of the hf jobs command to run the model.

October 13, 2025 at 6:13 PM

Daniel van Strien

@danielvanstrien.bsky.social

DoTS.ocr just got native vLLM support!

I built a UV script so you can run SOTA multilingual OCR in seconds with zero setup using @hf.co Jobs

Tested on 1800s library cards - works great ✨

Screenshot of an index card with annotated bounding box predictions from the ocr model

October 7, 2025 at 3:45 PM

Daniel van Strien

@danielvanstrien.bsky.social

Card catalogues aren't just a relic of the past - many institutions still rely on them because full migration is too expensive. VLMs could help change that.

I uploaded two new @hf.co datasets (~470K cards) for training/evaluating models to extract structured metadata from catalogue cards.

October 6, 2025 at 9:30 AM

Daniel van Strien

@danielvanstrien.bsky.social

New @hf.co BigLAM dataset: 9,363 OA books with page images + rich MARC metadata for evaluating (and training) VLMs on metadata extraction.

Libraries are starting to explore AI-assisted cataloguing, but we lack public evaluation data. Hoping this helps fill that gap.

huggingface.co/datasets/big...

Screenshot of the dataset viewer showing a column of marc data + the first few pages of an open access monograph

October 2, 2025 at 6:51 PM

Daniel van Strien

@danielvanstrien.bsky.social

I fine-tuned a smol VLM to generate specialized art history metadata!

iconclass-vlm: Qwen2.5-VL-3B trained using SFT to generate ICONCLASS codes (think Dewey Decimal for art!)

Trained with @hf.co TRL + Jobs - single UV script, no GPU needed!

Blog soon!

Screenshot of the iconclass-vlm model demo showing predictions for a 17th century portrait painting of a standing woman in black dress with white ruff collar. The interface displays the model's raw JSON prediction with ICONCLASS codes, then compares predictions against ground truth labels in two columns. Model correctly identifies "31A231 standing figure" and "61B(+55) historical persons (portraits and scenes from the life) (+ full length portrait)" among others, achieving 3 out of 6 matches. Some predictions marked as "Not a valid iconclass label" showing areas where the model needs improvement.

September 3, 2025 at 6:22 PM

Daniel van Strien

@danielvanstrien.bsky.social

Try it with one line of code via Jobs!

It processes images from any dataset and outputs a new dataset with extracted markdown - all using HF GPUs.

See the full OCR uv scripts collection: huggingface.co/datasets/uv-...

Screenshot of a hf jobs uv run command with some flags and a URL pointing to a script.

August 7, 2025 at 3:16 PM

Daniel van Strien

@danielvanstrien.bsky.social

What if OCR models could show you their thought process?

NuMarkdown-8B-Thinking from NuMind (YC S22) doesn't just extract text - it reasons through documents first.

Could be pretty valuable for weird historical documents?

Example here: davanstrien-ocr-time-capsule.static.hf.space/index.html?d...

Screenshot of an app showing an image from a page + model reasoning showing how the model is parsing the text and layout.

August 7, 2025 at 3:16 PM

Daniel van Strien

@danielvanstrien.bsky.social

I’m continuing my experiments with VLM-based OCR…

How well do these models handle Victorian theatre playbills from @bldigischol.bsky.social?

RolmOCR vs traditional OCR on tricky playbills (ornate fonts, faded ink, DRAMATIC ALL CAPS!)

@hf.co Demo: huggingface.co/spaces/davan...

Screenshot of a playbill with some OCR results on the right

August 5, 2025 at 9:17 AM

Daniel van Strien

@danielvanstrien.bsky.social

Many VLM-based OCR models have been released recently. Are they useful for libraries and archives?

I made a quick Space to compare VLM OCR with "traditional" OCR using 11k Scottish exam papers from @natlibscot.bsky.social

huggingface.co/spaces/davanstrien/ocr-time-capsule

Screenshot of the app showing a page from a book + different views of existing and new ocr.

August 1, 2025 at 3:09 PM

Daniel van Strien

@danielvanstrien.bsky.social

HF Jobs just launched! 🚀

One command VLM based OCR with uv Scripts:

hf jobs uv run [script] ufo-images ufo-text

Classified UFO docs → clean markdown. Zero setup!

Try it → huggingface.co/datasets/uv-...

July 29, 2025 at 8:48 AM

Daniel van Strien

@danielvanstrien.bsky.social

465 people. 122 languages. 58,185 annotations!

FineWeb-C v1 is complete! Communities worldwide have built their own educational quality datasets, proving that we don't need to wait for big tech to support languages.

Huge thanks to all who contributed!

huggingface.co/blog/davanst...

July 8, 2025 at 12:07 PM

Daniel van Strien

@danielvanstrien.bsky.social

Everyone’s dropping VLM-based OCR models lately…
But are they actually better than traditional OCR engines, which output XML for historical docs?

I built OCR Time Machine to test it!

📄 Upload image + ALTO/PAGE XML
⚖️ Compare outputs side by side
🔗 huggingface.co/spaces/davan...

Screenshot showing a document page image on the left with corresponding OCR output on the right of the page.

June 24, 2025 at 5:35 PM
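
For readers who want a number alongside the side-by-side view described in the post above, here is a minimal sketch of one way to score how closely two OCR outputs agree. It is not how OCR Time Machine works; the function names and the sample strings are hypothetical, and the score is a rough character-level similarity from Python's standard difflib module, not a proper character error rate against ground truth.

```python
from difflib import SequenceMatcher


def normalise(text: str) -> str:
    """Lower-case the text and collapse all runs of whitespace to single spaces."""
    return " ".join(text.lower().split())


def similarity(ocr_a: str, ocr_b: str) -> float:
    """Return a 0.0-1.0 similarity ratio between two OCR outputs.

    This only measures agreement between the two outputs; it says nothing
    about which of them is closer to the real page.
    """
    return SequenceMatcher(None, normalise(ocr_a), normalise(ocr_b)).ratio()


# Hypothetical strings standing in for a traditional OCR result and a VLM result.
traditional_ocr = "THEATRE R0YAL,  DRVRY LANE"
vlm_ocr = "Theatre Royal, Drury Lane"

print(f"similarity: {similarity(traditional_ocr, vlm_ocr):.2f}")
```

A low score flags pages where the two systems disagree and a human look is worthwhile; deciding which output is right still needs a transcription to compare against.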

Daniel van Strien

@danielvanstrien.bsky.social

The @europeana.bsky.social community has launched a “Culture for AI” assembly. It’s asking cultural heritage folks:

how should we critically engage with AI?

Can you guess how I answered the question below?!

Heritage organisations should actively develop open-source AI models to provide alternatives to big tech’s control over cultural data.

June 18, 2025 at 8:09 AM

Daniel van Strien

@danielvanstrien.bsky.social

“AI Scraping Bots Are Breaking Open Libraries, Archives, and Museums” – interesting piece via @404media.co

Not a perfect fix, but making ML-ready datasets from collections can help.

If you want help getting your data on @hf.co, I'd be happy to help.

Screenshot of the header of the article with text:

AI Scraping Bots Are Breaking Open Libraries, Archives, and Museums

June 17, 2025 at 10:43 AM

Daniel van Strien

@danielvanstrien.bsky.social

As usual, I'm allowing myself a few memes for this presentation.

The serious point of this one is that the barrier to doing data work has gotten much lower in the past year or two. You don't need to be an expert in XML to do useful stuff with XML data anymore.

Sure grandma let's get you to bed meme with text

I manually wrote XSLT to convert ALTO XML OCR into AI-ready text

Sure grandma, let's get you to bed.

June 16, 2025 at 11:10 AM
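
To illustrate the point in the post above that ALTO XML no longer needs hand-written XSLT, here is a minimal sketch that pulls plain text out of an ALTO document with Python's standard library. It assumes the usual ALTO layout, where each TextLine holds String elements whose CONTENT attribute carries a word. The function names and the sample document are hypothetical, and hyphenation (HYP elements) and reading order across blocks are ignored.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip any '{namespace}' prefix so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def alto_to_text(alto_xml: str) -> str:
    """Return one line of plain text per ALTO TextLine element."""
    root = ET.fromstring(alto_xml)
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        # Each String child carries one word in its CONTENT attribute.
        words = [
            child.attrib.get("CONTENT", "")
            for child in element
            if local_name(child.tag) == "String"
        ]
        lines.append(" ".join(word for word in words if word))
    return "\n".join(lines)


# Hypothetical sample document, not taken from any real collection.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine>
      <String CONTENT="Open"/><SP/><String CONTENT="access"/>
    </TextLine>
    <TextLine>
      <String CONTENT="to"/><SP/><String CONTENT="knowledge"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""

print(alto_to_text(SAMPLE))
```

Matching on the local tag name rather than a fixed namespace is a deliberate choice here, since ALTO files in the wild declare different schema versions.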

Daniel van Strien

@danielvanstrien.bsky.social

Did you set the read-only token? huggingface.co/settings/mcp (this is for the official one). For my hacked-together version, you don't need a token!

June 9, 2025 at 2:21 PM

Daniel van Strien

@danielvanstrien.bsky.social

Inspired by @hf.co's official MCP server, I built my own to expose my semantic search API for the HF ecosystem!

Features AI-powered search, parameter analysis via safetensors, and tools to find similar models/datasets.

Try: "Find non maths reasoning datasets from 2025"!

June 9, 2025 at 11:09 AM

Daniel van Strien

@danielvanstrien.bsky.social

This is the most exciting part of this DeepSeek release for me.
huggingface.co/deepseek-ai/...

Screenshot of this text:

Meanwhile, we distilled the chain-of-thought from DeepSeek-R1-0528 to post-train Qwen3 8B Base, obtaining DeepSeek-R1-0528-Qwen3-8B. This model achieves state-of-the-art (SOTA) performance among open-source models on the AIME 2024, surpassing Qwen3 8B by +10.0% and matching the performance of Qwen3-235B-thinking. We believe that the chain-of-thought from DeepSeek-R1-0528 will hold significant importance for both academic research on reasoning models and industrial development focused on small-scale models.

+ a table of metrics

May 29, 2025 at 1:41 PM

Daniel van Strien

@danielvanstrien.bsky.social

🗞️ Just released a Parquet version of the Newspaper Navigator dataset on @hf.co!

- 3M+ visual elements from historic US newspapers — photos, maps, cartoons, OCR + metadata.
- Parquet = fast filters, easier analysis.
- Great for ML + cultural research.

👉 huggingface.co/datasets/big...

Screenshot of the dataset viewer on the Hugging Face Hub. Shows a set of metadata for the newspaper navigator dataset. It also has previews of a few rows showing images alongside metadata columns.

May 20, 2025 at 11:50 AM

Daniel van Strien

@danielvanstrien.bsky.social

Finally documented the Beyond Words dataset from the @librarycongress.bsky.social labs / @bcgl.bsky.social for the BigLAM @hf.co org!

- 3.5K annotated historical newspaper pages
- Bounding boxes + category labels
- Photos, ads, headlines, cartoons & more

Image of a historic newspaper with bounding box predictions for "photographs" "headline" "illustration" etc.

May 8, 2025 at 8:41 AM
