Andi
@andimara.bsky.social
Multimodal research @huggingface
Train your Vision-Language Model in just two commands:
> git clone github.com/huggingface/...
> python train.py
GitHub - huggingface/nanoVLM: The simplest, fastest repository for training/finetuning small-sized VLMs.
May 21, 2025 at 1:10 PM
If you’re into efficient multimodal models, you’ll love this one.
Check out the paper: huggingface.co/papers/2504....
Paper page - SmolVLM: Redefining small and efficient multimodal models
April 8, 2025 at 3:12 PM
📱 Real-world Efficiency: We've created an app using SmolVLM on an iPhone 15 and got real-time inference directly from its camera!
🌐 Browser-based Inference? Yep! We get lightning-fast inference speeds of 40-80 tokens per second directly in a web browser. No tricks, just compact, efficient models!
April 8, 2025 at 3:12 PM
🌟 State-of-the-Art Performance: SmolVLM comes in three powerful yet compact sizes (256M, 500M, and 2.2B parameters), each setting a new SOTA for its hardware budget in image and video understanding.
April 8, 2025 at 3:12 PM
✨ Less CoT, more efficiency: Turns out, too much Chain-of-Thought (CoT) data actually hurts performance in small models; it just makes them dumber.
✨ Longer videos, better results: Increasing video length during training enhanced performance on both video and image tasks.
April 8, 2025 at 3:12 PM
✨ System prompts and special tokens are key: Introducing a system prompt and dedicated media intro/outro tokens significantly boosted our compact VLM’s performance, especially for video tasks (sketch after this post).
April 8, 2025 at 3:12 PM
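To make that concrete, here is a minimal sketch of how a prompt can be assembled from a system prompt plus media intro/outro tokens. The token names are hypothetical, not the exact ones SmolVLM uses.

SYSTEM = "You are a helpful visual assistant."
IMG_INTRO = "<image_start>"    # hypothetical media intro token
IMG_OUTRO = "<image_end>"      # hypothetical media outro token
IMG_SLOT = "<image>"           # placeholder the processor expands into visual tokens

def build_prompt(question: str, num_frames: int) -> str:
    # Wrap every image/frame in intro/outro tokens, then append the user turn.
    media = "".join(IMG_INTRO + IMG_SLOT + IMG_OUTRO for _ in range(num_frames))
    return f"{SYSTEM}\n{media}\nUser: {question}\nAssistant:"

print(build_prompt("What happens in this clip?", num_frames=8))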
✨ Pixel shuffling magic: Aggressive pixel shuffling helped our compact VLMs "see" better: same performance with sequences 16x shorter! (Sketch after this post.)
✨ Learned positional tokens FTW: For compact models, learned positional tokens significantly outperform raw text tokens, enhancing efficiency and accuracy.
April 8, 2025 at 3:12 PM
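Here is a minimal sketch of the pixel-shuffle trick on a square grid of visual tokens. The function is illustrative rather than the exact SmolVLM code; r=4 corresponds to the 16x shorter sequences above, trading sequence length for a wider hidden dimension.

import torch

def pixel_shuffle(x: torch.Tensor, r: int = 4) -> torch.Tensor:
    # x: (batch, seq, dim) visual tokens laid out on a square grid.
    # Folds every r x r block of tokens into one token with r*r times the channels.
    b, seq, d = x.shape
    h = w = int(seq ** 0.5)
    x = x.view(b, h, w, d)
    x = x.view(b, h, w // r, d * r)              # merge r neighbors along the width
    x = x.permute(0, 2, 1, 3)
    x = x.reshape(b, w // r, h // r, d * r * r)  # merge r neighbors along the height
    x = x.permute(0, 2, 1, 3)
    return x.reshape(b, seq // (r * r), d * r * r)

tokens = torch.randn(1, 1024, 768)   # e.g. a 32x32 patch grid from the vision encoder
print(pixel_shuffle(tokens).shape)   # torch.Size([1, 64, 12288]): 16x fewer tokens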
✨ Smaller is smarter with SigLIP: Surprise! Smaller LLMs didn't benefit from the usual large SigLIP (400M). Instead, we use the 80M base SigLIP, which performs just as well at only 20% of the size!
April 8, 2025 at 3:12 PM
Here are the coolest insights from our experiments:
✨ Longer context = Big wins: Increasing the context length from 2K to 16K gave our tiny VLMs a 60% performance boost!
April 8, 2025 at 3:12 PM
SmolDocling is available today 🏗️

🔗 Model: huggingface.co/ds4sd/SmolDo...
📖 Paper: huggingface.co/papers/2503....
🤗 Space: huggingface.co/spaces/ds4sd...

Try it and let us know what you think! 💬
ds4sd/SmolDocling-256M-preview · Hugging Face
March 17, 2025 at 3:53 PM
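If you would rather try it from Python than the Space, here is a minimal sketch using the standard transformers vision-to-text classes. The prompt wording is an assumption; check the model card for the recommended one.

from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

processor = AutoProcessor.from_pretrained("ds4sd/SmolDocling-256M-preview")
model = AutoModelForVision2Seq.from_pretrained("ds4sd/SmolDocling-256M-preview")

page = Image.open("page.png")  # a scanned or rendered document page
messages = [{"role": "user",
             "content": [{"type": "image"},
                         {"type": "text", "text": "Convert this page to docling."}]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[page], return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=1024)
print(processor.batch_decode(out, skip_special_tokens=True)[0])  # DocTags output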
At only 256M parameters, SmolDocling outperforms much larger models on key document conversion tasks:
🖋️ Full-page transcription: Beats models 27× bigger!
📑 Equations: Matches or beats leading models like GOT
💻 Code recognition: We introduce the first benchmark for code OCR
March 17, 2025 at 3:53 PM
What makes it unique?
📌 Handles everything a document has: tables, charts, code, equations, lists, and more
📌 Works beyond scientific papers—supports business docs, patents, and forms
📌 It runs in less than 1GB of RAM, so large batch sizes are super cheap!
March 17, 2025 at 3:53 PM
How does SmolDocling beat models 27× bigger? SmolDocling transforms any document into structured metadata with DocTags, and it is SOTA in:

✅ Full-page conversion
✅ Layout identification
✅ Equations, tables, charts, plots, code OCR
March 17, 2025 at 3:53 PM
Me too! Highlight of my career so far :)
January 31, 2025 at 3:21 PM
And that was why we didn't release this before. It's live research code: most of it gets rewritten fairly often, and some parts have been the same for years.
It works and it produces SOTA results at 256M and 80B sizes, but it's not production code.
Go check it out:
github.com/huggingface/...
smollm/vision at main · huggingface/smollm
Everything about the SmolLM2 and SmolVLM family of models - huggingface/smollm
January 31, 2025 at 3:06 PM
And it also has a bunch of bugs, like this one in our modeling_vllama3.py file: we start from a pretrained LLM, but for some reason the weights of the LM head are not loaded from it. I still don't know why :(
January 31, 2025 at 3:06 PM
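A quick, hypothetical sanity check for this kind of bug: compare the VLM's LM head against the pretrained LLM it was initialized from (assuming both expose an lm_head module; the helper name is mine). A False right after loading would confirm the head was re-initialized instead of loaded.

import torch

def lm_head_matches(vlm, llm) -> bool:
    # True if the VLM's LM head weights equal the pretrained LLM's.
    return torch.equal(vlm.lm_head.weight.detach().cpu(),
                       llm.lm_head.weight.detach().cpu())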
The codebase is full of interesting insights like this one in our dataset.py file: How do you get reproducible randomness in different processes across different machines?
Start different random number generators based on a tuple (seed, rank)!
January 31, 2025 at 3:06 PM
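Here is a minimal sketch of that pattern; the exact mixing of seed and rank in dataset.py may differ, this just shows the idea of deriving one reproducible stream per process.

import random
import numpy as np
import torch

def per_rank_rngs(seed: int, rank: int):
    # Derive a single integer from (seed, rank), then seed one generator per library,
    # so every process gets its own stream that is identical across runs and machines.
    combined = int(np.random.SeedSequence([seed, rank]).generate_state(1)[0])
    return (
        random.Random(combined),                  # Python RNG
        np.random.default_rng([seed, rank]),      # NumPy RNG, seeded straight from the tuple
        torch.Generator().manual_seed(combined),  # PyTorch RNG
    )

py0, _, _ = per_rank_rngs(seed=42, rank=0)
py1, _, _ = per_rank_rngs(seed=42, rank=1)
print(py0.random(), py1.random())  # different per rank, identical on every rerun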
Post training, you can run the evaluation on all of these tasks by running:
sbatch vision/experiments/evaluation/vloom/async_evals_tr_346/run_evals_0_shots_val_2048.slurm
January 31, 2025 at 3:06 PM