Nathan
@saylortwift.hf.co
ML engineer at @huggingface 🤗, Evaluation, Open LLM Leaderboard and lighteval
Major props to the contributors who made this release happen 🙌
@JoelNiklaus @_lewtun @ailozovskaya @clefourrier @alvind319 HERIUN @_EldarKurtic @mariagrandury jnanliu @qubvelx

Check out the release & try it out:
🔗 github.com/huggingface...
GitHub - huggingface/lighteval: Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends
May 6, 2025 at 2:26 PM
🐛 Tons of bug fixes:

vLLM defaults
tokenizer quirks
crash-proofing around missing repos
metric alignment with published literature

Rock-solid across the board. 🪨 🪨
May 6, 2025 at 2:26 PM
✨ Bonus goodies:

Hugging Face Hub inference for LLM-as-Judge
CoT prompting in vLLM
W&B logging to track everything
May 6, 2025 at 2:26 PM
🧠 Custom Model Inference is here.

Bring any model to lighteval: your backend, your rules.
Use lighteval to benchmark it like any other supported model.

This makes evals reproducible & comparable on your backends 💥
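Roughly, plugging in your own backend looks like the sketch below. This is a minimal outline only: the base-class import path, method names and signatures, and the launch command follow my reading of the custom-model docs and may differ in your lighteval version, and `MyInferenceClient` is a hypothetical stand-in for whatever serving stack you use.

```python
# my_model.py -- sketch of wrapping a custom backend for lighteval.
# Assumptions: the import path, method names/signatures, and the
# `lighteval custom ...` launch command may differ in your version.
from lighteval.models.abstract_model import LightevalModel  # assumed path


class MyInferenceClient:
    """Hypothetical stand-in for your own serving stack (HTTP API, local runtime, ...)."""

    def generate(self, prompt: str) -> str:
        return "..."  # call your backend here

    def score(self, context: str, continuation: str) -> float:
        return 0.0  # return log P(continuation | context) from your backend


class MyBackendModel(LightevalModel):
    """Wraps the client so lighteval can benchmark it like a built-in model."""

    def __init__(self, config=None, env_config=None):
        self.client = MyInferenceClient()

    def greedy_until(self, requests, **kwargs):
        # Generative tasks: one completion per prompt.
        return [self.client.generate(req.context) for req in requests]

    def loglikelihood(self, requests, **kwargs):
        # Multiple-choice tasks: a log-probability per (context, choice) pair.
        return [self.client.score(req.context, req.choice) for req in requests]
```

You'd then point lighteval at this file when launching a run (something like `lighteval custom my-backend my_model.py "<task>"`; check the repo docs for the exact command in your version).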
May 6, 2025 at 2:26 PM
📊 New benchmarks added and others improved:

ARC-AGI 2
SimpleQA
Improved accuracy for AIME, GPQA, MATH-500
May 6, 2025 at 2:26 PM
⚡️ Lighteval = your go-to tool for lightning-fast LLM evaluation.
Built by Hugging Face's OpenEvals team, it lets you:

Run standardized benchmarks
Compare models and backends side-by-side
Scale across backends (vLLM, HF Hub, litellm, nanotron, sglang, transformers, etc.)
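For a concrete picture, here's a minimal sketch of launching a run from Python by shelling out to the CLI. The backend subcommand, the `model_name=` key, the task string, and the `--max-samples` flag are written from memory and may differ slightly between lighteval versions; the model and task shown are just examples.

```python
# Sketch: run one benchmark on one backend via the lighteval CLI.
# Flag names, the model-args key, and the task string format
# ("suite|task|num_fewshot|truncate") may vary across lighteval versions.
import subprocess

subprocess.run(
    [
        "lighteval", "vllm",                          # backend subcommand (vllm, accelerate, endpoint, ...)
        "model_name=meta-llama/Llama-3.1-8B-Instruct",  # example model args
        "lighteval|gpqa:diamond|0|0",                 # example task spec
        "--max-samples", "100",                       # optional: evaluate on a subset
    ],
    check=True,
)
```

Swap `vllm` for another backend subcommand while keeping the same task string, and you get the side-by-side backend comparison mentioned above.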
May 6, 2025 at 2:26 PM
Open models have trouble on this benchmark. Since it was made by OpenAI, though, the results of their own models should not be considered.

Check out the repo and start evaluating!
April 22, 2025 at 2:29 PM
Funnily enough, we can see that @perplexity_ai's r1-1776 does worse than @deepseek_ai's R1, even though this model is supposed to remove Chinese censorship 🤔
This might be explained by the dataset not having any questions touching on those subjects, though.

4/N
April 22, 2025 at 2:29 PM
All evaluations were run on 100 samples of the dataset.
lighteval aims to gather the benchmarks scattered across repos and make them simple to use, with hundreds of popular benchmarks and many backends available for your models.

github.com/huggingface...

3/N
April 22, 2025 at 2:29 PM
Here is an example of a question and answer from Claude 3.7 Sonnet.

2/N
April 22, 2025 at 2:29 PM
📋 IFEval (Instruction Following)

Here, LLaMA 4 Maverick performs well: 86%, below the previous generation of Llama models but above the other models tested.

6/6
April 8, 2025 at 8:53 AM
The score on AIME24 also indicates contamination, though this is to be expected for recent models. This means we should look at the AIME25 performance!

5/6
April 8, 2025 at 8:53 AM
🧮 AIME 2024/2025 (math)

This is where things fall a bit short. LLaMA 4 scores only 10–23%, while DeepSeek scores 33% and Gemma 3 27B scores 20%.

These tasks test real multi-step symbolic reasoning, which is exactly where LLaMA 4 struggles; some more fine-tuning might fix that!

4/6
April 8, 2025 at 8:53 AM
🧠 GPQA (Graduate-level reasoning)

LLaMA 4 Maverick reaches 70%, which is decent but not SOTA. DeepSeek V3, a much bigger model, outperforms it at 73%.

3/6
April 8, 2025 at 8:53 AM
Here are the details of the evaluations, where you can find all the samples for the models!

2/6

huggingface.co/spaces/Sayl...
OpenEvalsDetails - a Hugging Face Space by SaylorTwift
April 8, 2025 at 8:53 AM
The second tweet is for the link hehe, go check out the demo!

huggingface.co/spaces/your...
Try YourBench! - a Hugging Face Space by yourbench
April 3, 2025 at 9:35 AM