Nathan
@saylortwift.hf.co
ML engineer at @huggingface 🤗, Evaluation, Open LLM Leaderboard and lighteval
Major props to the contributors who made this release happen 🙌
@JoelNiklaus @_lewtun @ailozovskaya @clefourrier @alvind319 HERIUN @_EldarKurtic @mariagrandury jnanliu @qubvelx

Check out the release & try it out:
🔗 github.com/huggingface...
GitHub - huggingface/lighteval: Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends
May 6, 2025 at 2:26 PM
🐛 Tons of bug fixes:

vLLM defaults
tokenizer quirks
crash-proofing around missing repos
metric alignment with published literature

Rock-solid across the board. 🪨 🪨
May 6, 2025 at 2:26 PM
✨ Bonus goodies:

Hugging Face Hub inference for LLM-as-Judge
CoT prompting in vLLM
W&B logging to track everything
May 6, 2025 at 2:26 PM
🧠 Custom Model Inference is here.

Bring any model to lighteval: your backend, your rules.
Use lighteval to benchmark it like any other supported model.

This makes evals reproducible & comparable on your backends 💥
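Roughly, plugging in your own backend looks like the sketch below. This is a minimal outline only: the base-class import path, method names and signatures, and the launch command follow my reading of the custom-model docs and may differ in your lighteval version, and `MyInferenceClient` is a hypothetical stand-in for whatever serving stack you use.

```python
# my_model.py -- sketch of wrapping a custom backend for lighteval.
# Assumptions: the import path, method names/signatures, and the
# `lighteval custom ...` launch command may differ in your version.
from lighteval.models.abstract_model import LightevalModel  # assumed path


class MyInferenceClient:
    """Hypothetical stand-in for your own serving stack (HTTP API, local runtime, ...)."""

    def generate(self, prompt: str) -> str:
        return "..."  # call your backend here

    def score(self, context: str, continuation: str) -> float:
        return 0.0  # return log P(continuation | context) from your backend


class MyBackendModel(LightevalModel):
    """Wraps the client so lighteval can benchmark it like a built-in model."""

    def __init__(self, config=None, env_config=None):
        self.client = MyInferenceClient()

    def greedy_until(self, requests, **kwargs):
        # Generative tasks: one completion per prompt.
        return [self.client.generate(req.context) for req in requests]

    def loglikelihood(self, requests, **kwargs):
        # Multiple-choice tasks: a log-probability per (context, choice) pair.
        return [self.client.score(req.context, req.choice) for req in requests]
```

You'd then point lighteval at this file when launching a run (something like `lighteval custom my-backend my_model.py "<task>"`; check the repo docs for the exact command in your version).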
May 6, 2025 at 2:26 PM
📊 New benchmarks added and others improved:

ARC-AGI 2
SimpleQA
Improved accuracy for AIME, GPQA, MATH-500
May 6, 2025 at 2:26 PM
⚡️ Lighteval = your go-to tool for lightning-fast LLM evaluation.
Built by Hugging Face's OpenEvals team, it lets you:

Run standardized benchmarks
Compare models and backends side-by-side
Scale across backends (vLLM, HF Hub, litellm, nanotron, sglang, transformers, etc.)
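For a concrete picture, here's a minimal sketch of launching a run from Python by shelling out to the CLI. The backend subcommand, the `model_name=` key, the task string, and the `--max-samples` flag are written from memory and may differ slightly between lighteval versions; the model and task shown are just examples.

```python
# Sketch: run one benchmark on one backend via the lighteval CLI.
# Flag names, the model-args key, and the task string format
# ("suite|task|num_fewshot|truncate") may vary across lighteval versions.
import subprocess

subprocess.run(
    [
        "lighteval", "vllm",                          # backend subcommand (vllm, accelerate, endpoint, ...)
        "model_name=meta-llama/Llama-3.1-8B-Instruct",  # example model args
        "lighteval|gpqa:diamond|0|0",                 # example task spec
        "--max-samples", "100",                       # optional: evaluate on a subset
    ],
    check=True,
)
```

Swap `vllm` for another backend subcommand while keeping the same task string, and you get the side-by-side backend comparison mentioned above.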
May 6, 2025 at 2:26 PM
Open models have trouble on this benchmark. Since it was made by OpenAI, though, the results of their own models should not be considered.

Check out the repo and start evaluating!
April 22, 2025 at 2:29 PM
Funnily enough, we can see that @perplexity_ai's r1-1776 does worse than @deepseek_ai's R1, even though this model is supposed to remove Chinese censorship 🤔
This might be explained by the dataset not having any questions touching on those subjects, though.

4/N
April 22, 2025 at 2:29 PM
All evaluations were run on 100 samples of the dataset.
lighteval aims to gather the benchmarks scattered across repos and make them simple to use, with hundreds of popular benchmarks and many backends available for your models.

github.com/huggingface...

3/N
April 22, 2025 at 2:29 PM
Here is an example of a question and answer from Claude 3.7 Sonnet.

2/N
April 22, 2025 at 2:29 PM
📋 IFEval (Instruction Following)

Here, LLaMA 4 Maverick performs well: 86%, below the previous generation of Llama models but above the other models tested.

6/6
April 8, 2025 at 8:53 AM
The score on AIME24 also indicates contamination, though this is to be expected for recent models. This means we should look at the AIME25 performance!

5/6
April 8, 2025 at 8:53 AM
🧮 AIME 2024/2025 (math)

This is where things fall a bit short. LLaMA 4 scores only 10–23%, while DeepSeek scores 33% and Gemma 3 27B scores 20%.

These tasks test real multi-step symbolic reasoning, which is exactly where LLaMA 4 struggles; some more fine-tuning might fix that!

4/6
April 8, 2025 at 8:53 AM
🧠 GPQA (Graduate-level reasoning)

LLaMA 4 Maverick reaches 70%, which is decent but not SOTA. DeepSeek V3, a much bigger model, outperforms it at 73%.

3/6
April 8, 2025 at 8:53 AM
Here are the details of the evaluations, where you can find all the samples for the models!

2/6

huggingface.co/spaces/Sayl...
OpenEvalsDetails - a Hugging Face Space by SaylorTwift
April 8, 2025 at 8:53 AM
The second tweet is for the link hehe, go check out the demo!

huggingface.co/spaces/your...
Try YourBench! - a Hugging Face Space by yourbench
April 3, 2025 at 9:35 AM