@JoelNiklaus @_lewtun @ailozovskaya @clefourrier @alvind319 HERIUN @_EldarKurtic @mariagrandury jnanliu @qubvelx
Check out the release & try it out:
🔗 github.com/huggingface...
vLLM defaults
tokenizer quirks
crash-proofing around missing repos
metric alignment with published literature
Rock-solid across the board. 🪨 🪨
Hugging Face Hub inference for LLM-as-Judge
CoT prompting in vLLM
W&B logging to track everything
Bring any model to lighteval, your backend, your rules.
Use lighteval to benchmark it like any other supported model.
This makes evals reproducible & comparable on your backends 💥
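The idea, roughly: you wrap whatever backend you have in a small class that implements lighteval's model interface (generation for free-form tasks, scoring for multiple-choice), and lighteval drives it like any built-in backend. Below is a deliberately self-contained sketch of such a wrapper; every name in it (MyHTTPModel, generate, score, the endpoint) is a placeholder of my own, not lighteval's actual API, so check the custom-model docs in the repo for the real interface.

# Illustrative wrapper around a hypothetical HTTP inference server.
# None of these names come from lighteval itself; the real abstract model
# class and its method signatures are defined in the repo.
from typing import List

import requests


class MyHTTPModel:
    def __init__(self, endpoint: str):
        self.endpoint = endpoint  # hypothetical URL of your own inference server

    def generate(self, prompts: List[str], max_new_tokens: int = 256) -> List[str]:
        # Generative tasks need one text completion per prompt.
        outputs = []
        for prompt in prompts:
            resp = requests.post(
                self.endpoint,
                json={"prompt": prompt, "max_new_tokens": max_new_tokens},
                timeout=60,
            )
            outputs.append(resp.json()["text"])
        return outputs

    def score(self, contexts: List[str], continuations: List[str]) -> List[float]:
        # Multiple-choice tasks need a log-likelihood per (context, continuation) pair;
        # delegate to your backend however it exposes them.
        raise NotImplementedError

Once a wrapper like this implements the real interface, every supported task runs against it unchanged, which is what keeps results comparable across backends.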
ARC-AGI 2
SimpleQA
Improved accuracy for AIME, GPQA, MATH-500
Built by Hugging Face's OpenEvals team, it lets you:
Run standardized benchmarks
Compare models and backends side-by-side
Scale across backends (vLLM, HF Hub, litellm, nanotron, sglang, transformers, etc.)
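To make that concrete, a typical run is a single CLI call: install the package, pick a backend subcommand, then pass the model arguments and a task spec. The exact flags and the task-string format have shifted between releases, so treat the lines below as a sketch and check the README for the current syntax.

pip install lighteval
lighteval vllm \
    "model_name=meta-llama/Llama-3.1-8B-Instruct" \
    "lighteval|gsm8k|0|0"

Swapping the backend subcommand (vllm, accelerate, ...) while keeping the same task spec is how you compare backends side-by-side.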
Check out the repo and start evaluating!
Though this might be explained by the dataset not containing any questions touching on those subjects.
4/N
lighteval aims to gather the benchmarks scattered across repos and make them simple to use, with hundreds of popular benchmarks and many backends available for your models.
github.com/huggingface...
3/N
2/N
Here, LLaMA 4 Maverick performs well at 86%: below the previous generation of LLaMA models, but above the other models tested.
6/6
5/6
This is where things fall a bit short. LLaMA 4 scores only 10–23%, while DeepSeek scores 33% and Gemma 3 27B scores 20%.
These tasks test real multi-step symbolic reasoning, which is exactly where LLaMA 4 struggles; some more fine-tuning might fix that!
4/6
LLaMA 4 Maverick reaches 70%, which is decent but not SOTA. DeepSeek V3 outperforms it at 73%, though it is a much bigger model.
3/6
2/6
huggingface.co/spaces/Sayl...