Davide Paglieri
dpaglieri.bsky.social
PhD Student at UCL.
Previously AI Research Engineer at Bending Spoons
🚀 BALROG is open submission! We welcome submission of new foundation models and new agentic pipelines.

Check it out here:
github.com/balrog-ai/BA...
GitHub - balrog-ai/BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games
January 16, 2025 at 11:30 AM
This suggests that high performance on popular static benchmarks does not necessarily translate to dynamic agentic tasks, and training data contamination may also play a role.

🆕 BALROG introduces a new type of agentic benchmark designed to be robust to training data contamination.
January 16, 2025 at 11:30 AM
Interested in submitting to BALROG? Check out the instructions here!

balrogai.com/submit.html

Some big models we are looking to evaluate:

OpenAI o1
Gemini 2.0 Flash
Grok-2
Llama-3.1-405B
Pixtral-120B
Mistral-Large (123B)

If you have resources to contribute, feel free to reach out!
BALROG
BALROG: Benchmarking Agentic LLM/VLM Reasoning On Games
balrogai.com
December 12, 2024 at 11:30 AM
Llama-3.3-70B-it 🫤 -> Not as good as the 3.1-70B version on BALROG's tasks.

Claude 3.5 Haiku✨ -> A little gem, the best of the smaller closed-source models. It even gets 1.1% progression on NetHack! 🏰 Was it trained on NLE? 🤔

Mistral-Nemo-it 🆗 -> Okay for its size (12B)
December 12, 2024 at 11:30 AM
🚨 BALROG is LIVE 🚨

🔗 Website with leaderboard: balrogai.com
📰 Paper: arxiv.org/abs/2411.13543
📜 Code: github.com/balrog-ai/BA...

No more excuses about saturated or missing agentic LLM/VLM benchmarks. BALROG is here!
November 21, 2024 at 4:24 PM
The ultimate test? NetHack 🏰

This beast remains unsolved: the best model, o1-preview, achieved just 1.5% average progression. BALROG pushes boundaries, uncovering where LLMs/VLMs struggle the most. Will your model fare better? 🤔

They’re nowhere near capable enough yet!
November 21, 2024 at 4:24 PM
And the results are in!

🤖 GPT-4o leads the pack in LLM performance
👁️ Claude 3.5 Sonnet shines as the top VLM
📈 Llama models improve steadily with scale from 1B to 70B, holding their own impressively!

🧠 Curious about how your model stacks up? Submit now!
November 21, 2024 at 4:24 PM
What makes BALROG unique?

✅ Easy evaluation for LLM/VLM agents locally or via popular APIs
✅ Highly parallel, efficient setup
✅ Supports zero-shot eval & more complex strategies

It’s plug-and-play for anyone benchmarking LLMs/VLMs. 🛠️🚀
November 21, 2024 at 4:24 PM
BALROG brings together six challenging RL environments, including Crafter, BabaIsAI, and the notoriously difficult NetHack.

BALROG is designed to give meaningful signal for both weak and strong models, making it a game-changer for the wider AI community. 🕹️ #AIResearch
November 21, 2024 at 4:24 PM