Davide Paglieri
dpaglieri.bsky.social
PhD Student at UCL.
Previously AI Research Engineer at Bending Spoons
🚨This week's new entry on balrogai.com is Microsoft Phi-4 (14B model)

While Phi-4 excels on benchmarks like math competitions, BALROG reveals that Phi-4 falls short as an agent. More research on how to improve agentic performance is needed.
January 16, 2025 at 11:30 AM
🚨BALROG leaderboard update

This week's new entries on balrogai.com are:

Llama 3.3 70B Instruct 🫤
Claude 3.5 Haiku ✨
Mistral-Nemo-it (12B) 🆗

GitHub: github.com/balrog-ai/BA...
December 12, 2024 at 11:30 AM
It's great to see BALROG featured in Jack Clark's Import AI newsletter!

Check out what he had to say about it here:
jack-clark.net

And check out BALROG's leaderboard on balrogai.com
December 4, 2024 at 9:37 AM
The ultimate test? NetHack 🏰

This beast remains unsolved: the best model, o1-preview, achieved just 1.5% average progression. BALROG pushes boundaries, uncovering where LLMs/VLMs struggle the most. Will your model fare better? 🤔

They’re nowhere near capable enough yet!
November 21, 2024 at 4:24 PM
And the results are in!

🤖 GPT-4o leads the pack in LLM performance
👁️ Claude 3.5 Sonnet shines as the top VLM
📈 Llama models show a clear scaling trend from 1B to 70B, holding their own impressively!

🧠 Curious about how your model stacks up? Submit now!
November 21, 2024 at 4:24 PM
What makes BALROG unique?

✅Easy evaluation for LLM/VLM agents locally or via popular APIs
✅Highly parallel, efficient setup
✅Supports zero-shot eval & more complex strategies

It’s plug-and-play for anyone benchmarking LLMs/VLMs. 🛠️🚀
November 21, 2024 at 4:24 PM
BALROG brings together six challenging RL environments, including Crafter, Baba Is AI, and the notoriously difficult NetHack.

BALROG is designed to give meaningful signal for both weak and strong models, making it a game-changer for the wider AI community. 🕹️ #AIResearch
November 21, 2024 at 4:24 PM
Tired of saturated benchmarks? Want scope for a significant leap in capabilities?

🔥 Introducing BALROG: a Benchmark for Agentic LLM and VLM Reasoning On Games!

BALROG is a challenging benchmark for LLM agentic capabilities, designed to stay relevant for years to come.

1/🧵
November 21, 2024 at 4:24 PM