Previously AI Research Engineer at Bending Spoons
While Phi-4 excels on benchmarks like math competitions, BALROG reveals that Phi-4 falls short as an agent. More research on how to improve agentic performance is needed.
This week's new entries on balrogai.com are:
Llama 3.3 70B Instruct 🫤
Claude 3.5 Haiku ✨
Mistral-Nemo-it (12B) 🆗
GitHub: github.com/balrog-ai/BA...
Check out what he had to say about it here:
jack-clark.net
And check out BALROG's leaderboard on balrogai.com
This beast remains unsolved: the best model, o1-preview, achieved just 1.5% average progression. BALROG pushes boundaries, uncovering where LLMs/VLMs struggle the most. Will your model fare better? 🤔
They’re nowhere near capable enough yet!
🤖 GPT-4o leads the pack in LLM performance
👁️ Claude 3.5 Sonnet shines as the top VLM
📈 LLaMA models improve with scale from 1B to 70B, holding their own impressively!
🧠 Curious about how your model stacks up? Submit now!
✅ Easy evaluation for LLM/VLM agents locally or via popular APIs
✅ Highly parallel, efficient setup
✅ Supports zero-shot eval & more complex strategies
It’s plug-and-play for anyone benchmarking LLMs/VLMs. 🛠️🚀
BALROG is designed to give meaningful signal for both weak and strong models, making it a game-changer for the wider AI community. 🕹️ #AIResearch
🔥 Introducing BALROG: a Benchmark for Agentic LLM and VLM Reasoning On Games!
BALROG is a challenging benchmark for LLM agentic capabilities, designed to stay relevant for years to come.
1/🧵