Check it out here:
github.com/balrog-ai/BA...
🆕BALROG introduces a new type of agentic benchmark designed to be robust to training data contamination.
balrogai.com/submit.html
Some big models we are looking to evaluate:
OpenAI o1
Gemini 2.0 Flash
Grok-2
Llama-3.1-405B
Pixtral-120B
Mistral-Large (123B)
If you have resources to contribute, feel free to reach out!
Claude 3.5 Haiku✨ -> A little gem, the best of the smaller closed-source models. It even gets 1.1% progression on NetHack! 🏰 Was it trained on NLE? 🤔
Mistral-Nemo-it 🆗 -> Okay for its size (12B)
🔗 Website with leaderboard: balrogai.com
📰 Paper: arxiv.org/abs/2411.13543
📜 Code: github.com/balrog-ai/BA...
No more excuses about saturated benchmarks or a lack of agentic LLM/VLM benchmarks. BALROG is here!
This beast remains unsolved: the best model, o1-preview, achieved just 1.5% average progression. BALROG pushes boundaries, uncovering where LLMs/VLMs struggle the most. Will your model fare better? 🤔
They’re nowhere near capable enough yet!
🤖 GPT-4o leads the pack in LLM performance
👁️ Claude 3.5 Sonnet shines as the top VLM
📈 LLaMA models show clean scaling from 1B to 70B, holding their own impressively!
🧠 Curious about how your model stacks up? Submit now!
✅Easy evaluation for LLM/VLM agents locally or via popular APIs
✅Highly parallel, efficient setup
✅Supports zero-shot eval & more complex strategies
It’s plug-and-play for anyone benchmarking LLMs/VLMs. 🛠️🚀
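To make "plug-and-play" concrete, here's a minimal sketch of what a zero-shot agent loop looks like. To be clear, this is an illustrative sketch, not BALROG's actual interface: DummyTextEnv and the prompt are hypothetical stand-ins for the environments and agent configs the repo (github.com/balrog-ai/BA...) wires up for you.

# Illustrative sketch only (not BALROG's actual API): a zero-shot agent loop
# against a toy text environment, using the OpenAI chat completions API.
from openai import OpenAI

class DummyTextEnv:
    """Toy stand-in for a real benchmark environment (BabyAI, Crafter, NetHack, ...)."""

    def reset(self) -> str:
        self.steps = 0
        return "You are in a room. There is a closed door to the north."

    def step(self, action: str):
        self.steps += 1
        obs = f"You tried: {action!r}. The room is unchanged."
        done = self.steps >= 5  # keep the demo episode short
        return obs, 0.0, done, {}

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def act(observation: str) -> str:
    """Query the model for a single next action (zero-shot, no history)."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "You are an agent in a text game. Reply with one short action."},
            {"role": "user", "content": observation},
        ],
    )
    return resp.choices[0].message.content.strip()

env = DummyTextEnv()
obs, done = env.reset(), False
while not done:
    obs, reward, done, info = env.step(act(obs))
    print(obs)

Swapping in a real environment and adding observation history or chain-of-thought is how you'd move from zero-shot to the "more complex strategies" mentioned above.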
BALROG is designed to give meaningful signal for both weak and strong models, making it a game-changer for the wider AI community. 🕹️ #AIResearch