Check it out here:
github.com/balrog-ai/BA...
🆕BALROG introduces a new type of agentic benchmark designed to be robust to training data contamination.
balrogai.com/submit.html
Some big models we are looking to evaluate:
OpenAI o1
Gemini 2.0 Flash
Grok-2
Llama-3.1-405B
Pixtral-120B
Mistral-Large (123B)
If you have resources to contribute, feel free to reach out!
Claude 3.5 Haiku✨ -> A little gem, the best of the smaller closed-source models. It even gets 1.1% progression on NetHack! 🏰 Was it trained on NLE? 🤔
Mistral-Nemo-it 🆗 -> Okay for its size (12B)
🔗 Website with leaderboard: balrogai.com
📰 Paper: arxiv.org/abs/2411.13543
📜 Code: github.com/balrog-ai/BA...
No more excuses about saturated benchmarks or a lack of agentic LLM/VLM benchmarks. BALROG is here!
This beast remains unsolved: the best model, o1-preview, achieved just 1.5% average progression. BALROG pushes boundaries, uncovering where LLMs/VLMs struggle the most. Will your model fare better? 🤔
They’re nowhere near capable enough yet!
🤖 GPT-4o leads the pack in LLM performance
👁️ Claude 3.5 Sonnet shines as the top VLM
📈 LLaMA models show clean scaling from 1B to 70B, holding their own impressively!
🧠 Curious about how your model stacks up? Submit now!
✅Easy evaluation for LLM/VLM agents locally or via popular APIs
✅Highly parallel, efficient setup
✅Supports zero-shot eval & more complex strategies
It’s plug-and-play for anyone benchmarking LLMs/VLMs. 🛠️🚀
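To make "plug-and-play" concrete, here's a minimal sketch of what a zero-shot agent loop looks like. To be clear, this is an illustrative sketch, not BALROG's actual interface: DummyTextEnv and the prompt are hypothetical stand-ins for the environments and agent configs the repo (github.com/balrog-ai/BA...) wires up for you.

# Illustrative sketch only (not BALROG's actual API): a zero-shot agent loop
# against a toy text environment, using the OpenAI chat completions API.
from openai import OpenAI

class DummyTextEnv:
    """Toy stand-in for a real benchmark environment (BabyAI, Crafter, NetHack, ...)."""

    def reset(self) -> str:
        self.steps = 0
        return "You are in a room. There is a closed door to the north."

    def step(self, action: str):
        self.steps += 1
        obs = f"You tried: {action!r}. The room is unchanged."
        done = self.steps >= 5  # keep the demo episode short
        return obs, 0.0, done, {}

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def act(observation: str) -> str:
    """Query the model for a single next action (zero-shot, no history)."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "You are an agent in a text game. Reply with one short action."},
            {"role": "user", "content": observation},
        ],
    )
    return resp.choices[0].message.content.strip()

env = DummyTextEnv()
obs, done = env.reset(), False
while not done:
    obs, reward, done, info = env.step(act(obs))
    print(obs)

Swapping in a real environment and adding observation history or chain-of-thought is how you'd move from zero-shot to the "more complex strategies" mentioned above.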
BALROG is designed to give meaningful signal for both weak and strong models, making it a game-changer for the wider AI community. 🕹️ #AIResearch