Lightnews — Scholar-powered news

Davide Paglieri

@dpaglieri.bsky.social

PhD Student at UCL.
Previously AI Research Engineer at Bending Spoons

Davide Paglieri

@dpaglieri.bsky.social

🚨 This week's new entry on balrogai.com is Microsoft Phi-4 (14B model)

While Phi-4 excels on benchmarks like math competitions, BALROG reveals that Phi-4 falls short as an agent. More research on how to improve agentic performance is needed.

January 16, 2025 at 11:30 AM

Davide Paglieri

@dpaglieri.bsky.social

🚨BALROG leaderboard update

This week's new entries on balrogai.com are:

Llama 3.3 70B Instruct 🫤
Claude 3.5 Haiku✨
Mistral-Nemo-it (12B) 🆗

Github: github.com/balrog-ai/BA...

December 12, 2024 at 11:30 AM

Davide Paglieri

@dpaglieri.bsky.social

It's great to see BALROG featured on Jack Clark's Import AI newsletter!

Check out what he had to say about it here:
jack-clark.net

And check out BALROG's leaderboard on balrogai.com

December 4, 2024 at 9:37 AM

Davide Paglieri

@dpaglieri.bsky.social

The ultimate test? NetHack 🏰

This beast remains unsolved: the best model, o1-preview, achieved just 1.5% average progression. BALROG pushes boundaries, uncovering where LLMs/VLMs struggle the most. Will your model fare better? 🤔

They’re nowhere near capable enough yet!

November 21, 2024 at 4:24 PM

Davide Paglieri

@dpaglieri.bsky.social

And the results are in!

🤖 GPT-4o leads the pack in LLM performance
👁️ Claude 3.5 Sonnet shines as the top VLM
📈 LLaMA models show scaling laws from 1B to 70B, holding their own impressively!

🧠 Curious about how your model stacks up? Submit now!

November 21, 2024 at 4:24 PM

Davide Paglieri

@dpaglieri.bsky.social

What makes BALROG unique?

✅ Easy evaluation for LLM/VLM agents locally or via popular APIs
✅ Highly parallel, efficient setup
✅ Supports zero-shot eval & more complex strategies

It’s plug-and-play for anyone benchmarking LLMs/VLMs. 🛠️🚀

November 21, 2024 at 4:24 PM

Davide Paglieri

@dpaglieri.bsky.social

BALROG brings together 6 challenging RL environments, including Crafter, BabaIsAI and the notoriously challenging NetHack.

BALROG is designed to give meaningful signal for both weak and strong models, making it a game-changer for the wider AI community. 🕹️ #AIResearch

November 21, 2024 at 4:24 PM

Davide Paglieri

@dpaglieri.bsky.social

Tired of saturated benchmarks? Want scope for a significant leap in capabilities?

🔥 Introducing BALROG: a Benchmark for Agentic LLM and VLM Reasoning On Games!

BALROG is a challenging benchmark for LLM agentic capabilities, designed to stay relevant for years to come.

1/🧵

November 21, 2024 at 4:24 PM
