Luke Marris
@lukemarris.bsky.social
Research Engineer at Google DeepMind.
Interests in game theory, reinforcement learning, and deep learning.

Website: https://www.lukemarris.info/
Google Scholar: https://scholar.google.com/citations?user=dvTeSX4AAAAJ
[🧵9/N] And, an interactive demo is available here: siqi.fr/public/re-ev...
Re-evaluating Open-Ended Evaluation of Large Language Models
A case study using the livebench.ai leaderboard.
siqi.fr
April 22, 2025 at 3:48 PM
😅😂 Called out!
April 17, 2025 at 5:37 PM
[🧵8/N] Come see our poster on 2025/04/24 (Poster Location: Hall 3 + Hall 2B #440).
iclr.cc/virtual/2025... #IRL
April 17, 2025 at 4:12 PM
[🧵7/N] Big thanks to the team @GoogleDeepMind! Siqi Liu (@liusiqi.bsky.social), Ian Gemp (@drimgemp.bsky.social), Luke Marris, Georgios Piliouras, Nicolas Heess, Marc Lanctot (@sharky6000.bsky.social)
April 17, 2025 at 4:12 PM
[🧵6/N] In summary: Current open-ended LLM evals risk being brittle. Our game-theoretic framework w/ affinity entropy provides more robust, intuitive, and interpretable rankings, crucial for guiding real progress! 🧠 Check it out & let us know your thoughts! 🙏
arxiv.org/abs/2502.20170
April 17, 2025 at 4:12 PM
[🧵5/N] Does it work? YES! ✅On real data (arena-hard-v0.1), our method provides intuitive rankings robust to redundancy. We added 500 adversarial prompts targeting the top model – Elo rankings tanked, ours stayed stable! (See Fig 3 👇). Scales & gives interpretable insights!
April 17, 2025 at 4:12 PM
[🧵4/N] But game theory isn't magic - standard methods often yield multiple equilibria & aren't robust to redundancy. Key innovation: We introduce novel solution concepts + 'Affinity Entropy' to find unique, CLONE-INVARIANT equilibria! ✨(No more rank shifts just bc you added copies!)
April 17, 2025 at 4:12 PM
[🧵3/N] So, what's our fix? GAME THEORY! 🎲 We reframe LLM evaluation as a 3-player game: a 'King' model 👑 vs. a 'Rebel' model 😈, with a 'Prompt' player selecting tasks that best differentiate them. This shifts focus from 'average' performance to strategic interaction. #GameTheory #Evaluation
April 17, 2025 at 4:12 PM
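A rough sketch of how such a three-player payoff tensor could be laid out (all numbers are made up, and the Prompt player's "differentiation" reward below is only an assumption; the paper's exact payoffs may differ):

```python
import numpy as np

rng = np.random.default_rng(0)
n_models, n_prompts = 4, 6

# win[i, j, t]: assumed probability that model i beats model j on prompt t
# (made-up numbers standing in for real pairwise judge preferences).
win = rng.uniform(size=(n_models, n_models, n_prompts))
win = (win + (1.0 - win.transpose(1, 0, 2))) / 2.0   # enforce win[i,j,t] + win[j,i,t] = 1

# Three players jointly pick (king, rebel, prompt); sketched payoffs:
king_payoff   = win                      # King wants to beat the Rebel
rebel_payoff  = 1.0 - win                # Rebel wants to beat the King
prompt_payoff = np.abs(win - 0.5)        # assumption: the Prompt player is rewarded
                                         # for prompts that separate the two models

print(king_payoff.shape, rebel_payoff.shape, prompt_payoff.shape)
```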
[🧵2/N] Why the concern? Elo averages performance. If prompt sets are biased or redundant (intentionally or not!), rankings can be skewed. 😟 Our simulations show this can even reinforce biases, pushing models to specialize narrowly instead of improving broadly (see skill entropy drop!). 📉 #EloRating
April 17, 2025 at 4:12 PM
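A toy illustration of this redundancy problem, with made-up per-prompt win rates and a plain average standing in for Elo:

```python
import numpy as np

# Made-up win rates of two models on three prompts (vs. a fixed baseline).
prompts = ["p1", "p2", "p3"]
scores = {
    "model_A": np.array([0.9, 0.3, 0.5]),
    "model_B": np.array([0.2, 0.8, 0.6]),
}

# Average over the original prompt set: A looks better.
print({m: round(s.mean(), 3) for m, s in scores.items()})
# {'model_A': 0.567, 'model_B': 0.533}

# Duplicate p2 (a prompt B happens to be good at). Nothing new was measured,
# yet the average-based ranking flips in B's favour.
dup = {m: np.append(s, s[1]) for m, s in scores.items()}
print({m: round(s.mean(), 3) for m, s in dup.items()})
# {'model_A': 0.5, 'model_B': 0.65}
```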
[🧵13/N] It is also possible to plot each task's contribution to the deviation rating, making it easy to see the trade-offs between the models. Negative bars mean worse than equilibrium at that task. So Sonnet is relatively weaker at "summarize" and Llama is relatively weaker at "LCB generation".
February 24, 2025 at 2:00 PM
[🧵12/N] We are convinced this is a better approach than Elo or simple averaging. Please read the paper for more details! 🤓
February 18, 2025 at 10:49 AM
[🧵11/N] Our work proposes the first rating method, “Deviation Ratings”, that is both dominant- and clone-invariant in fully general N-player, general-sum interactions, allowing us to evaluate general models in a theoretically grounded way. 👏
February 18, 2025 at 10:49 AM
[🧵10/N] A three-player game, where two symmetric model players try to beat each other (by playing strong models) on a task selected by a task player incentivised to separate the models, is an improved formulation. 👍 However, Nash Averaging is only defined for two-player zero-sum games. 😭
February 18, 2025 at 10:49 AM
[🧵9/N] Unfortunately, a two-player zero-sum interaction is limiting. For example, if no model can solve a task, the task player would only play that impossible task, resulting in uninteresting ratings. 🙁
February 18, 2025 at 10:49 AM
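A tiny numerical illustration of this failure mode, with a made-up score matrix in which no model can solve the last task:

```python
import numpy as np

rng = np.random.default_rng(0)

# scores[m, t]: made-up performance of model m on task t.
# Task t3 (last column) is impossible: every model scores 0.
scores = np.array([
    [0.9, 0.4, 0.0],
    [0.5, 0.8, 0.0],
    [0.7, 0.6, 0.0],
])

# In the two-player zero-sum game, the model player maximises p @ scores @ q
# and the task player minimises it. For ANY model mixture p, the task player's
# payoff per task is p @ scores, so the impossible task is always a best
# response for the minimiser.
for _ in range(3):
    p = rng.dirichlet(np.ones(3))          # random model mixture
    per_task = p @ scores
    print(per_task.round(3), "-> task player picks t", per_task.argmin() + 1)
# The minimum is always the all-zero column: the game value is 0, and the
# ratings carry no information about how models differ on solvable tasks.
```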
[🧵8/N] This is hugely powerful for two reasons. 1) When including tasks in the evaluation set, one can be maximally inclusive: redundancies are axiomatically ignored, which simplifies curation for evaluation. 2) Salient strategies are automatically reweighted according to their significance. 💪
February 18, 2025 at 10:49 AM
[🧵7/N] This approach is provably clone- and dominant-invariant: adding copies of tasks and models, or adding dominated tasks and models, does not influence the rating *at all*. The rating is invariant to two types of redundancies! 🤩 Notably, neither an average nor Elo have these properties.
February 18, 2025 at 10:49 AM
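A quick numerical check of the clone-invariance claim, using the two-player zero-sum model-vs-task game described in 🧵6/N below, made-up scores, and a plain LP solve (this is only the basic minimax computation, not the exact max-entropy selection used in Nash Averaging):

```python
import numpy as np
from scipy.optimize import linprog

def task_equilibrium(scores):
    """Task player's side of the two-player zero-sum game:
    min over task distributions q of max_i (scores @ q)_i."""
    n_models, n_tasks = scores.shape
    # variables: [q_1, ..., q_n, u]; minimise u
    c = np.zeros(n_tasks + 1); c[-1] = 1.0
    A_ub = np.hstack([scores, -np.ones((n_models, 1))])   # scores @ q <= u
    b_ub = np.zeros(n_models)
    A_eq = np.zeros((1, n_tasks + 1)); A_eq[0, :n_tasks] = 1.0
    b_eq = np.array([1.0])
    bounds = [(0, None)] * n_tasks + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    q = res.x[:n_tasks]
    return q, scores @ q          # equilibrium task mix, per-model ratings

# Made-up model x task score matrix (model 3 is dominated by model 2).
A = np.array([
    [0.9, 0.2, 0.6],
    [0.3, 0.8, 0.5],
    [0.2, 0.3, 0.4],
])
A_cloned = np.hstack([A, A[:, [1]]])    # add an exact copy of task 2

for name, M in [("original", A), ("with cloned task", A_cloned)]:
    q, ratings = task_equilibrium(M)
    print(name, "| average:", M.mean(axis=1).round(3),
          "| equilibrium rating:", ratings.round(3))
# The simple average flips the ranking of models 1 and 2 once task 2 is
# duplicated; the equilibrium-based ratings are identical in both cases.
```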
[🧵6/N] A previous approach, called Nash Averaging (arxiv.org/abs/1806.02643), formulated the problem as a two-player zero-sum game where a model player maximizes performance on tasks by playing strong models and a task player minimises performance by selecting difficult tasks. ♟️
Re-evaluating Evaluation
Progress in machine learning is measured by careful evaluation on problems of outstanding common interest. However, the proliferation of benchmark suites and environments, adversarial attacks, and oth...
arxiv.org
February 18, 2025 at 10:49 AM
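For concreteness, a rough sketch of that two-player zero-sum formulation, with S the model-by-task score matrix and Δ the probability simplex (details such as score standardisation are omitted here):

```latex
% Model player mixes over models (rows), task player mixes over tasks (columns):
\max_{p \in \Delta_m} \; \min_{q \in \Delta_n} \; p^{\top} S q
% Model i is then rated by (S q^{*})_i, its expected score against the
% equilibrium task distribution q^{*}; Nash Averaging picks the
% maximum-entropy equilibrium to make this unique.
```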
[🧵5/N] Therefore, there is a strategic decision about which tasks are important and which model is best. Where there is a strategic interaction, it can be modeled as a game! Model players select models, and task players select tasks. The players may play distributions to avoid being exploited.
February 18, 2025 at 10:49 AM
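A toy illustration of that last point, with made-up scores for two specialist models: any pure task choice can be exploited by the matching specialist, while a mixture cannot.

```python
import numpy as np

# Made-up scores: each model is a specialist on one of the two tasks.
#            task_1  task_2
scores = np.array([
    [0.9,    0.1],   # model_A: great at task_1 only
    [0.1,    0.9],   # model_B: great at task_2 only
])

def best_exploit(task_mix):
    """Best score a model player can achieve against a fixed task distribution."""
    return (scores @ task_mix).max()

# A pure task choice is fully exploitable by the matching specialist ...
print(best_exploit(np.array([1.0, 0.0])))   # 0.9
print(best_exploit(np.array([0.0, 1.0])))   # 0.9
# ... whereas the 50/50 mixture caps the best response at 0.5, pushing models
# to do well on both tasks rather than over-fitting to a single one.
print(best_exploit(np.array([0.5, 0.5])))   # 0.5
```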