Lightnews — Scholar-powered news

Siqi Liu (刘思奇)

@liusiqi.bsky.social

74 followers 150 following 3 posts

Staff Research Engineer @ DeepMind

Siqi Liu (刘思奇)

@liusiqi.bsky.social

We have got exciting (and unconventional) stuff cooking and we are hiring for a strong research engineer on the GDM Game Theory team in London.

Consider applying if you are interested in the intersection of game theory, multiagent systems and LLMs!
job-boards.greenhouse.io/deepmind/job...

Research Engineer, Game Theory & Multi-Agent Systems

London, UK

job-boards.greenhouse.io

September 27, 2025 at 9:16 AM

Siqi Liu (刘思奇)

@liusiqi.bsky.social

Frontier models are often compared on crowdsourced user prompts, but those prompts can be low-quality, biased and redundant, making "performance on average" hard to trust.

Come find us at #ICLR2025 to discuss game-theoretic evaluation (shorturl.at/0QtBj)! See you in Singapore!

Re-evaluating Open-Ended Evaluation of Large Language Models

A case study using the livebench.ai leaderboard.

shorturl.at

April 18, 2025 at 4:34 PM

Reposted by Siqi Liu (刘思奇)

Luke Marris

@lukemarris.bsky.social

[🧵1/N] Thrilled to share our work "Re-evaluating Open-Ended Evaluation of Large Language Models"! 🚀 Popular LLM leaderboards (think Elo/Chatbot Arena) are useful, but are they telling the whole story? We find issues w/ redundancy & bias. 🤔
Paper @ ICLR 2025: arxiv.org/abs/2502.20170 #LLM #ICLR2025

April 17, 2025 at 4:12 PM

Reposted by Siqi Liu (刘思奇)

Jeff Dean

@jeffdean.bsky.social

🥁Introducing Gemini 2.5, our most intelligent model with impressive capabilities in advanced reasoning and coding.

Now integrating thinking capabilities, 2.5 Pro Experimental is our most performant Gemini model yet. It’s #1 on the LM Arena leaderboard. 🥇

March 25, 2025 at 5:25 PM

Reposted by Siqi Liu (刘思奇)

Luke Marris

@lukemarris.bsky.social

[🧵1/N] Please check out our new paper (arxiv.org/abs/2502.11645) on game-theoretic evaluation. It is the first method that results in clone-invariant ratings in N-player, general-sum interactions. Co-authors: @liusiqi.bsky.social, Ian Gemp, Georgios Piliouras, @sharky6000.bsky.social 🎉

Deviation Ratings: A General, Clone-Invariant Rating Method

Many real-world multi-agent or multi-task evaluation scenarios can be naturally modelled as normal-form games due to inherent strategic (adversarial, cooperative, and mixed motive) interactions. These...

arxiv.org

February 18, 2025 at 10:49 AM
