Siqi Liu (刘思奇)
liusiqi.bsky.social
Staff Research Engineer @ DeepMind
We have got exciting (and unconventional) stuff cooking, and we are hiring a strong research engineer on the GDM Game Theory team in London.

Consider applying if you are interested in the intersection of game theory, multiagent systems and LLMs!
job-boards.greenhouse.io/deepmind/job...
Research Engineer, Game Theory & Multi-Agent Systems
London, UK
job-boards.greenhouse.io
September 27, 2025 at 9:16 AM
Frontier models are often compared on crowdsourced user prompts - user prompts can be low-quality, biased and redundant, making "performance on average" hard to trust.

Come find us at #ICLR2025 to discuss game-theoretic evaluation (shorturl.at/0QtBj)! See you in Singapore!
Re-evaluating Open-Ended Evaluation of Large Language Models
A case study using the livebench.ai leaderboard.
shorturl.at
April 18, 2025 at 4:34 PM
Reposted by Siqi Liu (刘思奇)
[🧵1/N] Thrilled to share our work "Re-evaluating Open-Ended Evaluation of Large Language Models"! 🚀 Popular LLM leaderboards (think Elo/Chatbot Arena) are useful, but are they telling the whole story? We find issues w/ redundancy & bias. 🤔
Paper @ ICLR 2025: arxiv.org/abs/2502.20170 #LLM #ICLR2025
April 17, 2025 at 4:12 PM
Reposted by Siqi Liu (刘思奇)
🥁Introducing Gemini 2.5, our most intelligent model with impressive capabilities in advanced reasoning and coding.

Now integrating thinking capabilities, 2.5 Pro Experimental is our most performant Gemini model yet. It’s #1 on the LM Arena leaderboard. 🥇
March 25, 2025 at 5:25 PM
Reposted by Siqi Liu (刘思奇)
[🧵1/N] Please check out our new paper (arxiv.org/abs/2502.11645) on game-theoretic evaluation. It is the first method that results in clone-invariant ratings in N-player, general-sum interactions. Co-authors: @liusiqi.bsky.social, Ian Gemp, Georgios Piliouras, @sharky6000.bsky.social 🎉
Deviation Ratings: A General, Clone-Invariant Rating Method
Many real-world multi-agent or multi-task evaluation scenarios can be naturally modelled as normal-form games due to inherent strategic (adversarial, cooperative, and mixed motive) interactions. These...
arxiv.org
February 18, 2025 at 10:49 AM