Nick Tomlin
nickatomlin.bsky.social
Incoming assistant professor at TTIC, current faculty fellow at NYU, and former PhD student at Berkeley. Natural language processing. He/him.

🌐 nickatomlin.github.io
This is a difficult benchmark: the best non-reasoning LLMs score around 9%, while the best reasoning models score around 36%. In the future, as models get stronger, we anticipate that they'll also be able to generate harder games.
May 13, 2025 at 9:30 PM
We use o1 to generate natural language rulebooks for 1000 two-player games and then implement these games as Gym environments. For each game, we train baseline agents via self-play RL and then evaluate whether LLMs can beat the RL baselines.
May 13, 2025 at 9:30 PM
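To make the pipeline concrete, here is a minimal sketch of the kind of two-player Gym-style environment and evaluation loop involved. Everything here is illustrative: matching pennies stands in for the 1000 generated games, and the "RL baseline" and "LLM agent" policies are hypothetical stubs, not the paper's actual agents or interfaces.

```python
import random


class MatchingPenniesEnv:
    """Toy two-player Gym-style environment (a stand-in for the
    benchmark's generated games, which are far richer)."""

    def __init__(self, max_rounds=10):
        self.max_rounds = max_rounds

    def reset(self):
        self.round = 0
        self.scores = [0, 0]
        return self._obs()

    def _obs(self):
        return {"round": self.round, "scores": tuple(self.scores)}

    def step(self, actions):
        # actions: (player0_move, player1_move), each 0 or 1.
        a0, a1 = actions
        if a0 == a1:
            self.scores[0] += 1  # matcher (player 0) wins on equal moves
            rewards = (1, -1)
        else:
            self.scores[1] += 1  # mismatcher (player 1) wins otherwise
            rewards = (-1, 1)
        self.round += 1
        done = self.round >= self.max_rounds
        return self._obs(), rewards, done, {}


def play_match(env, policy0, policy1, seed=0):
    """Run one episode and return the final (player0, player1) scores."""
    random.seed(seed)
    obs = env.reset()
    done = False
    while not done:
        obs, _, done, _ = env.step((policy0(obs), policy1(obs)))
    return obs["scores"]


# Hypothetical policies: a random stub standing in for the RL baseline,
# and a fixed-action stub standing in for the LLM agent.
rl_baseline = lambda obs: random.randint(0, 1)
llm_agent = lambda obs: 0

scores = play_match(MatchingPenniesEnv(), llm_agent, rl_baseline)
print(scores)
```

The benchmark's headline metric is analogous to the comparison in the last line: across many such games, how often does the LLM agent outscore the self-play RL baseline?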
I'm particularly fond of this new benchmark paper we wrote, which aims to scalably evaluate whether language models can generalize to arbitrary new tasks. The core idea is to use LLMs to generate new games, and then evaluate whether LLMs can play those games.

📄: arxiv.org/abs/2505.07215
May 13, 2025 at 9:30 PM