Nick Tomlin
nickatomlin.bsky.social
Incoming assistant professor at TTIC, current faculty fellow at NYU, and former PhD student at Berkeley. Natural language processing. He/him.

🌐 nickatomlin.github.io
This is a difficult benchmark: the best non-reasoning LLMs score around 9%, while the best reasoning models score around 36%. In the future, as models get stronger, we anticipate that they'll also be able to generate harder games.
May 13, 2025 at 9:30 PM
We use o1 to generate natural language rulebooks for 1000 two-player games and then implement these games as Gym environments. For each game, we train baseline agents via self-play RL and then evaluate whether LLMs can beat the RL baselines.
May 13, 2025 at 9:30 PM
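To make the pipeline concrete, here is a minimal sketch of the kind of two-player Gym-style environment and evaluation loop involved. Everything here is illustrative: matching pennies stands in for the 1000 generated games, and the "RL baseline" and "LLM agent" policies are hypothetical stubs, not the paper's actual agents or interfaces.

```python
import random


class MatchingPenniesEnv:
    """Toy two-player Gym-style environment (a stand-in for the
    benchmark's generated games, which are far richer)."""

    def __init__(self, max_rounds=10):
        self.max_rounds = max_rounds

    def reset(self):
        self.round = 0
        self.scores = [0, 0]
        return self._obs()

    def _obs(self):
        return {"round": self.round, "scores": tuple(self.scores)}

    def step(self, actions):
        # actions: (player0_move, player1_move), each 0 or 1.
        a0, a1 = actions
        if a0 == a1:
            self.scores[0] += 1  # matcher (player 0) wins on equal moves
            rewards = (1, -1)
        else:
            self.scores[1] += 1  # mismatcher (player 1) wins otherwise
            rewards = (-1, 1)
        self.round += 1
        done = self.round >= self.max_rounds
        return self._obs(), rewards, done, {}


def play_match(env, policy0, policy1, seed=0):
    """Run one episode and return the final (player0, player1) scores."""
    random.seed(seed)
    obs = env.reset()
    done = False
    while not done:
        obs, _, done, _ = env.step((policy0(obs), policy1(obs)))
    return obs["scores"]


# Hypothetical policies: a random stub standing in for the RL baseline,
# and a fixed-action stub standing in for the LLM agent.
rl_baseline = lambda obs: random.randint(0, 1)
llm_agent = lambda obs: 0

scores = play_match(MatchingPenniesEnv(), llm_agent, rl_baseline)
print(scores)
```

The benchmark's headline metric is analogous to the comparison in the last line: across many such games, how often does the LLM agent outscore the self-play RL baseline?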
I'm particularly fond of this new benchmark paper we wrote, which aims to scalably evaluate whether language models can generalize to arbitrary new tasks. The core idea is to use LLMs to generate new games, and then evaluate whether LLMs can play those games.

📄: arxiv.org/abs/2505.07215
May 13, 2025 at 9:30 PM