Seth Karten
sethkarten.ai
Autonomous Agents | PhD @ Princeton | Prev: CMU, Waymo | NSF GRFP Fellow
In the NeurIPS PokeAgent Challenge, we stress-test 4 ranking systems across 100k+ agent matches:
- Bradley-Terry (batch MLE, our ground truth)
- Elo (online, chess-standard)
- Glicko-1 (online, uncertainty-aware)
- GXE (Glicko-derived win %)

(2/5)
October 20, 2025 at 3:50 AM
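For readers unfamiliar with the online baselines above, here is a minimal sketch of the standard Elo update (not the challenge's actual scoring code; `K = 32` and the 400-point logistic scale are conventional defaults assumed here):

```python
def elo_expected(r_a: float, r_b: float) -> float:
    """Expected score of player A against player B on the standard 400-point scale."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """Return updated (r_a, r_b) after one match; score_a is 1 for a win, 0.5 draw, 0 loss."""
    e_a = elo_expected(r_a, r_b)
    delta = k * (score_a - e_a)
    return r_a + delta, r_b - delta  # zero-sum: the loser gives up what the winner gains
```

Glicko-1 extends this by tracking a rating deviation per player, so updates are larger for uncertain ratings; GXE then converts the Glicko estimate into an expected win percentage.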
A benchmark environment is nothing without data, so you can pretrain before you RL.

Announcing our replay archive preview: We are releasing an additional 25k games to help you train a metagame exploiter (5 million more will be released after the qualifier)

replays.pokeagentshowdown.com:8443/
(3/3)
October 15, 2025 at 5:50 PM
Pokemon is truly the Pareto frontier of agent research
- The RPG requires an autonomous embodied agent with perception, planning, memory, and control
- VGC and Gen 9 OU penalize erroneous actions with fast-paced opponent-modeling in short games
(1/3)
October 15, 2025 at 5:50 PM
If you aren't paying attention, we are in a rapidly shifting period of ML paper culture. ICLR/ICML/NeurIPS are being treated as random, out-of-touch processes that demand more and more unnecessary work to submit
Most people are saying TMLR is the only good alternative, but are skeptical
September 24, 2025 at 2:31 PM
🚨 Hackathon Weekend! 🚨

Jumpstart your PokéAgent Challenge submission ahead of NeurIPS!

📅 Sept 13–14
✅ Leaderboards reset Sat 10AM EDT
🎙️ Lightning talks in LLMs, RL, and Pokemon
💬 Live Office hours
🏆 $2k in prizes
September 2, 2025 at 1:44 PM
The solution would generalize to any other two-player, partially observable, turn-based text game. The most bespoke components are the tools, but recent work shows that these tools can be made into modular LLM calls, further increasing generality
August 20, 2025 at 5:47 PM
August 12, 2025 at 3:53 PM
Papers are dead. Maybe it is time to start the youtube channel…
August 12, 2025 at 6:43 AM
Democratic alignment: in a special case, periodic citizen voting can fire the planner. Leader turnover keeps welfare high and prevents policy drift—central nudging plus decentralized oversight in one sandbox.
July 23, 2025 at 5:30 PM
Centralized nudging: the planner’s marginal taxes beat U.S. statutory rates and approach Saez on aggregate welfare (almost double vs baseline).
July 23, 2025 at 5:30 PM
Synthetic behavioral policies → we sample workers from 2023 ACS skills & demographics, then let each agent verify its own bounded rational utility from individualized preferences, enabling counterfactual reasoning.
July 23, 2025 at 5:30 PM
🚀 New preprint!
🤔 Can one agent “nudge” a synthetic civilization of Census‑grounded agents toward higher social welfare—all by optimizing utilities in‑context? Meet the LLM Economist ↓
July 23, 2025 at 5:30 PM
OpenReview doesn't seem to be public yet, but here are the titles
July 18, 2025 at 8:05 PM
🚀 Launch day! The NeurIPS 2025 PokéAgent Challenge is live. @neuripsconf.bsky.social
Two tracks:
① Showdown Battling – imperfect-info, turn-based strategy
② Pokemon Emerald Speedrunning – long horizon RPG planning
5M labeled replays • starter kit • baselines.
Bring your LLM, RL, or hybrid agent!
July 14, 2025 at 4:33 PM
🚀 5 days until my ICML spotlight poster!
Key insights we’ll unpack:
• Base LLM + test-time planning
• Game-theoretic scaffolding
• Context-engineered opponent prediction
• Comparative LLM-as-judge (relative > absolute)

Catch me Thu Jul 17, 4:30-7 PM PT👇
July 12, 2025 at 6:12 PM
Social media takeoff is hard. Bluesky still lacks the capability to compete with Twitter
June 4, 2025 at 5:43 PM
Excited to announce that I will be spending the summer at @Waymo on the simulation realism team! I’ll be working on learning to generate simulated worlds.
🚙🚙🚙
Send me a message if you're in the Bay Area and want to chat!
May 30, 2025 at 4:42 PM
Excited to share that the PokeAgent challenge was accepted as a NeurIPS competition!

This should serve as an excellent benchmark for competitive games AND ‘speedrunning’ the RPG. I hope to see both the RL and LLM agent communities working together here to eval agents in Pokemon

More info soon👀
May 26, 2025 at 7:55 PM
What happens to TRI though? I thought they had an AV division. Also, Toyotas aren't EVs, so I am confused how the driving tech stack would work. I think Waymo is just diversifying their risk into personal vehicles
April 29, 2025 at 11:59 PM
Insane new study from Zurich on the influence of LLMs for persuasion on the r/ChangeMyView subreddit. Let's just say people are outraged...
Is the study justified since bots are already rampant on reddit?
Or does this cross ethical lines?
April 28, 2025 at 7:00 PM
Stay tuned for our upcoming dataset release (3M+ ranked human games)!
March 7, 2025 at 3:47 PM
How does PokéChamp perform in competitive Pokémon battles?
Our agent, powered by GPT-4, achieves a win rate of 84% against Abyssal (best rule-based bot) and a local Elo of 1268, outperforming all baselines, including other LLM-based agents and traditional methods!
March 7, 2025 at 3:47 PM
PokéChamp uses an LLM to decide to use one-step lookahead tools for domain-specific calculations (e.g., damage calcs) or a small-scale minimax search enhanced by action sampling, opponent modeling, and value function estimation.
Result: Expert-level play at human speed!
March 7, 2025 at 3:47 PM
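The small-scale minimax search above can be sketched as depth-limited minimax with the LLM components stubbed out. This is a hedged illustration, not the released PokéChamp code: `sample_actions`, `predict_opponent`, `value`, and `step` are hypothetical placeholders for the LLM action sampler, opponent model, value estimator, and game transition.

```python
def minimax_value(state, depth, sample_actions, predict_opponent, value, step):
    """Depth-limited minimax over LLM-sampled candidate actions."""
    if depth == 0:
        return value(state)                  # LLM-estimated value at the leaf
    best = float("-inf")
    for a in sample_actions(state):          # LLM proposes a few candidate moves
        worst = float("inf")
        for o in predict_opponent(state):    # LLM models likely opponent replies
            v = minimax_value(step(state, a, o), depth - 1,
                              sample_actions, predict_opponent, value, step)
            worst = min(worst, v)            # opponent minimizes our value
        best = max(best, worst)              # we maximize the worst case
    return best
```

Sampling only a handful of plausible actions and replies keeps the tree tiny, which is what makes expert-level play at human speed feasible.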
Why Pokémon? It's the perfect testbed for LLM agents: constantly changing ruleset, partial observability, and rich strategic depth.
PokéChamp leverages an LLM for action sampling, opponent modeling and value estimation, with no domain-specific training required
March 7, 2025 at 3:47 PM
Can a Large Language Model (LLM) with zero Pokémon-specific training achieve expert-level performance in competitive Pokémon battles?
Introducing PokéChamp, our minimax LLM agent that reaches top 30%-10% human-level Elo on Pokémon Showdown!
New paper on arXiv and code on github!
March 7, 2025 at 3:47 PM