Lightnews — Scholar-powered news

Bo Liu (Benjamin Liu)

@benjamin-eecs.bsky.social

Reinforcement Learning PhD @NUSingapore | Undergrad @PKU1898 | Building autonomous decision making systems | Ex intern @MSFTResearch @deepseek_ai | DeepSeek-V2, DeepSeek-VL, DeepSeek-Prover

Bo Liu (Benjamin Liu)

@benjamin-eecs.bsky.social

New paradigm: instead of curating problems, create environments where models discover reasoning through competition.
Self-play = autonomous improvement without human supervision. Simple games improve general reasoning!

July 1, 2025 at 8:11 PM

Bo Liu (Benjamin Liu)

@benjamin-eecs.bsky.social

We developed Role-conditioned Advantage Estimation (RAE) to stabilize training.
Without RAE: "thinking collapse" - responses crash 3500→0 chars, math drops 66%
RAE keeps reasoning alive!

July 1, 2025 at 8:11 PM
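
A rough sketch of what a role-conditioned baseline could look like, for readers who want the idea in code. It is not the authors' implementation: the class name, the `decay` value, the exponential-moving-average update, and the choice to update the baseline before subtracting are all assumptions made for illustration. The only idea taken from the post is that each role gets its own baseline, so one role's systematic edge in a game is not mistaken for a learning signal.

```python
from collections import defaultdict


class RoleConditionedBaseline:
    """Hypothetical helper: one running baseline per (game, role) pair."""

    def __init__(self, decay=0.95):
        # decay is an assumed smoothing factor, not a value from the post.
        self.decay = decay
        self.baselines = defaultdict(float)  # every baseline starts at 0.0

    def advantage(self, game, role, episode_return):
        """Update the (game, role) baseline, then return return - baseline."""
        key = (game, role)
        updated = (self.decay * self.baselines[key]
                   + (1.0 - self.decay) * episode_return)
        self.baselines[key] = updated
        return episode_return - updated


if __name__ == "__main__":
    rae = RoleConditionedBaseline(decay=0.5)
    # The two roles in the same game are tracked independently.
    print(rae.advantage("kuhn_poker", role=0, episode_return=1.0))
    print(rae.advantage("kuhn_poker", role=1, episode_return=-1.0))
```

With a single shared baseline, the player who moves first in an asymmetric game would see a persistent nonzero advantage regardless of how well it played; keying the baseline on role removes that offset.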

Bo Liu (Benjamin Liu)

@benjamin-eecs.bsky.social

Multi-game magic:
Single game: ~41% reasoning average
Multi-game: 42.7% - skills synergize!
Even strong models improve:
DeepSeek-R1-Distill-Qwen-7B jumps 59.7%→61.7%. AIME'25 +10 points! 📈

July 1, 2025 at 8:11 PM

Bo Liu (Benjamin Liu)

@benjamin-eecs.bsky.social

Different games → different skills:
TicTacToe → spatial (56% on Snake)
Kuhn Poker → probabilistic (91.7% on Pig Dice!)
Simple Negotiation → strategic (55.8% on Truth & Deception)
Each game develops distinct abilities!

July 1, 2025 at 8:11 PM

Bo Liu (Benjamin Liu)

@benjamin-eecs.bsky.social

Why self-play? We compared approaches:
Self-play: 39.7% math, 47.8% general reasoning
Fixed opponents: Much worse
Random: Complete collapse
Key: as you improve, so does your opponent. Fixed opponents become too easy.

July 1, 2025 at 8:11 PM

Bo Liu (Benjamin Liu)

@benjamin-eecs.bsky.social

To understand poker→math transfer, we found 3 patterns:
📊 Expected Value Calculation
🔍 Case-by-Case Analysis
🎯 Pattern Recognition
These patterns from games transfer to math benchmarks. Games teach generalizable thinking!

July 1, 2025 at 8:11 PM
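
To make the first pattern concrete, here is a small worked example of an expected value calculation for a call-or-fold decision. The numbers and the function name `call_ev` are made up for illustration and do not come from the paper; the point is only to show the kind of probability-weighted reasoning that a poker decision and a math word problem can share.

```python
from fractions import Fraction


def call_ev(p_win, pot, call_cost):
    """Expected value of calling: win the pot with probability p_win,
    otherwise lose the amount paid to call."""
    return p_win * pot - (1 - p_win) * call_cost


if __name__ == "__main__":
    # Illustrative numbers: a 1-in-3 chance to win a pot of 3 chips
    # when calling costs 1 chip. Folding has an expected value of 0.
    ev = call_ev(Fraction(1, 3), pot=3, call_cost=1)
    print("call" if ev > 0 else "fold", ev)
```

Using `Fraction` keeps the arithmetic exact, so the comparison against the fold value of 0 is not affected by floating-point rounding.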

Bo Liu (Benjamin Liu)

@benjamin-eecs.bsky.social

We're excited about self-play unlocking continuously improving agents. RL selects CoT patterns from LLMs. Games=perfect testing grounds.
SPIRAL: models learn via self-competition. Kuhn Poker → +8.7% math, +18.1 Minerva Math! 🃏
Paper: huggingface.co/papers/2506....
Code: github.com/spiral-rl/spiral

July 1, 2025 at 8:11 PM
