Abstract representations + reinforcement learning.
2025-01-22 — DeepSeek R1 — arxiv.org/abs/2501.12948
2025-01-22 — Kimi 1.5 — arxiv.org/abs/2501.12599
2025-03-31 — Open-Reasoner-Zero — arxiv.org/abs/2503.24290
2025-04-10 — Seed 1.5-Thinking — arxiv.org/abs/2504.13914
...
The 200 Elo-point gap between recent models and a two-year-old model means a human rater has a ~76% chance of preferring the recent model's answer.
Based on available data, all indicators of AI progress (LLMs in particular) remain strong.
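That "~76%" follows directly from the standard logistic Elo formula, which can be checked in a couple of lines:

```python
# Win probability implied by an Elo rating gap, using the standard
# logistic Elo formula: P(win) = 1 / (1 + 10^(-diff / 400)).
def elo_win_probability(rating_diff: float) -> float:
    return 1.0 / (1.0 + 10.0 ** (-rating_diff / 400.0))

print(round(elo_win_probability(200), 3))  # 0.76: a 200-point gap -> ~76% preference rate
```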
Our Hadamax (Hadamard max-pooling) encoder architecture improves the recent PQN algorithm’s Atari performance by 80%, allowing it to significantly surpass Rainbow-DQN!
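To give a rough intuition for the name: a "Hadamard max-pooling" block fuses two parallel branches with an element-wise (Hadamard) product and then max-pools the result. The sketch below is an illustrative NumPy toy, not the paper's exact encoder; the shapes, GELU activation, and 1D pooling are all assumptions.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU (assumed activation, for illustration only)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def hadamard_maxpool_block(x, w1, w2, pool=2):
    # Two parallel projections, fused by an element-wise (Hadamard)
    # product, then max-pooled along the first axis; a hypothetical
    # sketch of the idea, not the Hadamax architecture itself.
    a = gelu(x @ w1)
    b = gelu(x @ w2)
    h = a * b                       # Hadamard product of the two branches
    t, d = h.shape
    return h[: t - t % pool].reshape(t // pool, pool, d).max(axis=1)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))
out = hadamard_maxpool_block(x, rng.normal(size=(16, 32)), rng.normal(size=(16, 32)))
print(out.shape)  # (4, 32)
```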
If you are new to reinforcement learning, this article has a generous intro section (PPO, GRPO, etc.).
I also cover 15 recent articles focused on RL & reasoning.
🔗 magazine.sebastianraschka.com/p/the-state-...
- pre-training with next-token prediction creates local minima in reasoning that we can't escape => pre-training should also be done with RL
- long context windows lead to exploitation of spurious correlations
- disentangle reasoning and knowledge
Our group is looking to hire within this program, ideally to work on topics related to RL theory. If you're interested, please DM or email me.
(retweets appreciated!)
ramonllull-aira.eu/application
The core of the approach is reinforcement learning from verifiable rewards. No PRMs / MCTS. R1-zero doesn't even use SFT to start.
For me, there are three big stories: itcanthink.substack.com/p/2024-robot...
We trained a single model to play four games, and performance in each improves with both "external search" (MCTS using a learned world model) and "internal search," where the model outputs the whole plan on its own!
A reminder that the call for workshops is out: rldm.org/call-for-wor...
The workshops are one of my favourite parts of the conference :) please get in touch if you have any questions!