NotSergeyLevine
@notsergeylevine.bsky.social
Bringing the Sergey posts until he does it himself.

Robotics. Reinforcement learning. AI.
With FAST, we can train dexterous generalist policies via simple next token prediction, and get a 5x training speed-up over prior state of the art!
January 24, 2025 at 11:35 PM
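For intuition, here is a minimal sketch of what "train via simple next-token prediction" can look like once action chunks have been discretized into tokens. A FAST-style tokenizer is assumed to exist and is abstracted away; the model interface, names, and shapes are illustrative, not the official implementation.

```python
# A minimal sketch, assuming a decoder-only policy whose forward pass returns
# per-position logits and a tokenizer that has already turned action chunks
# into discrete tokens; not the official FAST code.
import torch
import torch.nn.functional as F

def action_next_token_loss(model, obs_tokens, action_tokens):
    """Cross-entropy next-token prediction over tokenized action chunks.

    obs_tokens:    (batch, n_obs)  prompt tokens (images/language already tokenized)
    action_tokens: (batch, n_act)  discrete action tokens from a FAST-style tokenizer
    """
    # Condition on the observation plus all but the last action token.
    inputs = torch.cat([obs_tokens, action_tokens[:, :-1]], dim=1)
    logits = model(inputs)  # (batch, n_obs + n_act - 1, vocab)
    # Positions n_obs-1 onward predict the n_act action tokens.
    action_logits = logits[:, obs_tokens.shape[1] - 1:, :]
    return F.cross_entropy(
        action_logits.reshape(-1, action_logits.shape[-1]),
        action_tokens.reshape(-1),
    )
```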
How do we train vision-language-action (VLA) models with RL data? Distilling specialized RL policies into a generalist VLA (e.g., OpenVLA) works wonders for training VLAs to be fast & precise. In new work led by @CharlesXu0124, we present RLDG, which trains VLAs with RL data🧵👇
December 13, 2024 at 4:37 PM
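A hedged sketch of the distillation recipe the post describes: roll out a specialist RL policy, keep the successful trajectories, and fine-tune the generalist VLA on them with plain supervised learning. All function names here (rl_policy.act, vla.bc_loss, the success flag) are illustrative placeholders, not the RLDG codebase.

```python
# Sketch of distilling a specialist RL policy into a generalist VLA via
# supervised fine-tuning on its successful rollouts.
def collect_rl_rollouts(rl_policy, env, num_episodes):
    data = []
    for _ in range(num_episodes):
        obs, done, traj, info = env.reset(), False, [], {}
        while not done:
            action = rl_policy.act(obs)
            next_obs, reward, done, info = env.step(action)
            traj.append((obs, action))
            obs = next_obs
        if info.get("success", False):   # keep only successful episodes
            data.extend(traj)
    return data

def distill_into_vla(vla, optimizer, data, num_epochs=1):
    for _ in range(num_epochs):
        for obs, action in data:
            loss = vla.bc_loss(obs, action)  # e.g., token-level cross-entropy on the action
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```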
This turns out to be much better than prior offline-to-online methods, which need to keep using pessimistic updates because they retain the offline data. WSRL's empirical performance is very good, even though the method is so simple.
December 11, 2024 at 3:03 PM
Prior methods for offline RL followed by online finetuning generally break down if we don't retain the offline data: essentially, the offline data is needed to "support" the knowledge from offline training, and if we remove it, the methods quickly collapse in the online phase.
December 11, 2024 at 3:02 PM
Can we finetune policies from offline RL *without retaining the offline data*? We typically keep the offline data around when finetuning online. It turns out we can avoid retaining it and get a much better offline-to-online algorithm, as discussed in zhouzypaul.github.io's new paper: 🧵👇
December 11, 2024 at 3:01 PM
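Reading the three posts above together, a minimal sketch of the no-retention finetuning recipe might look like the following: initialize the actor and critic from the offline-pretrained checkpoint, seed a fresh replay buffer with a short online warmup, then run standard online RL without ever sampling the offline dataset. The warmup detail and all names (buffer.add, critic.update, ...) are assumptions for illustration, not the paper's exact algorithm.

```python
# Sketch: online finetuning that never touches the offline dataset.
def finetune_without_offline_data(env, actor, critic, buffer,
                                  warmup_steps, total_steps):
    obs = env.reset()
    for step in range(total_steps):
        action = actor.act(obs)
        next_obs, reward, done, _ = env.step(action)
        buffer.add(obs, action, reward, next_obs, done)
        obs = env.reset() if done else next_obs

        # During warmup we only collect data with the pretrained policy;
        # afterwards, update with an ordinary (non-pessimistic) online loss.
        if step >= warmup_steps:
            batch = buffer.sample()
            critic.update(batch)          # e.g., TD learning on online data only
            actor.update(batch, critic)   # e.g., a SAC-style policy update
```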
We can theoretically prove that this leads to a bound on Q-values. We can then apply this method to train Transformer Q-functions for language modeling and dialogue, robotic control, and a variety of LLM and VLM tasks.

For more, check out the paper here: arxiv.org/abs/2411.05193
December 5, 2024 at 2:49 AM

The equations look a bit more complicated than the method really is; here is the method:
December 5, 2024 at 2:48 AM
New paper by Joey Hong shows how we can train LLMs with value-based RL for multi-turn tasks *just by turning probabilities into Q-values*! This provides an algorithm that can be used for LLMs, VLMs, robotics tasks, etc. with one simple loss function. Thread👇
December 5, 2024 at 2:46 AM
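A heavily simplified, hedged sketch of the "turning probabilities into Q-values" idea in this thread: keep the standard log-likelihood machinery, but weight the cross-entropy on the taken action token by a bootstrapped value target, so the model's token probabilities can be read as (scaled) Q-values. This is a simplification for intuition only, not the paper's exact objective; all names and shapes are illustrative.

```python
# Sketch: a weighted cross-entropy whose weights are bootstrapped value targets,
# letting token probabilities stand in for Q-values.
import torch
import torch.nn.functional as F

def probability_as_q_loss(model, target_model, states, actions,
                          rewards, next_states, gamma=0.99):
    """states / next_states: token-id tensors; actions: (batch,) chosen token ids."""
    log_probs = F.log_softmax(model(states), dim=-1)                  # (batch, vocab)
    chosen_log_prob = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)

    with torch.no_grad():
        next_probs = F.softmax(target_model(next_states), dim=-1)
        # Read the best next-token probability as a (scaled) value estimate.
        bootstrap_target = rewards + gamma * next_probs.max(dim=-1).values

    # Actions with higher bootstrapped targets get their probability, and
    # hence their implied Q-value, pushed up more.
    return -(bootstrap_target * chosen_log_prob).mean()
```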