NotSergeyLevine
@notsergeylevine.bsky.social
Bringing the Sergey posts until he does it himself.

Robotics. Reinforcement learning. AI.
To learn more about FAST, check out our blog post: pi.website/research/fast
For the full paper, see: pi.website/download/fas...
For more Pi research, see:
FAST: Efficient Robot Action Tokenization
Physical Intelligence is bringing general-purpose AI into the physical world.
pi.website
January 24, 2025 at 11:40 PM
x.com/i/status/187...

We are releasing the FAST tokenizer that we pre-trained on 1M robot action sequences. In our experiments it works well for tokenizing actions from many different kinds of robots. And it’s easy to use!

January 24, 2025 at 11:39 PM
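
A minimal usage sketch for the released tokenizer, assuming it is distributed as a Hugging Face Hub processor as the blog post suggests (the repo id, call signature, and decode method below are my assumptions; check the release for the authoritative interface):

# Tokenize a chunk of continuous robot actions with the pre-trained FAST tokenizer.
import numpy as np
from transformers import AutoProcessor

# Assumed Hub repo id; loaded with trust_remote_code because the processor
# ships its own tokenization code.
tokenizer = AutoProcessor.from_pretrained(
    "physical-intelligence/fast", trust_remote_code=True
)

# One chunk of 50 timesteps of 7-DoF actions (e.g., joint targets + gripper).
action_chunk = np.random.uniform(-1.0, 1.0, size=(1, 50, 7))

tokens = tokenizer(action_chunk)      # integer token ids for each chunk in the batch
recovered = tokenizer.decode(tokens)  # lossy reconstruction of the original actions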
FAST policies also follow language well and allow us to train the first generalist policies that can perform tasks out of the box in new environments, simply by prompting them in natural language.
January 24, 2025 at 11:39 PM
Compared to prior state-of-the-art VLAs like our own pi0 model, FAST policies train 5x faster – what used to take weeks can now be trained in days! 🦾
January 24, 2025 at 11:39 PM
Our FAST tokenizer uses the same techniques as JPEG compression to create compressed action tokens, which enable us to solve complicated tasks that could previously only be tackled with diffusion, like folding laundry, cleaning tables etc.

Blog (and paper + code): pi.website/research/fast
January 24, 2025 at 11:38 PM
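
A rough sketch of the JPEG-style idea as I read it from this post (not the official implementation): apply a discrete cosine transform along time to each action dimension of a chunk, quantize the coefficients, and let a BPE-style tokenizer merge the resulting integer sequence. The quantization scale below is an arbitrary illustrative value.

import numpy as np
from scipy.fft import dct, idct

def compress_chunk(actions, scale=10.0):
    """actions: (T, D) action chunk -> (T, D) quantized DCT coefficients."""
    coeffs = dct(actions, axis=0, norm="ortho")  # smooth signals concentrate energy in low frequencies
    return np.round(coeffs * scale).astype(np.int32)

def decompress_chunk(q, scale=10.0):
    return idct(q.astype(np.float64) / scale, axis=0, norm="ortho")

# A smooth 50-step, 7-dimensional trajectory (minimum-jerk-like profiles);
# real robot action chunks are similarly smooth.
t = np.linspace(0.0, 1.0, 50)[:, None]
s = 10 * t**3 - 15 * t**4 + 6 * t**5
chunk = s * np.linspace(0.2, 1.0, 7)[None, :]

q = compress_chunk(chunk)
# For smooth trajectories most coefficients quantize to zero, and a BPE
# tokenizer can merge the long zero runs into very few tokens.
print("fraction of zero coefficients:", (q == 0).mean())
print("max reconstruction error:", np.abs(decompress_chunk(q) - chunk).max())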
With FAST, we can train dexterous generalist policies via simple next token prediction, and get a 5x training speed-up over prior state of the art!
January 24, 2025 at 11:35 PM
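
A toy illustration of "policy training as next-token prediction" (my own sketch, not pi's training code): FAST action tokens are appended to the language/observation tokens, and the model is trained with an ordinary causal-LM cross-entropy masked to the action tokens. Assumes a Hugging Face-style causal LM that returns .logits.

import torch
import torch.nn.functional as F

def action_lm_loss(model, prompt_ids, action_ids):
    """prompt_ids: (B, P) context tokens; action_ids: (B, A) FAST action tokens."""
    input_ids = torch.cat([prompt_ids, action_ids], dim=1)
    logits = model(input_ids).logits[:, :-1]      # position t predicts token t + 1
    targets = input_ids[:, 1:].clone()
    targets[:, : prompt_ids.shape[1] - 1] = -100  # no loss on predicting the prompt itself
    return F.cross_entropy(
        logits.reshape(-1, logits.shape[-1]), targets.reshape(-1), ignore_index=-100
    )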
We evaluate this on connecting connectors (including new connectors), moving objects, etc.

For more, see the website: generalist-distillation.github.io

w/ @CharlesXu0124, @qiyang_li, @jianlanluo
RLDG: Robotic Generalist Policy Distillation via Reinforcement Learning
generalist-distillation.github.io
December 13, 2024 at 4:39 PM
performance of the RL policies. We show that this works much better than using teleoperation data (even at the same success rate!), and also allows training VLAs with a mix of RL and human data.

December 13, 2024 at 4:38 PM
The main idea is to train specialized RL policies for a few tasks (e.g., a few different connectors, etc.), and then use these policies to autonomously get data to train one generalist VLA. This VLA then generalizes more broadly than the individual RL policies, while still retaining the superhuman
December 13, 2024 at 4:38 PM
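
A rough sketch of the RLDG recipe as described in the two posts above (hypothetical helper names, not the released code): train a specialist RL policy per task, roll it out autonomously, keep the successful trajectories, and fine-tune one generalist VLA on the pooled data.

def rldg(tasks, train_rl_specialist, collect_rollouts, finetune_vla, vla):
    distill_data = []
    for task in tasks:
        specialist = train_rl_specialist(task)                # e.g., one policy per connector type
        rollouts = collect_rollouts(specialist, task, n=500)  # autonomous data collection
        distill_data += [traj for traj in rollouts if traj.success]
    # Human teleop demonstrations can be mixed into distill_data here as well.
    return finetune_vla(vla, distill_data)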
This turns out to be much better than prior offline-to-online methods, which need to keep using pessimistic updates because they retain the offline data. Empirical performance of WSRL is very good, even though it's so simple.
December 11, 2024 at 3:03 PM
Our method, WSRL (warm-start RL) is very simple: pretrain a policy and value function with an offline RL method, and then "warm up" the replay buffer in the online phase, collecting some data before starting to train with a *regular* online RL method.
December 11, 2024 at 3:02 PM
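
A minimal pseudocode sketch of WSRL as described in this post (hypothetical helper names): offline RL pretraining, a short warm-up phase that fills the online replay buffer by rolling out the pretrained policy, then standard online RL without retaining the offline dataset.

def wsrl(offline_dataset, offline_rl_pretrain, online_rl_update, env,
         warmup_steps=5_000, online_steps=500_000):
    policy, q_fn = offline_rl_pretrain(offline_dataset)  # any offline RL method
    buffer = []

    # Warm-up: collect some online transitions before any training updates.
    for transition in env.rollout(policy, steps=warmup_steps):
        buffer.append(transition)

    # Regular (non-pessimistic) online RL on the fresh buffer only;
    # the offline data is not kept around.
    for transition in env.rollout(policy, steps=online_steps):
        buffer.append(transition)
        policy, q_fn = online_rl_update(policy, q_fn, buffer)
    return policy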
Prior methods for offline RL with online RL finetuning generally break down if we don't retain the offline data -- essentially, the offline data is needed to "support" the knowledge from offline training, and if we remove it, the methods quickly collapse in the online phase.
December 11, 2024 at 3:02 PM
We can theoretically prove that this leads to a bound on Q-values. We can then apply this method to train Transformer Q-functions for language modeling and dialogue, robotic control, and a variety of LLM and VLM tasks.

For more, check out the paper here: arxiv.org/abs/2411.05193
December 5, 2024 at 2:49 AM

The equations look a bit more complicated than the method really is; this is the method:
December 5, 2024 at 2:48 AM
So the TL;DR is: weighted cross-entropy loss on each token with label smoothing can train the probabilities to approximately represent Q-values!

This means that greedily decoding actually leads to the greedy Q-value maximizing policy.
December 5, 2024 at 2:48 AM
Of course this has a problem: probabilities sum to 1, but Q-values do not. We can rescale the reward and get Q-values in the range [0, 1], and then take the remainder and spread it evenly across all of the other actions. This simply corresponds to a weighted cross entropy loss with label smoothing!
December 5, 2024 at 2:48 AM
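
A small sketch of the loss these last posts describe (my reading of the thread, not the paper's exact objective): rescale returns so target Q-values lie in [0, 1], put probability q on the token actually taken, spread the remaining 1 - q uniformly over the other tokens, and train with cross-entropy against that soft target. That is precisely a weighted cross-entropy with label smoothing, and the learned token probability then approximates the Q-value.

import torch
import torch.nn.functional as F

def q_weighted_ce(logits, action, q_target):
    """logits: (B, V) token logits; action: (B,) taken token; q_target: (B,) in [0, 1]."""
    B, V = logits.shape
    log_probs = F.log_softmax(logits, dim=-1)
    smooth = (1.0 - q_target) / (V - 1)            # leftover mass spread over the other tokens
    target = smooth.unsqueeze(1).expand(B, V).clone()
    target.scatter_(1, action.unsqueeze(1), q_target.unsqueeze(1))
    return -(target * log_probs).sum(dim=-1).mean()  # cross-entropy against the soft target

# Greedy decoding then selects argmax_a p(a | s), which approximates argmax_a Q(s, a).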