NotSergeyLevine
@notsergeylevine.bsky.social
Bringing the Sergey posts until he does it himself.

Robotics. Reinforcement learning. AI.
To learn more about FAST, check out our blog post: pi.website/research/fast
For the full paper, see: pi.website/download/fas...
For more Pi research, see:
FAST: Efficient Robot Action Tokenization
Physical Intelligence is bringing general-purpose AI into the physical world.
pi.website
January 24, 2025 at 11:40 PM
x.com/i/status/187...

We are releasing the FAST tokenizer that we pre-trained on 1M robot action sequences. In our experiments it works well for tokenizing actions from many different kinds of robots. And it’s easy to use!

January 24, 2025 at 11:39 PM
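
A minimal usage sketch for the released tokenizer, assuming it is distributed as a Hugging Face Hub processor as the blog post suggests (the repo id, call signature, and decode method below are my assumptions; check the release for the authoritative interface):

# Tokenize a chunk of continuous robot actions with the pre-trained FAST tokenizer.
import numpy as np
from transformers import AutoProcessor

# Assumed Hub repo id; loaded with trust_remote_code because the processor
# ships its own tokenization code.
tokenizer = AutoProcessor.from_pretrained(
    "physical-intelligence/fast", trust_remote_code=True
)

# One chunk of 50 timesteps of 7-DoF actions (e.g., joint targets + gripper).
action_chunk = np.random.uniform(-1.0, 1.0, size=(1, 50, 7))

tokens = tokenizer(action_chunk)      # integer token ids for each chunk in the batch
recovered = tokenizer.decode(tokens)  # lossy reconstruction of the original actions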
FAST policies also follow language well and allow us to train the first generalist policies that can perform tasks out of the box in new environments, simply by prompting them in natural language.
January 24, 2025 at 11:39 PM
Compared to prior state-of-the-art VLAs like our own pi0 model, FAST policies train 5x faster – what used to take weeks can now be trained in days! 🦾
January 24, 2025 at 11:39 PM
Our FAST tokenizer uses the same techniques as JPEG compression to create compressed action tokens, which enable us to solve complicated tasks that could previously only be tackled with diffusion, like folding laundry, cleaning tables etc.

Blog (and paper + code): pi.website/research/fast
January 24, 2025 at 11:38 PM
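
A rough sketch of the JPEG-style idea as I read it from this post (not the official implementation): apply a discrete cosine transform along time to each action dimension of a chunk, quantize the coefficients, and let a BPE-style tokenizer merge the resulting integer sequence. The quantization scale below is an arbitrary illustrative value.

import numpy as np
from scipy.fft import dct, idct

def compress_chunk(actions, scale=10.0):
    """actions: (T, D) action chunk -> (T, D) quantized DCT coefficients."""
    coeffs = dct(actions, axis=0, norm="ortho")  # smooth signals concentrate energy in low frequencies
    return np.round(coeffs * scale).astype(np.int32)

def decompress_chunk(q, scale=10.0):
    return idct(q.astype(np.float64) / scale, axis=0, norm="ortho")

# A smooth 50-step, 7-dimensional trajectory (minimum-jerk-like profiles);
# real robot action chunks are similarly smooth.
t = np.linspace(0.0, 1.0, 50)[:, None]
s = 10 * t**3 - 15 * t**4 + 6 * t**5
chunk = s * np.linspace(0.2, 1.0, 7)[None, :]

q = compress_chunk(chunk)
# For smooth trajectories most coefficients quantize to zero, and a BPE
# tokenizer can merge the long zero runs into very few tokens.
print("fraction of zero coefficients:", (q == 0).mean())
print("max reconstruction error:", np.abs(decompress_chunk(q) - chunk).max())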
With FAST, we can train dexterous generalist policies via simple next token prediction, and get a 5x training speed-up over prior state of the art!
January 24, 2025 at 11:35 PM
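
A toy illustration of "policy training as next-token prediction" (my own sketch, not pi's training code): FAST action tokens are appended to the language/observation tokens, and the model is trained with an ordinary causal-LM cross-entropy masked to the action tokens. Assumes a Hugging Face-style causal LM that returns .logits.

import torch
import torch.nn.functional as F

def action_lm_loss(model, prompt_ids, action_ids):
    """prompt_ids: (B, P) context tokens; action_ids: (B, A) FAST action tokens."""
    input_ids = torch.cat([prompt_ids, action_ids], dim=1)
    logits = model(input_ids).logits[:, :-1]      # position t predicts token t + 1
    targets = input_ids[:, 1:].clone()
    targets[:, : prompt_ids.shape[1] - 1] = -100  # no loss on predicting the prompt itself
    return F.cross_entropy(
        logits.reshape(-1, logits.shape[-1]), targets.reshape(-1), ignore_index=-100
    )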
We evaluate this on connecting connectors (including new connectors), moving objects, etc.

For more, see the website: generalist-distillation.github.io

w/ @CharlesXu0124, @qiyang_li, @jianlanluo
RLDG: Robotic Generalist Policy Distillation via Reinforcement Learning
generalist-distillation.github.io
December 13, 2024 at 4:39 PM
performance of the RL policies. We show that this works much better than using teleoperation data (even at the same success rate!), and also allows training VLAs with a mix of RL and human data.

December 13, 2024 at 4:38 PM
The main idea is to train specialized RL policies for a few tasks (e.g., a few different connectors, etc.), and then use these policies to autonomously get data to train one generalist VLA. This VLA then generalizes more broadly than the individual RL policies, while still retaining the superhuman
December 13, 2024 at 4:38 PM
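
A rough sketch of the RLDG recipe as described in the two posts above (hypothetical helper names, not the released code): train a specialist RL policy per task, roll it out autonomously, keep the successful trajectories, and fine-tune one generalist VLA on the pooled data.

def rldg(tasks, train_rl_specialist, collect_rollouts, finetune_vla, vla):
    distill_data = []
    for task in tasks:
        specialist = train_rl_specialist(task)                # e.g., one policy per connector type
        rollouts = collect_rollouts(specialist, task, n=500)  # autonomous data collection
        distill_data += [traj for traj in rollouts if traj.success]
    # Human teleop demonstrations can be mixed into distill_data here as well.
    return finetune_vla(vla, distill_data)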
This turns out to be much better than prior offline-to-online methods, which need to keep using pessimistic updates because they retain the offline data. Empirical performance of WSRL is very good, even though it's so simple.
December 11, 2024 at 3:03 PM
Our method, WSRL (warm-start RL) is very simple: pretrain a policy and value function with an offline RL method, and then "warm up" the replay buffer in the online phase, collecting some data before starting to train with a *regular* online RL method.
December 11, 2024 at 3:02 PM
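
A minimal pseudocode sketch of WSRL as described in this post (hypothetical helper names): offline RL pretraining, a short warm-up phase that fills the online replay buffer by rolling out the pretrained policy, then standard online RL without retaining the offline dataset.

def wsrl(offline_dataset, offline_rl_pretrain, online_rl_update, env,
         warmup_steps=5_000, online_steps=500_000):
    policy, q_fn = offline_rl_pretrain(offline_dataset)  # any offline RL method
    buffer = []

    # Warm-up: collect some online transitions before any training updates.
    for transition in env.rollout(policy, steps=warmup_steps):
        buffer.append(transition)

    # Regular (non-pessimistic) online RL on the fresh buffer only;
    # the offline data is not kept around.
    for transition in env.rollout(policy, steps=online_steps):
        buffer.append(transition)
        policy, q_fn = online_rl_update(policy, q_fn, buffer)
    return policy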
Prior methods for offline RL with online RL finetuning generally break down if we don't retain the offline data -- essentially, the offline data is needed to "support" the knowledge from offline training, and if we remove it, the methods quickly collapse in the online phase.
December 11, 2024 at 3:02 PM
We can theoretically prove that this leads to a bound on Q-values. We can then apply this method to train Transformer Q-functions for language modeling and dialogue, robotic control, and a variety of LLM and VLM tasks.

For more, check out the paper here: arxiv.org/abs/2411.05193
December 5, 2024 at 2:49 AM

The equations look a bit more complicated than the method really is; this is the method:
December 5, 2024 at 2:48 AM
So the TL;DR is: weighted cross-entropy loss on each token with label smoothing can train the probabilities to approximately represent Q-values!

This means that greedily decoding actually leads to the greedy Q-value maximizing policy.
December 5, 2024 at 2:48 AM
Of course this has a problem: probabilities sum to 1, but Q-values do not. We can rescale the reward and get Q-values in the range [0, 1], and then take the remainder and spread it evenly across all of the other actions. This simply corresponds to a weighted cross entropy loss with label smoothing!
December 5, 2024 at 2:48 AM
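
A small sketch of the loss these last posts describe (my reading of the thread, not the paper's exact objective): rescale returns so target Q-values lie in [0, 1], put probability q on the token actually taken, spread the remaining 1 - q uniformly over the other tokens, and train with cross-entropy against that soft target. That is precisely a weighted cross-entropy with label smoothing, and the learned token probability then approximates the Q-value.

import torch
import torch.nn.functional as F

def q_weighted_ce(logits, action, q_target):
    """logits: (B, V) token logits; action: (B,) taken token; q_target: (B,) in [0, 1]."""
    B, V = logits.shape
    log_probs = F.log_softmax(logits, dim=-1)
    smooth = (1.0 - q_target) / (V - 1)            # leftover mass spread over the other tokens
    target = smooth.unsqueeze(1).expand(B, V).clone()
    target.scatter_(1, action.unsqueeze(1), q_target.unsqueeze(1))
    return -(target * log_probs).sum(dim=-1).mean()  # cross-entropy against the soft target

# Greedy decoding then selects argmax_a p(a | s), which approximates argmax_a Q(s, a).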