Harshit Sikchi
@harshitsikchi.bsky.social
Research @OpenAI. I study Reinforcement Learning. PhD from UT Austin. Previously FAIR Paris, Meta US, NVIDIA, CMU, and IIT Kharagpur.
Website: https://hari-sikchi.github.io/
This was a collaborative work with Siddhant Agarwal, @pranayajajoo.bsky.social, Samyak Parajuli, Caleb Chuck, Max Rudolph, Peter Stone, Amy Zhang, and @scottniekum.bsky.social. Also a joint effort across universities: @texasrobotics.bsky.social, UMass Amherst, and U Alberta.
December 11, 2024 at 7:12 AM
(10/n) To the best of our knowledge, this is the first zero-shot, end-to-end unsupervised algorithm that provides a pathway from language to low-level control.
Check out the work here for more details:
Paper: arxiv.org/abs/2412.05718
Website: hari-sikchi.github.io/rlzero/
RL Zero: Zero-Shot Language to Behaviors without any Supervision
December 11, 2024 at 7:11 AM
(9/n) For instance,
a) future approaches can instantly initialize a behavior by prompting and then fine-tune it,
b) or plan in language space and translate each instruction into low-level control,
c) and as generative video models (e.g., Sora) keep improving, RLZero will only get better.
December 11, 2024 at 7:11 AM
(8/n) Zero-shot (no inference-time training, so no costly or unsafe RL runs at deployment) + unsupervised (no costly dataset labeling, a big issue for robotics!) is a promising recipe for scaling up robot learning.
December 11, 2024 at 7:11 AM
(7/n) This project is close to my heart: it realizes a dream I shared with @scottniekum.bsky.social when I started my PhD, to go beyond imitation learning's limitation of merely matching observations and instead capture the semantic understanding of what doing a task means.
December 11, 2024 at 7:11 AM
(6/n) With RLZero, you can just pass in a YouTube video and ask an agent to mimic the behavior instantly. This brings us closer to true zero-shot cross-embodiment transfer.
December 11, 2024 at 7:11 AM
(5/n) RLZero’s Prompt to Policy: Asking a humanoid agent to perform a headstand.
December 11, 2024 at 7:11 AM
(4/n) Reward is an inconvenient and easily hackable form of task specification. Now, we can prompt and obtain behaviors zero-shot with language. Example: Asking a walker agent to perform a cartwheel.
December 11, 2024 at 7:11 AM
(3/n) Given a text prompt, RL Zero imagines 🧠 the agent's expected behavior using generative video models. These imaginations are then projected and grounded onto observations the agent has encountered in the past. Finally, zero-shot imitation learning converts the grounded observations into a policy.
December 11, 2024 at 7:11 AM
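Here is a minimal, heavily simplified sketch of the imagine → ground → imitate pipeline from (3/n). All functions, shapes, and the nearest-neighbour grounding step are my own stand-ins for illustration, not the released RLZero code.

# Hypothetical sketch of the RLZero pipeline; every component is a placeholder.
import numpy as np

def imagine(prompt: str, num_frames: int = 16) -> np.ndarray:
    """Stand-in for a text-to-video generative model: returns imagined frame embeddings."""
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    return rng.random((num_frames, 64))

def ground(imagined: np.ndarray, replay_obs: np.ndarray) -> np.ndarray:
    """Project each imagined frame onto the nearest observation the agent has
    actually seen, so the target is expressed in the agent's own observation space."""
    # Nearest-neighbour matching in embedding space (an assumption about "grounding").
    dists = np.linalg.norm(imagined[:, None, :] - replay_obs[None, :, :], axis=-1)
    return replay_obs[dists.argmin(axis=1)]

def zero_shot_imitate(grounded_obs: np.ndarray):
    """Stand-in for zero-shot imitation: in RLZero this step uses a pretrained
    unsupervised RL backbone to output, without further training, a policy whose
    visitation matches the grounded observations."""
    return lambda obs: grounded_obs[0]  # placeholder policy

replay_obs = np.random.default_rng(1).random((1000, 64))  # agent's past experience
policy = zero_shot_imitate(ground(imagine("do a cartwheel"), replay_obs))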
(2/n) RL Zero enables prompt-to-policy generation, and we believe this unlocks new capabilities in scaling up language-conditioned RL, providing an interpretable link between RL agents and humans and achieving true cross-embodiment transfer.
December 11, 2024 at 7:11 AM
(4/5) Our idea draws inspiration from the Linear Programming view of RL that focuses on visitations as the primary optimization object and has also recently led to new developments in RL algorithms (arxiv.org/abs/2302.08560)
Dual RL: Unification and New Methods for Reinforcement and Imitation Learning
December 3, 2024 at 12:33 AM
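For context, the standard visitation-based linear program that this view builds on (written here in common notation, not copied from the paper) treats the discounted state-action visitation d as the optimization variable:

\[
\max_{d \ge 0} \; \sum_{s,a} d(s,a)\, r(s,a)
\quad \text{s.t.} \quad
\sum_{a} d(s,a) \;=\; (1-\gamma)\,\mu_0(s) \;+\; \gamma \sum_{s',a'} P(s \mid s',a')\, d(s',a') \quad \forall s,
\]

where \mu_0 is the initial state distribution and P the dynamics; the optimizer of this LP is the visitation distribution of an optimal policy.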
(3/5) We give an efficient algorithm to learn such a basis, and once it is learned during pretraining, inference amounts to solving a simple linear program. This lets PSM do zero-shot RL in a way that is more performant and stable than baselines.
December 3, 2024 at 12:33 AM
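As a rough illustration of "inference amounts to a simple linear program": once a basis Phi and offset b for successor measures are available, a test-time reward only enters through a d-dimensional vector, and the coefficients can be found with an off-the-shelf LP solver. The shapes and box constraints below are my own assumptions, not the paper's exact formulation.

import numpy as np
from scipy.optimize import linprog

# Hypothetical pretrained quantities (all shapes are assumptions):
#   Phi: (num_state_actions, d) -- learned basis of successor measures
#   b:   (num_state_actions,)   -- affine offset of the valid set
#   r:   (num_state_actions,)   -- reward revealed only at test time
rng = np.random.default_rng(0)
num_sa, d = 500, 16
Phi = rng.normal(size=(num_sa, d))
b = rng.normal(size=num_sa)
r = rng.normal(size=num_sa)

# Expected return of the policy encoded by coefficients w is linear in w:
#   J(w) = r @ (Phi @ w + b) = (Phi.T @ r) @ w + r @ b
c = -(Phi.T @ r)  # linprog minimizes, so negate to maximize return

# Placeholder box constraints standing in for "w stays inside the valid affine
# set of successor measures"; the real constraints come from pretraining.
res = linprog(c, bounds=[(-1.0, 1.0)] * d, method="highs")
w_star = res.x  # coefficients of the zero-shot policy's successor measure Phi @ w_star + b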
(2/5) Our work, Proto-Successor Measures (PSM), shows that valid successor measures form an affine set. PSM learns a basis of this affine set, where the dimensionality of the basis controls the compression of the MDP (i.e., the information lost). After all, learning is compression.
December 3, 2024 at 12:33 AM
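In my own notation (an illustrative paraphrase of (2/5), not the paper's exact statement): if M^π(s,a,·) denotes the successor measure of policy π, PSM posits a shared basis Φ and offset b learned once during pretraining, so that

\[
M^{\pi}(s,a,s^{+}) \;\approx\; \sum_{i=1}^{d} w^{\pi}_{i}\,\Phi_{i}(s,a,s^{+}) \;+\; b(s,a,s^{+}),
\]

where only the d-dimensional coefficient vector w^π depends on the policy, and d controls how much of the MDP is compressed away.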
we should catch up if you are available!
December 1, 2024 at 12:42 AM