Ankur Handa
ankurhandos.bsky.social
Training robots in simulation.
This work was mainly a collaboration for GTC so it all came together quickly in 2 months and we didn't want to change much :)

We are working on improving the system and will release a tech report in a few months.
April 29, 2025 at 3:09 AM
The robot is rewarded for lifting the object beyond a certain height to ensure that the grasp is stable. So it lifts the object to that height first and then does the dropping. This was us being lazy and not changing the reward - vestigial stuff. The lift reward here.
April 29, 2025 at 1:33 AM
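A minimal sketch of the kind of lift-then-hold shaping term described above, assuming the object height is readable from simulator state. The names (`LIFT_HEIGHT`, `object_z`, `grasped`) and the exact shaping are illustrative, not the actual reward from the paper:

```python
# Hypothetical lift-reward sketch; constants and shaping are assumptions.
LIFT_HEIGHT = 0.3  # target lift height in metres (assumed value)

def lift_reward(object_z: float, grasped: bool) -> float:
    """Reward rises as the grasped object approaches the target height."""
    if not grasped:
        return 0.0
    # Dense shaping: fraction of the target height achieved, capped at 1.
    return min(object_z / LIFT_HEIGHT, 1.0)
```

Because the reward saturates at the target height, the policy first lifts to that height before doing anything else, which matches the lift-then-drop behaviour described.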
Stereo camera images are what the networks use as input. They go directly into the network without any pre-processing, and out comes an action that is sent to the robot as a target.
April 29, 2025 at 12:54 AM
*cupola
April 7, 2025 at 2:31 AM
I have been using Cursor and it's great. It's a VS Code fork, so I like it, as I have been using VS Code for years now.
February 13, 2025 at 4:37 AM
Just remembered this session, "Discussion for Direct versus Features Session" link.springer.com/chapter/10.1... that I cited in my PhD thesis but can't download now as it is behind a paywall. But I remember it being very interesting and fun to read.
February 12, 2025 at 5:35 AM
The message here is you should try to stay as close to raw pixels as possible - it just works out much better in the long run.

I love this Hacker News comment that I saw on Twitter a few years ago.
February 11, 2025 at 3:57 AM
Michal Irani, Michael Black, Padmanabhan Anandan, Rick Szeliski, and Harpreet Sawhney were all looking at recovering the camera pose transformation directly from image pixels. And the tracking in Aria glasses uses "direct" methods.
February 11, 2025 at 3:55 AM
Great work by my colleagues Ritvik Singh, Karl Van Wyk, Arthur Allshire and Nathan Ratliff.
February 10, 2025 at 5:04 AM
This is the next in line in our dex-series of work, where we started off with pose estimation as the representation of the object and gradually moved towards a more general, end-to-end, direct pixels-to-action mapping.
February 10, 2025 at 5:03 AM
When doing distillation, we also regress to the location of the object. This serves as a diagnostic tool to see what the network is predicting, and can feed any state machine built on top.
February 10, 2025 at 5:03 AM
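A minimal sketch of a distillation objective with an auxiliary object-location head, assuming the teacher's action and the ground-truth object position are available from simulation. The function names and the weight `AUX_W` are assumed, not taken from the paper:

```python
import numpy as np

# Hypothetical combined loss; AUX_W is an assumed hyperparameter.
AUX_W = 0.1

def distill_loss(student_action, teacher_action, pred_obj_pos, true_obj_pos):
    """Imitation (MSE to the teacher's action) + auxiliary position regression."""
    imitation = np.mean((student_action - teacher_action) ** 2)
    aux = np.mean((pred_obj_pos - true_obj_pos) ** 2)
    return imitation + AUX_W * aux
```

At test time the auxiliary head costs nothing extra: the predicted object position is just read off as a diagnostic, or consumed by whatever state machine sits on top.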
Our (stereo) vision network takes inspiration from the dust3r/mast3r work (with no explicit epipolar geometry imposed) where image embeddings are passed to a transformer with cross attention.
February 10, 2025 at 5:02 AM
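A toy illustration of the cross-attention step described above: tokens from one view query keys/values from the other, with no epipolar constraint imposed. This is a single-head simplification with made-up dimensions, not the actual DUSt3R/MASt3R architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(left_tokens, right_tokens, Wq, Wk, Wv):
    """Left-view tokens attend over the right view's keys/values."""
    q = left_tokens @ Wq                               # (n_left, d)
    k = right_tokens @ Wk                              # (n_right, d)
    v = right_tokens @ Wv                              # (n_right, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    return attn @ v                                    # (n_left, d)

rng = np.random.default_rng(0)
d = 8
L = rng.standard_normal((4, d))   # left-view token embeddings (toy sizes)
R = rng.standard_normal((6, d))   # right-view token embeddings
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out = cross_attend(L, R, Wq, Wk, Wv)
print(out.shape)  # (4, 8)
```

The output keeps one fused embedding per left-view token, so the correspondence between the two views is learned entirely from data rather than imposed geometrically.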
This approach differs from the common two-stage pipeline, where grasping or pick location is first regressed and then followed by motion planning. Instead, it integrates both stages into a single process and trains the entire system end-to-end using RL.
February 10, 2025 at 5:02 AM
The benefits of training with parallel tiled rendering in simulation are still underappreciated. With modern tools like Scene Synthesizer and ControlNets, which transform synthetic images into photorealistic ones, the value of simulation-based training will only continue to grow.
February 10, 2025 at 5:01 AM
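A rough sketch of what makes tiled rendering cheap to consume, assuming all environment cameras render into one shared atlas so per-env observations are just array views into it. The grid layout and sizes here are made up for illustration:

```python
import numpy as np

# Assumed layout: 16 env cameras tiled into a 4x4 grid of 64x64 RGB tiles.
N_ENVS, TILE_H, TILE_W, C = 16, 64, 64, 3
COLS = 4
ROWS = N_ENVS // COLS

atlas = np.zeros((ROWS * TILE_H, COLS * TILE_W, C), dtype=np.uint8)

def env_view(atlas, i):
    """Zero-copy view of environment i's tile in the shared atlas."""
    r, c = divmod(i, COLS)
    return atlas[r * TILE_H:(r + 1) * TILE_H, c * TILE_W:(c + 1) * TILE_W]

obs = env_view(atlas, 5)
print(obs.shape)  # (64, 64, 3)
```

One render pass fills the whole atlas, and slicing per environment is free, which is what makes image-based training across thousands of parallel environments tractable.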
Depth sensors are noisy and haven't seen major improvements in a while, and pure vision-based systems have caught up; most frontier models today use raw RGB pixels. We always wanted to move towards direct RGB-based control, and this work is our first attempt at doing so.
February 10, 2025 at 5:01 AM
We train a teacher via RL with state vectors and a student via distillation on images, to learn control for a 23-DoF multi-fingered hand-and-arm system. Doing end-to-end with real data, as in BC-like systems, is already hard, but doing end-to-end with simulation is even harder due to the sim-to-real gap.
February 10, 2025 at 5:01 AM
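A toy sketch of the teacher-student split described above: a state-based teacher supplies target actions and an image-based student is regressed onto them. Both policies are linear stand-ins and the states/images are random arrays; all names, dimensions, and the learning rate are illustrative (the real teacher is RL-trained and the student is a deep vision network):

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, PIXEL_DIM, ACT_DIM = 10, 100, 23  # 23-DoF hand-arm action space

W_teacher = rng.standard_normal((STATE_DIM, ACT_DIM))  # frozen "teacher"
W_student = np.zeros((PIXEL_DIM, ACT_DIM))             # learned "student"

for step in range(200):
    state = rng.standard_normal(STATE_DIM)
    pixels = rng.standard_normal(PIXEL_DIM)  # stand-in for a rendered image
    target = state @ W_teacher               # teacher labels the action
    pred = pixels @ W_student
    grad = np.outer(pixels, pred - target)   # gradient of the MSE loss
    W_student -= 1e-3 * grad                 # SGD distillation step
```

The point is the structure of the loop, not the learning outcome: the teacher sees privileged state, the student sees only pixels, and the supervision signal is the teacher's action.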
Can you share the other ones that you liked?
November 20, 2024 at 7:25 PM
Why were you not convinced before? Is it because it uses images and did a lot more scaling than the previous work?
November 20, 2024 at 6:42 PM