vyeevani.github.io
Problem: no data
Solution:
Bootstrap intelligence from a VLM
1. Start with an off-the-shelf VLM
2. Collect rollouts for a set of tasks, using the VLM as a code-as-policy controller
3. Run GRPO over the rollouts
4. Go back to step 2 and repeat
5. After the loop, run offline RL on the VLM to get a direct obs -> act policy
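The GRPO step in the loop above scores each rollout relative to the other rollouts collected for the same task, so no learned value function is needed. A minimal sketch of that group-relative advantage, assuming one scalar reward per rollout; the function name, the `eps` value, and the choice of population standard deviation are illustrative assumptions, not something these notes specify:

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts for the same task.

    Assumption: `rewards` holds one scalar per rollout. `eps` is a
    hypothetical constant that avoids dividing by zero when every
    rollout in the group received the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts of one task: two succeeded (reward 1), two failed (reward 0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Successful rollouts get a positive advantage and failed ones a negative advantage. If every rollout in a group gets the same reward, all advantages are zero and that task contributes no gradient.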
1. Start with a weak base model and a set of problems ranging from very simple to very hard
3. Sort the problems by how well the model does on them
4. Pick a batch of 75% problems the model can do and 25% it can’t. RL with GRPO. On the 25%, use 10x the rollouts and a high sampling temperature
5. Repeat from step 3 until all problems are solved
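The sort-and-pick steps above can be sketched as a batch builder. This is a rough illustration under stated assumptions: the notes do not say how "can do" is decided, so the `solved_threshold` cutoff on success rate is hypothetical, as are `RolloutSpec`, `build_batch`, and every default value (batch size, base rollout count, temperatures).

```python
import random
from dataclasses import dataclass


@dataclass
class RolloutSpec:
    problem: str
    num_rollouts: int
    temperature: float


def build_batch(success_rates, batch_size=8, base_rollouts=4,
                base_temp=0.7, hard_temp=1.2, solved_threshold=0.5, seed=0):
    """Build one training batch: ~75% solvable problems, ~25% unsolved.

    Assumption: `success_rates` maps problem id -> fraction of recent
    rollouts that succeeded, and a problem counts as "can do" when that
    fraction is at least `solved_threshold` (a hypothetical cutoff).
    """
    rng = random.Random(seed)
    # Sort problems from best- to worst-performing.
    ranked = sorted(success_rates, key=success_rates.get, reverse=True)
    can_do = [p for p in ranked if success_rates[p] >= solved_threshold]
    cannot = [p for p in ranked if success_rates[p] < solved_threshold]

    # A quarter of the batch comes from problems the model cannot do yet.
    n_hard = min(len(cannot), batch_size // 4)
    n_easy = min(len(can_do), batch_size - n_hard)

    batch = [RolloutSpec(p, base_rollouts, base_temp)
             for p in rng.sample(can_do, n_easy)]
    # Hard problems get 10x the rollouts and a higher sampling temperature.
    batch += [RolloutSpec(p, 10 * base_rollouts, hard_temp)
              for p in rng.sample(cannot, n_hard)]
    return batch


rates = {"a": 1.0, "b": 0.9, "c": 0.8, "d": 0.7, "e": 0.6,
         "f": 0.55, "g": 0.1, "h": 0.0, "i": 0.0}
batch = build_batch(rates, batch_size=8)
```

The extra rollouts and higher temperature on the hard quarter raise the chance that at least one rollout succeeds. Without a success in the group, all rewards are equal and GRPO has no signal to learn from on that problem.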