Fri, Dec 5, 2025
11:00 AM – 2:00 PM PST
Exhibit Hall C,D,E #4505
Pic: (fancy) knots at the USS Midway Museum near the San Diego Convention Center
KnotGym gives us a lightweight yet expressive testbed for multi-modal long-horizon reasoning and planning.
📄 Paper: arxiv.org/abs/2505.18028
🔗 Website: lil-lab.github.io/knotgym
Joint work with @yoavartzi.com
➡️ Untangle a knot
➡️ Tie a goal knot
➡️ Convert one knot into another
All within Gym + MuJoCo, easy to run, hard to solve.
Even strong RL baselines and VLMs cannot beat random at crossing number # X=3 (though they fail for different reasons).
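For readers wondering what "beating random" means in a Gym-style setup, here is a minimal sketch of a random-policy baseline. The post does not give KnotGym's environment IDs, observation space, or action space, so `ToyKnotEnv` is a hypothetical stand-in and not the KnotGym API: it only mimics the Gymnasium `reset`/`step` return convention and reduces the "untangle" task to driving a crossing counter to zero.

```python
import random


class ToyKnotEnv:
    """Hypothetical stand-in, NOT the KnotGym API. The state is just a crossing count."""

    def __init__(self, num_crossings=3, max_steps=50):
        self.num_crossings = num_crossings
        self.max_steps = max_steps

    def reset(self, seed=None):
        # Gymnasium convention: reset returns (observation, info).
        self.rng = random.Random(seed)
        self.crossings = self.num_crossings
        self.steps = 0
        return self.crossings, {}

    def sample_action(self):
        # Toy action set: remove a crossing, do nothing, or add a crossing.
        return self.rng.choice([-1, 0, 1])

    def step(self, action):
        # Gymnasium convention: step returns
        # (observation, reward, terminated, truncated, info).
        self.steps += 1
        self.crossings = max(0, self.crossings + action)
        terminated = self.crossings == 0          # goal reached: untangled
        truncated = self.steps >= self.max_steps  # ran out of time
        reward = 1.0 if terminated else 0.0
        return self.crossings, reward, terminated, truncated, {}


def random_success_rate(env, episodes=200, seed=0):
    """Fraction of episodes in which a uniformly random policy reaches the goal."""
    successes = 0
    for ep in range(episodes):
        env.reset(seed=seed + ep)
        terminated = truncated = False
        while not (terminated or truncated):
            _, _, terminated, truncated, _ = env.step(env.sample_action())
        successes += int(terminated)
    return successes / episodes


if __name__ == "__main__":
    print(random_success_rate(ToyKnotEnv(num_crossings=3)))
```

A learned policy would replace `env.sample_action()`, and its success rate on the same seeds is the number to compare against this baseline.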
Knots are simple to see but deep to reason about.
✔ Verifiable outcomes
✔ Structured complexity (crossing number # X)
✔ A ladder of difficulty for generalization
Perfect for studying long-horizon visual reasoning and test-time scaling in visual space.
Jupyter shines at plotting and interactive demos. One use case that a console or script does not cover well: prompt engineering. Jupyter (1) keeps model weights loaded between runs and (2) can fold or clear long historical outputs like logits.
goodresearch.dev is good.
A guilty pleasure of mine is reading not only good research repos, but also their full git history if released. Factored code is not always easy to change, and a big refactor commit says something.
And caring for others, that’s not exactly part of a researcher’s job description or perf review.
I made up the second one to save myself from greater disappointment.
And disclaimer: this is absolutely not affiliated with NeurIPS.
Credit goes to everyone who participated in this mini poll. Thank you - you made my day!