I occasionally post AI memes.
yashbhalgat.github.io
Topics: 3D-VLA models, LLM agents for 3D scene understanding, Robotic control with language.
📢 Call for papers: Deadline – April 20, 2025
🌐 Details: 3d-llm-vla.github.io
#llm #3d #Robotics #ai
- Matches/exceeds on most tasks
- Better at math & Chinese tasks
- Strong in-context learning
- Improved dialogue capabilities
(7/8) 🧵
On tasks requiring bidirectional reasoning, it outperforms GPT-4 and maintains consistent performance in both forward/reverse directions.
(6/8) 🧵
- Low-confidence remasking: Remask tokens the model is least sure about
- Semi-autoregressive: Generate in blocks left-to-right while maintaining bidirectional context
(5/8) 🧵
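The low-confidence remasking step can be sketched like this (a minimal NumPy illustration, not LLaDA's actual implementation; `mask_id`, the function name, and the per-batch loop are my assumptions):

```python
import numpy as np

def remask_low_confidence(logits, tokens, mask_id, n_remask):
    """Remask the n_remask positions whose predicted tokens the
    model is least confident about, keeping the rest fixed."""
    # softmax over the vocabulary axis
    z = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    # confidence = probability assigned to the chosen token at each position
    conf = np.take_along_axis(probs, tokens[..., None], axis=-1).squeeze(-1)
    out = tokens.copy()
    for b in range(tokens.shape[0]):
        worst = np.argsort(conf[b])[:n_remask]  # least-confident positions
        out[b, worst] = mask_id
    return out
```

Each sampling step would then re-predict only the remasked positions, so confident tokens persist across steps.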
The model learns to predict the original tokens given a partially masked sequence; no causal masking is used.
The same technique also enables instruction-conditioned generation, with no modifications.
(4/8) 🧵
LLaDA's forward process gradually masks tokens, while the reverse process predicts all of them simultaneously. This enables bidirectional modeling.
(3/8) 🧵
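The forward process amounts to independent per-token masking at ratio t (a toy sketch; `MASK` is a placeholder id, not LLaDA's actual mask token):

```python
import numpy as np

rng = np.random.default_rng(0)
MASK = -1  # placeholder mask-token id

def forward_mask(x0, t):
    """Mask each token independently with probability t in [0, 1]."""
    keep = rng.random(x0.shape) >= t
    return np.where(keep, x0, MASK)

x0 = np.arange(10)
fully_clean = forward_mask(x0, 0.0)    # t=0 leaves the sequence intact
fully_masked = forward_mask(x0, 1.0)   # t=1 masks everything
```

At t=1 the sequence is fully masked and at t=0 fully recovered, so the reverse model trains across the whole masking spectrum.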
- Successful scaling of masked diffusion to LLM scale (8B params)
- Masking with variable ratios for forward/reverse process
- Smart remasking strategies for generation, incl. semi-autoregressive
- SOTA on reversal tasks, matching Llama 3 on others
(2/8) 🧵
- Consistent Light Attention (CLA) module for stable lighting across frames
- Progressive Light Fusion for smooth temporal transitions
- Works with ANY video diffusion model (AnimateDiff, CogVideoX)
- Zero-shot - no fine-tuning needed!
- BFS-ordered skeleton sequence representation
- Autoregressive joint prediction with diffusion sampling
- Hybrid attention masking: full self-attention for shape tokens, causal attention for skeleton
- e2e trainable pipeline without clustering/MST ops
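The hybrid attention masking could be realized roughly as below (a sketch under my own token layout and naming; the paper's actual arrangement may differ):

```python
import numpy as np

def hybrid_attention_mask(n_shape, n_skel):
    """Boolean attention mask (True = may attend): shape tokens get full
    bidirectional self-attention; skeleton tokens see all shape tokens
    plus earlier skeleton tokens (causal)."""
    n = n_shape + n_skel
    mask = np.zeros((n, n), dtype=bool)
    mask[:n_shape, :n_shape] = True   # shape <-> shape: full attention
    mask[n_shape:, :n_shape] = True   # skeleton -> shape: always allowed
    # skeleton -> skeleton: causal (lower-triangular)
    mask[n_shape:, n_shape:] = np.tril(np.ones((n_skel, n_skel), dtype=bool))
    return mask
```

In this sketch, shape tokens never attend to skeleton tokens, so the shape encoding stays independent of the autoregressive joint-prediction order.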
New work from UCSD and Adobe:
"RigAnything: Template-Free Autoregressive Rigging
for Diverse 3D Assets" Liu et al.
tl;dr: reduces rigging time from 2 mins to 2 secs, works on any shape category & doesn't need predefined templates! 🚀
tl;dr: Novel framework that integrates 3D awareness into the VAE latent space using correspondence-aware encoding, enabling high-quality rendered images with ~50% memory savings.
(1/n) 🧵
Their ArAE model controls face count for varying detail while preserving mesh topology.
Their mesh tokenization algorithm (adapted from EdgeBreaker) achieves ~50% compression (4-5 tokens per face vs 9), making training efficient.
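The ~50% figure is consistent with a naive per-face coordinate tokenization, assuming the 9 tokens come from 3 vertices × 3 coordinates per triangle (my reading; the 4-5 range is as reported):

```python
naive = 3 * 3                 # 3 vertices x (x, y, z) tokens per face
compressed = (4 + 5) / 2      # reported 4-5 tokens per face, midpoint
ratio = compressed / naive
print(f"{ratio:.0%} of the naive token count")  # ~50%
```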
Code available (pretrained models too) 🤩: github.com/wyysf-98/Cra...
(1/n) 🧵