Yash Bhalgat
@ysbhalgat.bsky.social
PhD at VGG, Oxford w/ Andrew Zisserman, Andrea Vedaldi, Joao Henriques, Iro Laina. Past: Senior RS Qualcomm #AI #Research, UMich, IIT Bombay.

I occasionally post AI memes.

yashbhalgat.github.io
Excited to announce the 1st Workshop on 3D-LLM/VLA at #CVPR2025! 🚀 @cvprconference.bsky.social

Topics: 3D-VLA models, LLM agents for 3D scene understanding, Robotic control with language.

📢 Call for papers: Deadline – April 20, 2025

🌐 Details: 3d-llm-vla.github.io

#llm #3d #Robotics #ai
March 23, 2025 at 9:35 PM
Results vs. Llama 3 8B:

- Matches/exceeds on most tasks
- Better at math & Chinese tasks
- Strong in-context learning
- Improved dialogue capabilities

(7/8) 🧵
February 18, 2025 at 3:07 PM
A major result: LLaDA breaks the "reversal curse" that plagues autoregressive models. 🔄

On tasks requiring bidirectional reasoning, it outperforms GPT-4 and maintains consistent performance in both forward/reverse directions.

(6/8) 🧵
February 18, 2025 at 3:07 PM
For generation, they introduce clever remasking strategies:

- Low-confidence remasking: Remask tokens the model is least sure about

- Semi-autoregressive: Generate in blocks left-to-right while maintaining bidirectional context

(5/8) 🧵
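
Here's a minimal sketch of how I picture the low-confidence remasking step (my own PyTorch illustration, not the authors' code; mask_id and num_to_remask are assumptions):

import torch

def low_confidence_remask(logits, x_t, mask_id, num_to_remask):
    # logits: (seq_len, vocab) model outputs for the current partially masked sequence
    # x_t: (seq_len,) current token ids, some of them equal to mask_id
    probs = torch.softmax(logits, dim=-1)
    pred = probs.argmax(dim=-1)              # candidate fill-in for every position
    conf = probs.max(dim=-1).values          # model confidence in each prediction
    was_masked = x_t == mask_id
    x_next = torch.where(was_masked, pred, x_t)                   # commit predictions at masked slots
    conf = torch.where(was_masked, conf, torch.ones_like(conf))   # never remask committed tokens
    # remask the least confident predictions so they get another pass
    remask_idx = conf.topk(num_to_remask, largest=False).indices
    x_next[remask_idx] = mask_id
    return x_next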
February 18, 2025 at 3:07 PM
Training uses random masking ratio t ∈ [0,1] for each sequence.

The model learns to predict the original tokens given a partially masked sequence. No causal masking is used.

The same recipe also enables instruction-conditioned generation, with no modifications.

(4/8) 🧵
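
For intuition, the objective boils down to something like this (a rough sketch, assuming the standard 1/t weighting on masked positions; not the paper's exact code):

import torch
import torch.nn.functional as F

def masked_diffusion_loss(model, x0, mask_id):
    # x0: (batch, seq_len) clean token ids
    b, n = x0.shape
    t = torch.rand(b, 1)                         # one masking ratio per sequence, t ~ U[0,1]
    masked = (torch.rand(b, n) < t).float()      # mask each token independently w.p. t
    x_t = torch.where(masked.bool(), torch.full_like(x0, mask_id), x0)
    logits = model(x_t)                          # fully bidirectional, no causal mask
    ce = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                         x0.reshape(-1), reduction="none").reshape(b, n)
    # cross-entropy only on masked positions, upweighted by 1/t
    return (ce * masked / t.clamp(min=1e-3)).sum() / masked.sum().clamp(min=1.0)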
February 18, 2025 at 3:06 PM
💡Core insight: Generative modeling principles, not autoregression, give LLMs their power.

LLaDA's forward process gradually masks tokens while reverse process predicts them simultaneously. This enables bidirectional modeling.

(3/8) 🧵
February 18, 2025 at 3:06 PM
Key highlights:
- Successful scaling of masked diffusion to LLM scale (8B params)
- Masking with variable ratios for forward/reverse process
- Smart remasking strategies for generation, incl. semi-autoregressive
- SOTA on reversal tasks, matching Llama 3 on others

(2/8) 🧵
February 18, 2025 at 3:05 PM
"LLaDA: Large Language Diffusion Models" Nie et al.

Just read this fascinating paper.

The authors scale Masked Diffusion Language Models up to 8B params and show they can match #LLMs (including Llama 3) while solving some key limitations!

Let's dive in... 🧵

(1/8)

#genai
February 18, 2025 at 3:05 PM
Technical highlights 🔍:
- Consistent Light Attention (CLA) module for stable lighting across frames
- Progressive Light Fusion for smooth temporal transitions
- Works with ANY video diffusion model (AnimateDiff, CogVideoX)
- Zero-shot - no fine-tuning needed!
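
My loose mental model of the Progressive Light Fusion part, as a step-dependent blend between the source video's appearance and the relit target (an illustration only, the paper's formulation differs in the details):

import torch

def progressive_light_fusion(src_latents, relit_latents, step, num_steps):
    # src_latents / relit_latents: (frames, c, h, w) predicted clean latents
    # the relit target is injected gradually across denoising steps,
    # so lighting changes smoothly while the source motion is preserved
    w = (step + 1) / num_steps
    return (1.0 - w) * src_latents + w * relit_latents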
February 16, 2025 at 4:27 PM
New work introduces a training-free method to relight entire videos while maintaining temporal consistency! 📽️🌅

"Light-A-Video: Training-free Video Relighting via Progressive Light Fusion" Zhou et al.

(1/n) 🧵

#genai #ai #research #video
February 16, 2025 at 4:26 PM
The authors claim the model generalizes well across diverse shapes, from humanoids to marine creatures, and that it works with real-world images & arbitrary poses! 🤩
February 15, 2025 at 1:06 PM
Technical highlights:
- BFS-ordered skeleton sequence representation
- Autoregressive joint prediction with diffusion sampling
- Hybrid attention masking: full self-attention for shape tokens, causal attention for skeleton
- e2e trainable pipeline without clustering/MST ops
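
To make the BFS-ordered representation concrete, here's a toy serialization (my own sketch, not the paper's exact tokenization): each joint is emitted after its parent, so the tree is trivially recoverable.

from collections import deque

def bfs_skeleton_sequence(joints, children, root=0):
    # joints: {joint_id: (x, y, z)}; children: {joint_id: [child joint ids]}
    # returns a list of (parent_index_in_sequence, joint_position) tuples
    seq, index_of = [], {}
    queue = deque([(root, -1)])          # root has no parent
    while queue:
        j, parent_idx = queue.popleft()
        index_of[j] = len(seq)
        seq.append((parent_idx, joints[j]))
        for c in children.get(j, []):
            queue.append((c, index_of[j]))
    return seq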
February 15, 2025 at 1:05 PM
Need to rig 3D models? 🦖

New work from UCSD and Adobe:
"RigAnything: Template-Free Autoregressive Rigging
for Diverse 3D Assets" Liu et al.

tl;dr: reduces rigging time from 2 mins to 2 secs, works on any shape category & doesn't need predefined templates! 🚀
February 15, 2025 at 1:05 PM
Technical approach:
- Correspondence-aware autoencoding to enhance 3D consistency in VAE latent space
- Builds 3D representations from 3D-aware 2D features
- VAE-Radiance Field alignment to bridge domain gap between latent and image space

#nerf #ai #research
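
A minimal sketch of what "correspondence-aware" could mean in practice: pull the latents of matched pixels across views together (my illustration, assuming precomputed matches on the latent grid):

import torch

def correspondence_loss(z_a, z_b, uv_a, uv_b):
    # z_a, z_b: (c, h, w) VAE latents of two views of the same scene
    # uv_a, uv_b: (n, 2) long tensors of matched (col, row) latent coordinates
    f_a = z_a[:, uv_a[:, 1], uv_a[:, 0]]   # (c, n) features at matches in view A
    f_b = z_b[:, uv_b[:, 1], uv_b[:, 0]]
    # 3D-consistent latents: the same surface point should encode alike
    return (f_a - f_b).pow(2).mean()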
February 14, 2025 at 10:28 AM
"Latent Radiance Fields with 3D-aware 2D Representations" Zhou et al., #ICLR2025

tl;dr: Novel framework that integrates 3D awareness into VAE latent space using correspondence-aware encoding, enabling high-quality rendered images with ~50% memory savings.

(1/n) 🧵
February 14, 2025 at 10:28 AM
The architecture uses a lightweight encoder and auto-regressive decoder to compress variable-length meshes into fixed-length codes, enabling point cloud and single-image conditioning.

Their ArAE model controls face count for varying detail while preserving mesh topology.
February 13, 2025 at 10:36 PM
"EdgeRunner" (#ICLR2025) from #Nvidia & PKU introduces an auto-regressive auto-encoder for mesh generation, supporting up to 4000 faces at 512³ resolution. 🤩

Their mesh tokenization algorithm (adapted from EdgeBreaker) achieves ~50% compression (4-5 tokens per face vs 9), making training efficient.
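
Back-of-the-envelope on why that compression matters (naive tokenization spends 3 vertices × 3 quantized coords per face):

faces = 4000
naive_tokens = faces * 9          # 9 tokens per face
edgerunner_tokens = faces * 4.5   # ~4-5 tokens per face per the paper
print(naive_tokens, edgerunner_tokens)  # 36000 vs ~18000

Halving the sequence length roughly quarters the attention cost, which is what makes training on 4000-face meshes practical.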
February 13, 2025 at 10:34 PM
Technical highlight: They combine 3D latent diffusion with multi-view conditioning for the base shape, then use 2D normal maps for refinement. The results look way cleaner than previous methods.
February 12, 2025 at 10:11 PM
Their two-stage approach: First generate coarse geometry (5s), then add fine details (20s) using normal-map-based refinement. Smart way to balance speed and quality.
February 12, 2025 at 10:10 PM
Just came across this fascinating paper "CraftsMan3D" - a practical approach to text/image-to-3D generation that mimics how artists actually work!

Code available (pretrained models too) 🤩: github.com/wyysf-98/Cra...

(1/n) 🧵
February 12, 2025 at 10:10 PM
Got me excited for a second here 🫠
February 10, 2025 at 12:12 PM
So, what happened this week in #AI?
January 29, 2025 at 12:57 PM
(3/n) ⚡️ Speed matters: GSLoc adds just ~180ms overhead while providing substantial accuracy gains. We also provide a GSLoc_rel variant for even faster refinement when runtime is critical.
January 23, 2025 at 11:53 AM
(2/n) 📈 Results: GSLoc achieves new SOTA on indoor datasets (7Scenes & 12Scenes) and significantly improves accuracy on Cambridge Landmarks. Our one-shot refinement outperforms methods requiring 50+ optimization steps!
January 23, 2025 at 11:52 AM
(1/n) 🔑 Key idea: We use 3DGS to render high-quality synthetic images & depth maps, enabling efficient one-shot pose refinement of existing APR and SCR methods. No need for iterative optimization or training specialized feature extractors!
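
A condensed sketch of the render-then-refine idea (my illustration; render_fn and match_fn stand in for the actual 3DGS renderer and 2D matcher):

import cv2
import numpy as np

def refine_pose(query_img, coarse_pose, render_fn, match_fn, K):
    # coarse_pose: (R, t) camera-to-world estimate from an APR/SCR method
    rgb, depth = render_fn(coarse_pose)              # synthetic view + depth from 3DGS
    pts_q, pts_r = match_fn(query_img, rgb)          # (n, 2) matched pixels, query <-> render
    # lift rendered matches to 3D using the rendered depth and intrinsics K
    z = depth[pts_r[:, 1].astype(int), pts_r[:, 0].astype(int)]
    x = (pts_r[:, 0] - K[0, 2]) * z / K[0, 0]
    y = (pts_r[:, 1] - K[1, 2]) * z / K[1, 1]
    R, t = coarse_pose
    pts3d_world = np.stack([x, y, z], axis=-1) @ R.T + t
    # a single PnP solve on query 2D <-> world 3D correspondences, no iterating
    ok, rvec, tvec, _ = cv2.solvePnPRansac(pts3d_world.astype(np.float32),
                                           pts_q.astype(np.float32), K, None)
    return rvec, tvec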
January 23, 2025 at 11:52 AM