Vincent Sitzmann
vincentsitzmann.bsky.social
Professor at MIT CSAIL, leading the scene representation group (scenerepresentations.com). We are teaching AI to understand the world through perceiving and interacting with it.
For more information, please see our paper arxiv.org/abs/2502.06764 and project website boyuan.space/history-guidance. All credit goes to my students Kiwhan Song (still in his undergrad!) and Boyuan Chen, as well as awesome collaborators Yilun Du, Max Simchowitz, and Russ Tedrake. (7/7)
History-Guided Video Diffusion
Classifier-free guidance (CFG) is a key technique for improving conditional generation in diffusion models, enabling more accurate control while enhancing sample quality. It is natural to extend this ...
arxiv.org
February 11, 2025 at 8:37 PM
We show that DFoT alone is already a competitive model, matching or beating industry SOTA models trained with far more compute than ours. Together with HG, it can stably roll out very long videos, stay robust to out-of-distribution context, and stitch sub-trajectories. (6/7)
February 11, 2025 at 8:37 PM
DFoT enables History Guidance (HG), a family of history-conditioned guidance methods that composes diffusion scores from different histories. From its simplest form to its most advanced variant, HG significantly enhances video diffusion and unlocks new abilities. (5/7)
February 11, 2025 at 8:37 PM
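To make "composes diffusion scores from different histories" concrete, here is a minimal sketch that assumes a CFG-style weighted combination: start from the score computed without history and add a weighted correction for each history. The function names `compose_scores` and `toy_score`, the weights, and the toy Gaussian score are all hypothetical illustrations, not the paper's actual HG variants.

```python
import numpy as np

def toy_score(x, history):
    # Hypothetical stand-in for a diffusion model's score:
    # a unit Gaussian centred at the mean of the history (0 if no history).
    mean = 0.0 if history is None else float(np.mean(history))
    return -(x - mean)

def compose_scores(score_fn, x, histories, weights):
    # Assumed CFG-style composition (not necessarily the paper's exact form):
    # s = s(x | no history) + sum_i w_i * (s(x | history_i) - s(x | no history))
    base = score_fn(x, None)
    guided = base.copy()
    for history, weight in zip(histories, weights):
        guided += weight * (score_fn(x, history) - base)
    return guided

x = np.zeros(3)
histories = [np.array([1.0, 1.0]), np.array([3.0])]
weights = [1.0, 0.5]
guided = compose_scores(toy_score, x, histories, weights)
```

With a single history and weight `w`, this reduces to the familiar CFG form `base + w * (conditional - base)`.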
Unlike previous methods, DFoT treats history and target frames alike, as tokens with different noise levels. DFoT trains diffusion with varying noise levels per frame. To conditionally sample, one simply masks out a portion of history with noise before computing the diffusion score. (4/7)
February 11, 2025 at 8:37 PM
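The per-frame noise idea can be sketched in a few lines. This is only an illustration under stated assumptions: noise levels live in [0, 1] (0 = clean, 1 = pure noise), noising uses a simple variance-preserving mix, and the helper names `noise_frames` and `sampling_noise_levels` are hypothetical, not from the paper's code.

```python
import numpy as np

def noise_frames(frames, levels, rng):
    # Apply an independent noise level to each frame (assumed schedule:
    # sqrt(1 - k) * frame + sqrt(k) * noise, with k in [0, 1]).
    k = levels.reshape(-1, *([1] * (frames.ndim - 1)))
    eps = rng.standard_normal(frames.shape)
    return np.sqrt(1.0 - k) * frames + np.sqrt(k) * eps

def sampling_noise_levels(num_frames, keep_history, target_level):
    # History frames we condition on stay clean (0.0); history frames we
    # mask out are set to pure noise (1.0); the remaining target frames
    # sit at the current sampling noise level.
    keep_history = np.asarray(keep_history, dtype=bool)
    levels = np.full(num_frames, float(target_level))
    levels[:keep_history.size] = np.where(keep_history, 0.0, 1.0)
    return levels

rng = np.random.default_rng(0)
frames = rng.standard_normal((6, 4, 4))  # 6 toy frames of size 4x4

# Training: every frame draws its own noise level.
train_levels = rng.uniform(0.0, 1.0, size=6)
noisy_train = noise_frames(frames, train_levels, rng)

# Sampling: 4 history frames, keep only the last two, 2 target frames.
levels = sampling_noise_levels(6, keep_history=[False, False, True, True], target_level=0.5)
noisy = noise_frames(frames, levels, rng)
```

Because the history subset is expressed purely through noise levels, the same model can be queried with any subset of history frames without changing its architecture.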
Can we train a single model to perform conditional diffusion with different portions of history - variable lengths, subsets of frames, and even different image-domain frequencies? Introducing DFoT, a simple yet flexible add-on that requires no architectural changes. (3/7)
February 11, 2025 at 8:37 PM
Classifier-free Guidance (CFG) has been widely used by video diffusion models to boost sample quality. However, researchers rarely perform CFG beyond the first frame. Our paper finds that an equally important conditioning variable, the history, is the long-ignored key. (2/7)
February 11, 2025 at 8:37 PM
Cool!
January 11, 2025 at 11:31 AM
Was great chatting with your students, cool work!!
December 12, 2024 at 4:45 PM
Wow, indeed!!
December 12, 2024 at 4:44 PM
Wow, what a warm welcome! Thanks, Kosta 🙃
November 23, 2024 at 12:58 AM