Initial seed noise matters. And you can optimize it **without** any backprop through your denoiser via good ol' linearization. Importantly, you need to do this in Fourier space.
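Here's a minimal zeroth-order sketch of the idea, assuming a frozen black-box `generate` sampler and a scalar `reward` (both hypothetical stand-ins; the paper's actual linearization will differ). The seed's Fourier coefficients are nudged along a finite-difference estimate of the reward slope, so no gradients ever flow through the denoiser:

```python
import torch

def fourier_seed_step(z, generate, reward, eps=0.05, lr=0.5):
    """One linearized update of the seed noise in Fourier space."""
    Z = torch.fft.fft2(z)                              # seed in Fourier space
    delta = torch.randn_like(Z)                        # random probe direction
    # Finite-difference slope of the reward along delta (the linearization)
    r_plus = reward(generate(torch.fft.ifft2(Z + eps * delta).real))
    r_minus = reward(generate(torch.fft.ifft2(Z - eps * delta).real))
    g = (r_plus - r_minus) / (2 * eps)
    Z = Z + lr * g * delta                             # ascend the linearized reward
    z_new = torch.fft.ifft2(Z).real
    return z_new * (z.std() / z_new.std())             # keep roughly unit-Gaussian stats
```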
When distilling vision foundation models with a focus on geometric consistency, insert a feed-forward Gaussian Splatting model in the middle.
Sliding window strategy for long sequences. Makes a lot of sense for practical applications -- uses 60 frames at a time with a 30-frame overlap, plus light loop closure.
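For concreteness, here is what that windowing looks like (a sketch with the numbers above; the per-window alignment and loop closure are where the real work happens):

```python
def sliding_windows(num_frames, window=60, stride=30):
    """Yield (start, end) chunks: 60-frame windows with 30 frames of overlap.
    The shared half-window is what lets per-window reconstructions be
    aligned to each other before light loop closure."""
    start = 0
    while start < num_frames:
        end = min(start + window, num_frames)
        yield start, end
        if end == num_frames:
            break
        start += stride

print(list(sliding_windows(150)))  # [(0, 60), (30, 90), (60, 120), (90, 150)]
```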
Simple, but seemingly effective idea: just randomly masking your diffusion supervision seems to lead to less overfitting (of course?). Not to be confused with masked diffusion -- this is purely a training-time trick.
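A minimal sketch of what this looks like as a loss, assuming per-pixel masking at a fixed ratio (the masking granularity and ratio here are my assumptions):

```python
import torch
import torch.nn.functional as F

def masked_diffusion_loss(pred, target, mask_ratio=0.5):
    """Drop a random subset of the denoising supervision. Note this masks
    the *loss* only -- unlike masked diffusion, the model still sees the
    full noisy latent as input."""
    keep = (torch.rand_like(target[:, :1]) > mask_ratio).float()  # (B,1,H,W) keep mask
    per_pixel = F.mse_loss(pred, target, reduction="none")
    return (per_pixel * keep).sum() / (keep.sum().clamp(min=1) * pred.shape[1])
```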
Who else likes a nice optimization paper? A Gaussian Splatting optimizer that approximates curvature using only the diagonal of the Hessian, estimated efficiently via Hutchinson's method.
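The estimator itself is easy to sketch (this is just Hutchinson's diagonal trick with Hessian-vector products, not the paper's full optimizer): for random +/-1 probes v, E[v * Hv] = diag(H), and each probe costs one double-backprop instead of a full Hessian.

```python
import torch

def hutchinson_hessian_diag(loss, params, num_probes=4):
    """Estimate diag(H) of `loss` w.r.t. `params` without ever forming H."""
    (grad,) = torch.autograd.grad(loss, params, create_graph=True)
    diag = torch.zeros_like(params)
    for _ in range(num_probes):
        v = torch.randint_like(params, 0, 2) * 2 - 1   # Rademacher (+/-1) probe
        (hvp,) = torch.autograd.grad(grad, params, grad_outputs=v, retain_graph=True)
        diag += v * hvp                                # E[v * Hv] = diag(H)
    return diag / num_probes
```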
Video models still suffer from 3D inconsistencies. Generate video -> VGGT -> DPO for better 3D consistency. My personal question: will it ever be perfect?
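Once VGGT gives you a 3D-consistency score to rank generations, the preference objective is just vanilla DPO. A sketch of that loss, assuming pairs are built from the most/least consistent samples per prompt (for diffusion models the log-probs are typically replaced by negative denoising losses, as in Diffusion-DPO):

```python
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO on (3D-consistent, 3D-inconsistent) video pairs; the reference
    model is frozen and beta controls how far the policy may drift from it."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(beta * margin).mean()
```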
Mean flows, but now in pixel space. Single-step generation with raw pixels has come a long way ;)
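For reference, the MeanFlow recipe in a nutshell (a sketch of my understanding, assuming a linear path z_t = (1-t)x + t*eps; `u_net(z, r, t)` predicts the *average* velocity over [r, t], and its target comes from the MeanFlow identity u = v - (t - r) du/dt, computed with one JVP):

```python
import torch
from torch.func import jvp

def meanflow_loss(u_net, x, eps):
    b = x.shape[0]
    t = torch.rand(b, device=x.device)
    r = torch.rand(b, device=x.device) * t                 # r <= t
    tb = t.view(-1, 1, 1, 1)
    z_t = (1 - tb) * x + tb * eps                          # linear noising path
    v_t = eps - x                                          # instantaneous velocity
    # Total derivative du/dt along (dz/dt, dr/dt, dt/dt) = (v_t, 0, 1)
    u, dudt = jvp(u_net, (z_t, r, t), (v_t, torch.zeros_like(r), torch.ones_like(t)))
    target = (v_t - (t - r).view(-1, 1, 1, 1) * dudt).detach()
    return ((u - target) ** 2).mean()
```

One network evaluation then jumps straight from noise to image: x ≈ eps - u_net(eps, 0, 1).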
Training-free method to "fix" 3DGS with diffusion. Render novel views --> SDEdit + guidance --> refine.
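The SDEdit core is simple enough to sketch with a toy linear noise schedule and a generic one-step `denoise_fn` placeholder (real use plugs in the guided diffusion sampler, and the cleaned render then supervises the splat refinement):

```python
import torch

def sdedit(image, denoise_fn, strength=0.4, num_steps=50):
    """Push a rendered view part-way up the noise schedule, then denoise back.
    strength in (0, 1]: how far to jump (larger = stronger edits)."""
    start = int(strength * num_steps)
    sigma = start / num_steps                      # toy linear noise level
    x = (1 - sigma) * image + sigma * torch.randn_like(image)
    for t in reversed(range(start)):               # only the partial chain
        x = denoise_fn(x, t)
    return x
```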
Improved version of VGGT-based SLAM. What I find really interesting is Layer 22 -- it shows correspondences and can be used to test for overlaps!
Lots of videos have moments where the camera "refocuses". Natural that video models can be used to refocus images ;)
Layout and trajectory-controlled video generator to create video loops from a single image. Neat application, well-engineered.
Keyframes have been a critical idea for SLAM. How you extract and use them still matters in the era of VGGT. A Reinforcement Learning attempt at it.
CUT3R latents adapted to be used together with Video DiT. 3D models seem to be quite useful for rendering things properly in 3D ;)
Segment your latent into regions, choose which region(s) to denoise based on a "complexity" heuristic, then update the rest using past estimates. I.e., each pixel gets its own denoising schedule.
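A sketch of one such step, assuming a patch-level complexity score and a placeholder `denoiser` (a real implementation would route compute only to the selected regions instead of masking a full prediction):

```python
import torch

def regionwise_step(denoiser, latent, t, complexity, top_frac=0.25, patch=8):
    """Denoise only the most 'complex' patches at this step; elsewhere, keep
    the previous estimate -- i.e., each region advances on its own schedule.
    complexity: (B, H//patch, W//patch) heuristic score per patch."""
    pred = denoiser(latent, t)                                   # full-step prediction
    k = max(1, int(top_frac * complexity[0].numel()))
    cutoff = complexity.flatten(1).topk(k, dim=1).values[:, -1]  # per-sample threshold
    mask = (complexity >= cutoff.view(-1, 1, 1)).float()
    mask = mask.repeat_interleave(patch, 1).repeat_interleave(patch, 2).unsqueeze(1)
    return mask * pred + (1 - mask) * latent                     # update chosen regions only
```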
From which data do video models learn different types of motion? Finding this, via backtracking gradients, enables data curation and fine-tuning models toward "better" motion.
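The simplest version of such gradient tracing is sketched below (plain gradient dot products between a motion "probe" loss and per-batch training gradients; the paper's estimator may well be more sophisticated):

```python
import torch

def influence_scores(model, loss_fn, probe_batch, train_batches):
    """Score each training batch by how well its gradient aligns with the
    gradient of a probe loss that isolates the motion of interest. High
    alignment = training on that data pushes the model toward that motion."""
    params = [p for p in model.parameters() if p.requires_grad]
    probe_grad = torch.autograd.grad(loss_fn(model, probe_batch), params)
    scores = []
    for batch in train_batches:
        g = torch.autograd.grad(loss_fn(model, batch), params)
        scores.append(sum((a * b).sum() for a, b in zip(probe_grad, g)).item())
    return scores  # one influence score per training batch
```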
Curated dataset (with VLMs, etc.) to use "frames" as thought chains for text-to-image generation.
VGGT + time-conditioned point map estimates. Similar to MonST3R, but with VGGT. Trained to map to a canonical view, and static points live at a "canonical time".
"Memory" for video models via point cloud-conditioned video generation -- an updatable spatial memory. I am obviously still biased towards having this "explicit" 3D stuff.
Train a DiT to recover the full image from point cloud rasters. Is 3D "cueing" all we need? In a similar spirit to other works that "fix" rough 3D renders.
Flying pixels in DPT-based models come from the fact that DPT modules are convolutional. Introducing MoEs lets you circumvent that. So... sort of like bilateral filtering?
Visual localization and image matching are always on my radar. Even with "modern" methods, perhaps we still want traditional image-based techniques tied together well.
Power of ViTs (DINOv3) + Neural Field decoder for resolution-free depth estimates.
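The decoder side is easy to sketch: bilinearly sample frozen ViT patch features at arbitrary continuous coordinates and decode each query point with a small MLP, so depth can be queried at any resolution. Layer sizes here are illustrative, not the paper's:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NeuralFieldDepthHead(nn.Module):
    def __init__(self, feat_dim=768, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 2, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, feats, coords):
        # feats: (B, C, Hp, Wp) ViT patch features; coords: (B, N, 2) in [-1, 1]
        sampled = F.grid_sample(feats, coords.unsqueeze(1), align_corners=False)
        sampled = sampled.squeeze(2).permute(0, 2, 1)                   # (B, N, C)
        return self.mlp(torch.cat([sampled, coords], -1)).squeeze(-1)  # (B, N) depth
```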
A lot. Which makes sense, given that the way 3D foundation models are trained is VERY similar to what video models would see.
Fine-tune a video model to generate the video that would've produced the blurry image. So "live" photos from motion blur, I guess? Neat :)
Paper argues that compositional generalization is infeasible in a pure encoder setup, whereas with a decoder it's easy. Not sure about infeasible, but it's certainly easier with a decoder.
Scene Coordinate Regression networks have shown quite impressive efficiency. Now, here's how you do SLAM with them. Not as accurate, but MUCH leaner.
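The appeal is how little machinery localization needs once the network predicts scene coordinates directly: it reduces to PnP + RANSAC on (pixel, predicted 3D point) pairs, with no feature matching and no stored map beyond the weights. A minimal sketch with OpenCV:

```python
import numpy as np
import cv2

def scr_localize(scene_coords, pts2d, K):
    """scene_coords: (N, 3) network-predicted 3D points for pixels pts2d (N, 2);
    K: 3x3 intrinsics. Returns the camera pose (R, t) or None on failure."""
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        scene_coords.astype(np.float64), pts2d.astype(np.float64), K, None,
        reprojectionError=3.0, iterationsCount=200)
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)
    return R, tvec
```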