Kwang Moo Yi
@kmyid.bsky.social
Assistant Professor of Computer Science at the University of British Columbia. I also post my daily finds on arXiv.
Pinned
Baek et al., "SONIC: Spectral Optimization of Noise for Inpainting with Consistency"

Initial seed noise matters. And you can optimize it **without** any backprop through your denoiser via good ol' linearization. Importantly, you need to do this in Fourier space.
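If I sketch my reading of it in code -- very much a toy, with a finite-difference probe standing in for the paper's linearization, and `D`/`loss_fn` as placeholder names for a frozen denoiser and a scalar consistency loss:

```python
import torch

def spectral_noise_opt(D, z, loss_fn, steps=50, lr=0.1, eps=1e-3):
    """Tune seed noise in Fourier space with no autograd through D."""
    Z = torch.fft.fft2(z)                       # optimize spectral coefficients
    for _ in range(steps):
        z_cur = torch.fft.ifft2(Z).real
        base = loss_fn(D(z_cur))                # plain forward pass
        d = torch.randn_like(z_cur)             # random probe direction
        # local linearity: D(z + eps*d) ~ D(z) + eps * J d, so one extra
        # forward pass gives a directional derivative of the loss
        dl = (loss_fn(D(z_cur + eps * d)) - base) / eps
        Z = Z - lr * dl * torch.fft.fft2(d)     # step along the probe, spectrally
    return torch.fft.ifft2(Z).real
```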
Shavin and Benaim, "Splat and Distill: Augmenting Teachers with Feed-Forward 3D Reconstruction For 3D-Aware Distillation"

When distilling vision foundation models with a focus on geometric consistency, insert a feed-forward Gaussian Splatting model in the middle.
February 6, 2026 at 8:20 PM
Antsfeld et al., "S-MUSt3R: Sliding Multi-view 3D Reconstruction"

Sliding-window strategy for long sequences. Makes a lot of sense for practical applications -- uses 60 frames at a time with a 30-frame overlap, plus light loop closure.
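The scheduling itself is simple enough to sketch; `window`/`overlap` follow the numbers above, and the per-window model is whatever you plug in:

```python
def sliding_windows(frames, window=60, overlap=30):
    """Yield 60-frame chunks with 30 frames of overlap between neighbors."""
    stride = window - overlap
    for start in range(0, max(len(frames) - overlap, 1), stride):
        yield frames[start:start + window]

# for chunk in sliding_windows(video_frames):
#     partial = reconstruct(chunk)   # fuse overlapping halves downstream
```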
February 5, 2026 at 6:11 PM
Zhao et al., "Sparsely Supervised Diffusion"

Simple, but seemingly effective idea: just randomly masking your diffusion supervision seems to lead to less overfitting (of course?). Not to be confused with masked diffusion -- the masking here applies only to the training loss.
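A minimal sketch of the masking as I read it -- `model` and `alpha_bar` (the cumulative noise schedule) are placeholders, not the paper's API:

```python
import torch

def sparse_diffusion_loss(model, x0, t, alpha_bar, keep_prob=0.5):
    noise = torch.randn_like(x0)
    a = alpha_bar[t].view(-1, 1, 1, 1)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise      # standard forward process
    pred = model(x_t, t)                              # noise prediction
    mask = (torch.rand_like(x0) < keep_prob).float()  # random supervision mask
    # supervise only the kept pixels; renormalize to keep the loss scale
    return (mask * (pred - noise) ** 2).sum() / mask.sum().clamp(min=1)
```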
February 4, 2026 at 8:57 PM
Hsiao et al., "3DGS^2-TR: Scalable Second-Order Trust-Region Method for 3D Gaussian Splatting"

Who else likes a nice optimization paper? A Gaussian Splatting optimizer that approximates curvature using only the diagonal of the Hessian, estimated efficiently via Hutchinson's method.
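The Hutchinson trick is generic and worth knowing: for Rademacher v, E[v * Hv] = diag(H), so a few Hessian-vector products give you the diagonal. A sketch of the generic estimator, not the paper's trust-region machinery:

```python
import torch

def hessian_diag(loss, params, n_samples=8):
    grads = torch.autograd.grad(loss, params, create_graph=True)
    diag = [torch.zeros_like(p) for p in params]
    for _ in range(n_samples):
        vs = [torch.randint_like(p, 0, 2) * 2.0 - 1.0 for p in params]  # +/-1
        hvps = torch.autograd.grad(grads, params, grad_outputs=vs,
                                   retain_graph=True)
        for dd, v, hvp in zip(diag, vs, hvps):
            dd += v * hvp / n_samples    # Monte Carlo average of v * Hv
    return diag
```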
February 3, 2026 at 9:58 PM
Du, Ye, Cong et al., "VideoGPA: Distilling Geometry Priors for 3D-Consistent Video Generation"

Video models still suffer from 3D inconsistencies. Generate video -> VGGT -> DPO for better 3D consistency. My personal question is: will it ever become perfect?
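The preference step, as I read the pipeline, is standard DPO with a 3D-consistency score (e.g., from VGGT) picking the winner; a minimal sketch with placeholder log-probs:

```python
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Push the policy toward the 3D-consistent ('winner') video."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(margin).mean()
```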
February 2, 2026 at 9:55 PM
Lu et al., "One-step Latent-free Image Generation with Pixel Mean Flows"

Mean flows, but now in pixel space. Single-step generation with raw pixels has come a long way ;)
January 30, 2026 at 9:34 PM
Zhou et al., "FreeFix: Boosting 3D Gaussian Splatting via Fine-Tuning-Free Diffusion Models"

Training-free method to "fix" 3DGS with diffusion. Render novel views --> SDEdit + guidance --> refine.
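The loop is simple enough to sketch; `render`, `sdedit`, and `refit` are placeholders for a 3DGS rasterizer, an SDEdit-style partial-noise round trip, and 3DGS re-optimization:

```python
def freefix_step(gaussians, cameras, render, sdedit, refit, strength=0.4):
    views = [render(gaussians, cam) for cam in cameras]    # artifact-laden renders
    fixed = [sdedit(v, strength=strength) for v in views]  # diffusion clean-up
    return refit(gaussians, cameras, targets=fixed)        # refit GS to fixed views
```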
January 29, 2026 at 8:28 PM
Maggio and Carlone, "VGGT-SLAM 2.0: Real time Dense Feed-forward Scene Reconstruction"

Improved version of VGGT-based SLAM. What I find really interesting is layer 22 -- its features expose correspondences and can be used to test for overlap!
January 28, 2026 at 6:17 PM
Tedla et al., "Learning to Refocus with Video Diffusion Models"

Lots of videos have moments where the camera "refocuses". Natural, then, that video models can be used to refocus images ;)
January 27, 2026 at 7:00 PM
Mahapatra et al., "DreamLoop: Controllable Cinemagraph Generation from a Single Photograph"

Layout and trajectory-controlled video generator to create video loops from a single image. Neat application, well-engineered.
January 26, 2026 at 6:54 PM
Dai et al., "Keyframe-Based Feed-Forward Visual Odometry"

Keyframes have been a critical idea in SLAM. How you extract and use them still matters in the era of VGGT. This is a reinforcement-learning take on keyframe selection.
January 23, 2026 at 8:12 PM
Xie et al., "LaVR: Scene Latent Conditioned Generative Video Trajectory Re-Rendering using Large 4D Reconstruction Models"

CUT3R latents adapted for use with a video DiT. 3D models seem to be quite useful for rendering things properly in 3D ;)
January 22, 2026 at 8:25 PM
Lin et al., "SDiT: Semantic Region-Adaptive for Diffusion Transformers"

Segment your latent into regions, choose which region(s) to denoise based on a "complexity" heuristic, then update the rest using past estimates. I.e., each pixel ends up with its own denoising schedule.
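A toy version of the scheduling as I understand it -- `model`, `complexity`, and the region slices are all placeholders:

```python
import torch

def region_adaptive_step(model, latent, prev, regions, t, complexity, k=2):
    scores = torch.stack([complexity(latent[..., r]) for r in regions])
    active = scores.topk(min(k, len(regions))).indices.tolist()
    denoised = model(latent, t)     # one full pass; cheaper variants exist
    out = prev.clone()              # inactive regions keep their past estimate
    for i in active:
        out[..., regions[i]] = denoised[..., regions[i]]
    return out
```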
January 21, 2026 at 7:14 PM
Wu et al., "Motion Attribution for Video Generation"

From which data do video models learn different types of motion? Finding this, via backtracking gradients, enables data curation and fine-tuning of models toward "better" motion.
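My reading of "backtracking gradients" is influence-style attribution: score each training clip by how well its loss gradient aligns with the gradient for the motion you care about. A sketch with placeholder `model`/`loss_fn`:

```python
import torch

def attribution_score(model, loss_fn, query_clip, train_clip):
    params = [p for p in model.parameters() if p.requires_grad]
    g_q = torch.autograd.grad(loss_fn(model, query_clip), params)
    g_t = torch.autograd.grad(loss_fn(model, train_clip), params)
    return sum((a * b).sum() for a, b in zip(g_q, g_t))  # gradient dot-product
```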
January 20, 2026 at 11:28 PM
Tong and Chang et al., "CoF-T2I: Video Models as Pure Visual Reasoners for Text-to-Image Generation"

Curated dataset (built with a VLM, etc.) that uses video "frames" as chains of thought for text-to-image generation.
January 16, 2026 at 8:21 PM
Sucar and Insafutdinov et al., “V-DPM: 4D Video Reconstruction with Dynamic Point Maps”

VGGT + time-conditioned point-map estimates. Similar to MonST3R, but with VGGT. Trained to map to a canonical view, and static points live at a “canonical time”.
January 15, 2026 at 8:26 PM
Zhao and Wei et al., "Spatia: Video Generation with Updatable Spatial Memory"

"Memory" for video models via point cloud-conditioned video generation. I am obviously still biased towards having these "explicit" 3D stuff.
January 14, 2026 at 8:23 PM
Wu et al., "From Rays to Projections: Better Inputs for Feed-Forward View Synthesis"

Train a DiT to recover the full image from point-cloud rasters. Is 3D "cueing" all we need? In a similar spirit to other works that "fix" rough 3D renders.
January 12, 2026 at 7:36 PM
Wang et al., "MoE3D: A Mixture-of-Experts Module for 3D Reconstruction"

Flying pixels in DPT-based models come from the fact that DPT modules are convolutional. Introducing MoEs lets you circumvent that. So... sort of bilateral filtering?
January 9, 2026 at 7:24 PM
Jiang et al., "ImLoc: Revisiting Visual Localization with Image-based Representation"

Visual localization and image matching are always on my radar. Even with "modern" methods, perhaps we'd still want traditional image-based techniques, tied together well.
January 8, 2026 at 8:07 PM
Yu, Lin, and Wang et al., "InfiniDepth: Arbitrary-Resolution and Fine-Grained Depth Estimation with Neural Implicit Fields"

The power of ViTs (DINOv3) + a neural-field decoder for resolution-free depth estimates.
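The recipe, as I read it, in a sketch: sample the ViT feature grid at continuous coordinates and decode with a small MLP, so output resolution is decoupled from the feature grid. `backbone` and `mlp` are placeholders:

```python
import torch
import torch.nn.functional as F

def query_depth(backbone, mlp, image, coords):
    feats = backbone(image)                      # (B, C, h, w) feature grid
    grid = coords.unsqueeze(1)                   # (B, 1, N, 2), coords in [-1, 1]
    f = F.grid_sample(feats, grid, align_corners=False)  # bilinear lookup
    f = f.squeeze(2).transpose(1, 2)             # (B, N, C)
    return mlp(torch.cat([f, coords], dim=-1))   # per-point depth
```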
January 7, 2026 at 8:36 PM
Huang and Li et al., “How Much 3D Do Video Foundation Models Encode?”

A lot. Which makes sense, given that the way 3D foundation models are trained is VERY similar to what video models see.
January 7, 2026 at 8:35 PM
Tedla et al., "Generating the Past, Present and Future from a Motion-Blurred Image"

Fine-tune a video model to create the video that would have produced the blurry image. So "live" photos from motion blur, I guess? Neat :)
December 24, 2025 at 9:46 PM
Brady et al., "Generation is Required for Data-Efficient Perception"

The paper argues that compositional generalization is infeasible in a pure-encoder setup, whereas with a decoder it's easy. Not sure about infeasible, but it's certainly easier with a decoder.
December 23, 2025 at 10:18 PM
Alzugaray et al., "ACE-SLAM: Scene Coordinate Regression for Neural Implicit Real-Time SLAM"

Scene Coordinate Regression networks have shown quite impressive performance when it comes to efficiency. Now, here's how you do SLAM with them. Not as accurate, but MUCH leaner.
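For context, scene-coordinate-regression localization in a nutshell: the network maps pixels to 3D scene coordinates, and the pose falls out of PnP + RANSAC. `scr_net` is a placeholder returning an (H, W, 3) NumPy array:

```python
import numpy as np
import cv2

def localize(scr_net, image, K):
    xyz = scr_net(image)                              # (H, W, 3) scene coords
    h, w, _ = xyz.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pts2d = np.stack([u, v], -1).reshape(-1, 2).astype(np.float64)
    pts3d = xyz.reshape(-1, 3).astype(np.float64)
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(pts3d, pts2d, K, None)
    return rvec, tvec                                 # camera pose in the map
```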
December 22, 2025 at 8:34 PM