Kwang Moo Yi
@kmyid.bsky.social
Assistant Professor of Computer Science at the University of British Columbia. I also post my daily finds on arxiv.
Wang et al., "Seeds of Structure: Patch PCA Reveals Universal Compositional Cues in Diffusion Models"
The existence of single- (few-) step denoisers, and many recent works, hinted at this, but here is another one: you can decode the image structure from the initial noise fairly easily. A toy sketch of the patch-PCA idea below.
November 10, 2025 at 7:42 PM
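A minimal sketch of what patch PCA on the seed noise could look like; the latent size, patch size, and the use of the first principal component are my assumptions, not the paper's exact recipe.

# Hypothetical sketch: project patches of the initial noise onto their
# principal components to look for coarse compositional structure.
import numpy as np

rng = np.random.default_rng(0)
noise = rng.standard_normal((64, 64, 4))       # assumed latent-sized seed noise
p = 8                                          # assumed patch size

# Cut the noise into non-overlapping p x p patches and flatten each one.
patches = noise.reshape(64 // p, p, 64 // p, p, 4).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(-1, p * p * 4)

# PCA via SVD of the centered patch matrix.
centered = patches - patches.mean(axis=0, keepdims=True)
_, _, vt = np.linalg.svd(centered, full_matrices=False)

# First principal component per patch gives a coarse 8x8 "structure map".
structure_map = (centered @ vt[0]).reshape(64 // p, 64 // p)
print(structure_map.shape)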
Ren and Wen et al., "FastGS: Training 3D Gaussian Splatting in 100 Seconds"
I like simple ideas -- this one says you should consider multiple views when you prune/clone, which allows fewer Gaussians to be used for training. A rough sketch of the pruning flavor below.
November 7, 2025 at 6:32 PM
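A rough sketch of the multi-view flavor of pruning, assuming an accumulated per-view importance score (e.g. blending weight); the paper's actual criterion and the cloning side are not reproduced here.

# Hypothetical sketch of multi-view-aware pruning: only drop a Gaussian if it
# is unimportant in *all* sampled views, not just the current one.
import torch

def prune_mask(importance_per_view: torch.Tensor, thresh: float = 0.1) -> torch.Tensor:
    """importance_per_view: (num_views, num_gaussians) score, e.g. accumulated
    alpha-blending weight of each Gaussian in each view (my assumption)."""
    max_over_views = importance_per_view.max(dim=0).values
    return max_over_views < thresh             # True -> safe to prune

scores = torch.rand(3, 10_000)                 # 3 views, 10k Gaussians (toy numbers)
mask = prune_mask(scores)
print(int(mask.sum()), "Gaussians would be pruned")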
Nguyen et al., "IBGS: Image-Based Gaussian Splatting"
Gaussian Splatting + Image-based rendering (i.e., copy things over directly from nearby views). When your Gaussians cannot describe highlights, let your nearby images guide you.
November 6, 2025 at 9:45 PM
Gao and Mao et al., "Seeing the Wind from a Falling Leaf"
Extract Dynamic 3D Gaussians for an object -> Vision Language Models to extract physics parameters -> model force field (wind). Leads to some fun.
November 5, 2025 at 5:31 PM
Zhou et al., "PAGE-4D: Disentangled Pose and Geometry Estimation for 4D Perception"
VGGT extended to dynamic scenes with a dynamic mask predictor.
November 4, 2025 at 8:17 PM
Pfrommer et al., "Is Your Diffusion Model Actually Denoising?"
Apparently not. This matches the empirical behavior of these models that I have experienced. These models are approximate, but then what actually are their mathematical properties?
November 3, 2025 at 7:28 PM
Tesfaldet et al., "Generative Point Tracking with Flow Matching"
Tracking, waaaaaay back in the day, used to be solved using sampling methods. They are now back. Also reminds me of my first major conference work, where I looked into how much impact the initial target point has.
October 31, 2025 at 6:42 PM
Stary and Gaubil et al., "Understanding multi-view transformers"
We use Dust3r as a black box. This work looks under the hood at what is going on. The internal representations seem to "iteratively" refine towards the final answer. Quite similar to what goes on in point cloud net
October 30, 2025 at 9:00 PM
Goren and Yehezkel et al., "Visual Diffusion Models are Geometric Solvers"
Note: this paper does not claim that diffusion models are better; in fact, specialized models are. It just shows the potential of using diffusion models to solve geometric problems.
October 29, 2025 at 9:28 PM
Luo et al., "Self-diffusion for Solving Inverse Problems"
Pretty much a deep image prior for denoising models. Without ANY data, with a single image, you can train a denoiser via diffusion training, and it just magically learns to solve inverse problems. A toy single-image training loop below.
October 27, 2025 at 6:59 PM
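A toy sketch of single-image denoiser training, deep-image-prior style; the tiny conv net, noise range, and loss are placeholders for whatever the paper actually uses.

# Hypothetical single-image "self-diffusion" training loop: a small network is
# trained from scratch to denoise noisy copies of one image at random noise
# levels (the real paper's schedule, loss, and architecture may differ).
import torch
import torch.nn as nn

image = torch.rand(1, 3, 64, 64)               # the only training datum
net = nn.Sequential(                           # toy denoiser stand-in for a UNet
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 3, 3, padding=1),
)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(200):
    sigma = torch.rand(1) * 0.8 + 0.05         # random noise level per step
    noisy = image + sigma * torch.randn_like(image)
    loss = ((net(noisy) - image) ** 2).mean()  # predict the clean image back
    opt.zero_grad(); loss.backward(); opt.step()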
Bai et al., "Positional Encoding Field"
Make your RoPE encoding 3D by including a z axis, then manipulate your image by simply manipulating your positional encoding in 3D --> novel view synthesis. Neat idea. A rough sketch of 3D RoPE below.
October 24, 2025 at 6:20 PM
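A rough sketch of what rotary encodings over (x, y, z) could look like; the frequency schedule and the even three-way split of the feature dimension are my assumptions, not the paper's.

# Hypothetical 3D RoPE sketch: split the head dimension into x/y/z groups and
# rotate each group by its own coordinate; moving tokens along z then acts
# like a (crude) depth/view change.
import torch

def rope_1d(x: torch.Tensor, pos: torch.Tensor) -> torch.Tensor:
    """x: (..., d) with d even, pos: (...) positions for this axis."""
    d = x.shape[-1]
    freqs = 1.0 / (10000 ** (torch.arange(0, d, 2) / d))      # (d/2,)
    ang = pos[..., None] * freqs                               # (..., d/2)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    return torch.stack(
        [x1 * ang.cos() - x2 * ang.sin(),
         x1 * ang.sin() + x2 * ang.cos()], dim=-1).flatten(-2)

def rope_3d(x: torch.Tensor, xyz: torch.Tensor) -> torch.Tensor:
    """x: (n, d) token features with d divisible by 6, xyz: (n, 3)."""
    d = x.shape[-1] // 3
    parts = [rope_1d(x[:, i*d:(i+1)*d], xyz[:, i]) for i in range(3)]
    return torch.cat(parts, dim=-1)

tokens = torch.randn(16, 96)
coords = torch.randn(16, 3)                    # x, y and a synthetic z per token
print(rope_3d(tokens, coords).shape)           # (16, 96)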
Mao et al., "PoseCrafter: Extreme Pose Estimation with Hybrid Video Synthesis"
While not perfect, video models do an okay job of creating novel views. Use them to "bridge" between extreme views for pose estimation.
October 23, 2025 at 6:48 PM
Choudhury and Kim et al., "Accelerating Vision Transformers With Adaptive Patch Sizes"
Transformer patches don't need to be of uniform size -- choose sizes based on entropy --> faster training/inference. Are scale-spaces gonna make a comeback? A toy entropy-based splitting sketch below.
October 22, 2025 at 8:08 PM
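A toy version of entropy-driven patch sizing: keep one coarse patch where the local histogram entropy is low, split where it is high. The actual entropy measure, thresholds, and patch hierarchy in the paper are surely different.

# Hypothetical sketch: split a 32x32 block into 16x16 patches only when its
# intensity-histogram entropy is high; otherwise keep one coarse patch.
import numpy as np

def entropy(block: np.ndarray, bins: int = 32) -> float:
    hist, _ = np.histogram(block, bins=bins, range=(0.0, 1.0))
    p = hist / max(hist.sum(), 1)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def patchify(img: np.ndarray, coarse: int = 32, thresh: float = 3.0):
    """Return a list of (y, x, size) patches for a grayscale image in [0, 1]."""
    patches = []
    for y in range(0, img.shape[0], coarse):
        for x in range(0, img.shape[1], coarse):
            block = img[y:y+coarse, x:x+coarse]
            if entropy(block) < thresh:
                patches.append((y, x, coarse))             # flat region: 1 big patch
            else:
                half = coarse // 2                         # detailed region: split
                patches.extend((y+dy, x+dx, half)
                               for dy in (0, half) for dx in (0, half))
    return patches

img = np.zeros((64, 64))
img[:, 32:] = np.random.rand(64, 32)           # flat left half, textured right half
print(len(patchify(img)), "patches instead of", (64 // 16) ** 2)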
Riise et al., "Visual Autoregressive Models Beat Diffusion Models on Inference Time Scaling"
Beam search over autoregressive image generators, guided by verifiers. A generic sketch of the loop below.
October 21, 2025 at 7:42 PM
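A generic verifier-guided beam search loop; `propose` and `verify` are stand-ins for a real autoregressive image model and a real verifier, not any specific API.

# Generic sketch of verifier-guided beam search over an autoregressive image
# generator. Both callables below are stubs for illustration only.
import random

def propose(prefix, k):
    """Return k candidate continuations of a token prefix (stubbed out)."""
    return [prefix + [random.randint(0, 1023)] for _ in range(k)]

def verify(seq):
    """Score a (partial) token sequence; higher is better (stubbed out)."""
    return -abs(sum(seq) % 7)

def beam_search(steps=16, beam=4, expand=8):
    beams = [[]]
    for _ in range(steps):
        candidates = [c for b in beams for c in propose(b, expand)]
        beams = sorted(candidates, key=verify, reverse=True)[:beam]
    return beams[0]

print(len(beam_search()))                      # best token sequence after 16 steps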
Hakie and Lu et al., "Fix False Transparency by Noise Guided Splatting"
Pretty cute idea -- Gaussian splats are often transparent, although we don't want them to be. So, just fill your splats in with noise during optimization to make them non-transparent.
October 20, 2025 at 9:35 PM
Alzayer et al., "Coupled Diffusion Sampling for Training-Free Multi-View Image Editing"
You can "guide" diffusion models with different purposes by "coupling them". Our group did simply weighted averaging without math in vivid-123, but this is much more sound!
You can "guide" diffusion models with different purposes by "coupling them". Our group did simply weighted averaging without math in vivid-123, but this is much more sound!
October 17, 2025 at 7:00 PM
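A toy illustration of the naive version (weighted-averaging two denoisers' noise predictions each step); the paper's coupled sampling is more principled than this, and both denoisers here are placeholders.

# Toy sketch of combining two denoisers during sampling by averaging their
# noise predictions. `eps_a`/`eps_b` stand in for real diffusion models.
import torch

def eps_a(x, t): return 0.1 * x                # stand-in: e.g. a multi-view model
def eps_b(x, t): return -0.05 * x              # stand-in: e.g. an editing model

def coupled_step(x, t, dt, w=0.5):
    eps = w * eps_a(x, t) + (1 - w) * eps_b(x, t)
    return x - dt * eps                        # simplistic Euler-style update

x = torch.randn(1, 3, 32, 32)
for i in range(50):
    x = coupled_step(x, t=1.0 - i / 50, dt=1.0 / 50)
print(x.std())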
Bruns et al., "ACE-G: Improving Generalization of Scene Coordinate Regression Through Query Pre-Training"
Train a scene coordinate regressor with "map codes" (i.e., trainable inputs) so that you can train one generalizable regressor. Then, find these "map codes" to localize. A rough sketch of the map-code idea below.
October 16, 2025 at 7:37 PM
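A rough sketch of conditioning a scene coordinate regressor on a trainable map code and fitting only that code for a new scene; sizes, architecture, and the training protocol are guesses, not the paper's.

# Hypothetical sketch: a shared regressor (pretrained across scenes, frozen
# here) maps image features + a per-scene "map code" to 3D scene coordinates;
# only the code is optimized for a new scene.
import torch
import torch.nn as nn

class CodedRegressor(nn.Module):
    def __init__(self, feat_dim=128, code_dim=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + code_dim, 256), nn.ReLU(),
            nn.Linear(256, 3),                   # 3D scene coordinate per feature
        )
    def forward(self, feats, code):
        code = code.expand(feats.shape[0], -1)
        return self.mlp(torch.cat([feats, code], dim=-1))

reg = CodedRegressor()
for p in reg.parameters():
    p.requires_grad_(False)                      # shared regressor stays fixed
map_code = nn.Parameter(torch.zeros(1, 64))      # the only thing fit per scene
opt = torch.optim.Adam([map_code], lr=1e-2)

feats = torch.randn(1024, 128)                   # image features (placeholder)
target_xyz = torch.randn(1024, 3)                # supervision, e.g. from SfM (placeholder)
for _ in range(100):
    loss = ((reg(feats, map_code) - target_xyz) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()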
Shrivastava and Mehta et al., "Point Prompting: Counterfactual Tracking with Video Diffusion Models"
Put a red dot where you want to track, and SDEdit the video with a video model --> zero-shot point tracking. Not as good as supervised ones, but zero-shot! A toy version of the trick below.
October 15, 2025 at 6:39 PM
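A toy version of the stamp-a-dot-and-find-it-again loop; `regenerate` is a placeholder for the actual SDEdit pass with a video diffusion model, and the "reddest pixel" readout is my simplification.

# Toy sketch of the "red dot" trick: stamp a marker on the query point, let a
# video model re-render the clip (stubbed out here), then locate the reddest
# pixel in each regenerated frame.
import numpy as np

def stamp_dot(frame, y, x, r=3):
    f = frame.copy()
    f[max(0, y-r):y+r, max(0, x-r):x+r] = [1.0, 0.0, 0.0]   # pure red square
    return f

def regenerate(video):
    return video                                  # placeholder for SDEdit

def track(video, y, x):
    video = video.copy()
    video[0] = stamp_dot(video[0], y, x)
    out = regenerate(video)
    redness = out[..., 0] - out[..., 1:].mean(-1) # how red is each pixel
    flat = redness.reshape(len(out), -1).argmax(1)
    return np.stack(np.unravel_index(flat, out.shape[1:3]), axis=1)

video = np.random.rand(8, 64, 64, 3) * 0.5
print(track(video, 20, 30))                       # per-frame (y, x) estimates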
Yuan et al., "LikePhys: Evaluating intuitive physics understanding in video diffusion models via likelihood preference"
I will keep promoting physics benchmark papers for video models until people stop claiming world models :) tl;dr -- Still not there yet.
October 14, 2025 at 6:32 PM
Xu et al., "ReSplat: Learning Recurrent Gaussian Splats"
Feed-forward Gaussian Splatting + Learned Corrector = Fast high-quality reconstruction. Uses global + kNN attention. Reminds me of pointnet++
October 10, 2025 at 7:23 PM
Xu and Lin et al., "Pixel-Perfect Depth with Semantics-Prompted Diffusion Transformers"
Append foundational features at the later stages when doing Marigold-like denoising to get monocular depth. Simple, straightforward idea that works.
October 9, 2025 at 8:14 PM
Bamberger and Jones et al., "Carré du champ flow matching: better quality-generalisation tradeoff in generative models"
Geometric regularization of the flow manifold. Boils down to adding anisotropic Gaussian noise to flow matching training. Neat idea, enhances generalization. A toy training step below.
October 8, 2025 at 6:27 PM
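A toy flow-matching training step with anisotropic noise injected into the interpolant, just to show where the perturbation enters; how the noise covariance is actually chosen is the paper's contribution and is faked here with a fixed diagonal.

# Sketch of a flow-matching step on 2D toy data, with anisotropic Gaussian
# noise added to the interpolated sample before the velocity regression.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(2 + 1, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def step(x1):
    x0 = torch.randn_like(x1)                     # noise endpoint
    t = torch.rand(x1.shape[0], 1)
    xt = (1 - t) * x0 + t * x1                    # linear interpolant
    cov_sqrt = torch.diag(torch.tensor([0.10, 0.01]))   # anisotropic (assumed)
    xt = xt + torch.randn_like(xt) @ cov_sqrt     # perturb the interpolant
    v_target = x1 - x0                            # target velocity
    v_pred = model(torch.cat([xt, t], dim=-1))
    loss = ((v_pred - v_target) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

data = torch.randn(256, 2) * torch.tensor([3.0, 0.3])   # toy 2D dataset
for _ in range(100):
    step(data)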
Yugay and Nguyen et al., “Visual Odometry with Transformers”
Instead of point maps, you can also directly output poses. This used to be much less accurate, but now it's the opposite. Simple architecture that directly predicts camera embeddings, which then regress rotation and translation. A minimal pose-head sketch below.
October 7, 2025 at 4:20 PM
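A minimal sketch of a pose head that regresses rotation (as a unit quaternion) and translation from a per-frame camera embedding; the parameterization and dimensions are my assumptions, and the transformer that produces the camera token is omitted.

# Minimal pose head: two small linear heads read rotation and translation off
# a learned per-frame camera embedding.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PoseHead(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.rot = nn.Linear(dim, 4)             # quaternion
        self.trans = nn.Linear(dim, 3)
    def forward(self, cam_token):
        q = F.normalize(self.rot(cam_token), dim=-1)   # unit quaternion
        t = self.trans(cam_token)
        return q, t

cam_token = torch.randn(2, 256)                  # per-frame camera embeddings
q, t = PoseHead()(cam_token)
print(q.shape, t.shape)                          # (2, 4) (2, 3)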
Chen et al., "TTT3R: 3D Reconstruction as Test-Time Training"
Cut3R + gated updates for states (test-time training layers) = the fast/efficient performance of cut3r, but with high-quality estimates. The gating arithmetic in isolation below.
October 6, 2025 at 5:27 PM
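The gated state update in isolation, as I read it: a learned, data-dependent gate blends the running state with the new update. This is a simplification; the actual test-time-training layer is richer.

# Blend the running state with the new observation using a learned gate.
import torch
import torch.nn as nn

class GatedState(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)
    def forward(self, state, update):
        g = torch.sigmoid(self.gate(torch.cat([state, update], dim=-1)))
        return g * update + (1 - g) * state       # convex blend, per channel

layer = GatedState()
state = torch.zeros(1, 128)
for _ in range(10):                               # stream of per-frame updates
    state = layer(state, torch.randn(1, 128))
print(state.shape)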
Two today: Kim et al., "How Diffusion Models Memorize" and Song and Kim et al., "Selective Underfitting in Diffusion Models"
A deep dive into how memorization and generalization happen in diffusion models. Still trying to digest what these mean. Thought-provoking.
October 3, 2025 at 6:37 PM