Thomas Wimmer
@wimmerthomas.bsky.social
PhD Candidate at the Max Planck ETH Center for Learning Systems working on 3D Computer Vision.

https://wimmerth.github.io
Try it out now! Code and model weights are public.

💻 Code: github.com/wimmerth/anyup

Great collaboration with Prune Truong, Marie-Julie Rakotosaona, Michael Oechsle, Federico Tombari, Bernt Schiele, and @janericlenssen.bsky.social!

CC: @cvml.mpi-inf.mpg.de @mpi-inf.mpg.de
AnyUp
Universal Feature Upsampling
wimmerth.github.io
October 16, 2025 at 9:07 AM
Generalization: AnyUp is the first learned upsampler that can be applied out-of-the-box to features from encoders it was not trained on, even if they have a different dimensionality.

In our experiments, we show that it matches encoder-specific upsamplers and that trends between different model sizes are preserved.
October 16, 2025 at 9:07 AM
When performing linear probing for semantic segmentation or normal and depth estimation, AnyUp consistently outperforms prior upsamplers.

Importantly, the upsampled features also stay faithful to the input feature space, as we show in experiments with pre-trained DINOv2 probes.
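
Roughly, the linear-probing setup looks like this (a minimal sketch with placeholder tensors; bilinear interpolation only stands in for the learned upsampler, this is not our actual evaluation code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder setup: frozen 768-dim encoder features at 16x16, probed at 224x224.
B, C, h, w, H, W = 2, 768, 16, 16, 224, 224
num_classes = 21

def upsample(feats, size):
    # Stand-in for a learned upsampler; bilinear keeps the example self-contained.
    return F.interpolate(feats, size=size, mode="bilinear", align_corners=False)

low_res = torch.randn(B, C, h, w)                  # frozen encoder output (not trained)
probe = nn.Conv2d(C, num_classes, kernel_size=1)   # linear probe = 1x1 conv on features

logits = probe(upsample(low_res, (H, W)))          # (B, num_classes, H, W)
labels = torch.randint(0, num_classes, (B, H, W))  # dummy per-pixel segmentation labels
loss = F.cross_entropy(logits, labels)
loss.backward()                                    # only the probe receives gradients
```

Because the upsampled features stay in the input feature space, the same probe can equally be trained on the low-resolution features and then applied to the upsampled ones.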
October 16, 2025 at 9:07 AM
AnyUp is a lightweight model that uses a feature-agnostic layer to obtain a canonical representation that is independent of the input dimensionality.

Together with window-attention-based upsampling, a new training pipeline, and consistency regularization, we achieve SOTA results.
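
To give an intuition for the "feature-agnostic" part: one way to build an upsampler whose parameters never depend on the channel count is to compute attention logits from pairwise similarities inside a local window and output a weighted combination of the original features. A hypothetical minimal sketch (not the actual AnyUp architecture):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WindowAttentionUpsampler(nn.Module):
    """Sketch of a channel-count-agnostic, window-attention upsampler.

    For every output pixel we attend over a K x K window of low-resolution
    features. The attention logits come from a canonical representation that is
    independent of the channel count C: the K^2 x K^2 cosine similarities
    between the window's feature vectors. No learned parameter depends on C.
    """

    def __init__(self, window: int = 3, hidden: int = 64):
        super().__init__()
        self.window = window
        k2 = window * window
        # Maps the per-window similarity structure to attention logits over the window.
        self.score = nn.Sequential(nn.Linear(k2 * k2, hidden), nn.GELU(), nn.Linear(hidden, k2))

    def forward(self, feats: torch.Tensor, out_size) -> torch.Tensor:
        B, C, h, w = feats.shape
        H, W = out_size
        k2 = self.window ** 2
        # K x K neighbourhood of low-res features around every cell ...
        windows = F.unfold(feats, kernel_size=self.window, padding=self.window // 2)
        windows = windows.view(B, C * k2, h, w)
        # ... replicated to the target resolution (nearest-neighbour anchoring).
        windows = F.interpolate(windows, size=(H, W), mode="nearest").view(B, C, k2, H, W)
        # Canonical, C-independent representation: pairwise cosine similarities.
        normed = F.normalize(windows, dim=1)
        gram = torch.einsum("bckhw,bclhw->bklhw", normed, normed)    # (B, k2, k2, H, W)
        logits = self.score(gram.flatten(1, 2).permute(0, 2, 3, 1))  # (B, H, W, k2)
        attn = logits.softmax(dim=-1)
        # Weighted combination of the original features, so the output keeps dimensionality C.
        return torch.einsum("bhwk,bckhw->bchw", attn, windows)

# The same module (same weights) accepts any channel count:
up = WindowAttentionUpsampler()
print(up(torch.randn(1, 384, 16, 16), (64, 64)).shape)   # torch.Size([1, 384, 64, 64])
print(up(torch.randn(1, 1024, 16, 16), (64, 64)).shape)  # torch.Size([1, 1024, 64, 64])
```

The point of the sketch is only that none of the module's parameters depend on C, which is what makes it reusable across encoders.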
October 16, 2025 at 9:07 AM
Foundation models like DINO or CLIP are used in almost all modern computer vision applications.

However, their features are of low resolution and many applications need pixel-wise features instead.

AnyUp can upsample any features of any dimensionality to any resolution.
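
To make the resolution gap concrete: a ViT-based encoder like DINOv2 with patch size 14 turns a 224×224 image into just a 16×16 feature grid. A minimal sketch, assuming the public DINOv2 torch.hub entry point (bilinear interpolation here is only the naive baseline that a learned upsampler replaces):

```python
import torch
import torch.nn.functional as F

# Load a small DINOv2 backbone via torch.hub.
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()

img = torch.randn(1, 3, 224, 224)  # placeholder for a normalized RGB image

with torch.no_grad():
    tokens = model.forward_features(img)["x_norm_patchtokens"]  # (1, 256, 384)

# 224 / 14 = 16, so we only get a 16x16 grid of 384-dim patch features.
feats = tokens.permute(0, 2, 1).reshape(1, 384, 16, 16)

# Naive baseline: bilinear interpolation back to pixel resolution. A learned
# upsampler replaces this step with sharper, detail-preserving upsampling.
pixelwise = F.interpolate(feats, size=(224, 224), mode="bilinear", align_corners=False)
print(pixelwise.shape)  # torch.Size([1, 384, 224, 224])
```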
October 16, 2025 at 9:07 AM
Reposted by Thomas Wimmer
Suppose you have separate datasets X, Y, Z, without known correspondences.

We do the simplest thing: just train a model (e.g., a next-token predictor) on all elements of the concatenated dataset [X,Y,Z].

You end up with a better model of dataset X than if you had trained on X alone!
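
As a toy version of that recipe (the dataset names and the tiny model below are placeholders, not the authors' setup):

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

# Three separate token datasets with no known correspondences between them.
X = TensorDataset(torch.randint(0, 1000, (5000, 128)))
Y = TensorDataset(torch.randint(0, 1000, (5000, 128)))
Z = TensorDataset(torch.randint(0, 1000, (5000, 128)))

# "The simplest thing": pool everything and train one model on the union.
loader = DataLoader(ConcatDataset([X, Y, Z]), batch_size=32, shuffle=True)

model = torch.nn.Sequential(      # stand-in for a real next-token predictor
    torch.nn.Embedding(1000, 64),
    torch.nn.Linear(64, 1000),
)
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

for (tokens,) in loader:
    logits = model(tokens[:, :-1])                     # predict the next token
    loss = torch.nn.functional.cross_entropy(
        logits.reshape(-1, 1000), tokens[:, 1:].reshape(-1)
    )
    opt.zero_grad()
    loss.backward()
    opt.step()
    break  # one step, just to show the training setup
```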

6/9
October 10, 2025 at 10:13 PM
What was the patch size used here?
August 21, 2025 at 11:44 AM
All the links can be found here. Great collaborators!

bsky.app/profile/odue...
🔗Project page: genintel.github.io/DIY-SC
📄Paper: arxiv.org/pdf/2506.05312
💻Code: github.com/odunkel/DIY-SC
🤗Demo: huggingface.co/spaces/odunk...

Great collaboration with @wimmerthomas.bsky.social, Christian Theobalt, Christian Rupprecht, and @adamkortylewski.bsky.social! [6/6]
June 26, 2025 at 2:30 PM
We only use open-source models, and the implementation of our method is readily available. Please check out the project website for more details:

wimmerth.github.io/gaussians2li...
Gaussians-to-Life: Text-Driven Animation of 3D Gaussian Splatting Scenes
We introduce a method to animate given 3D scenes that uses pre-trained models to lift 2D motion into 3D. We propose a training-free, autoregressive method to generate more 3D-consi...
wimmerth.github.io
March 28, 2025 at 8:35 AM
We can animate arbitrary 3D scenes within 10 minutes on an RTX 4090 while keeping scene appearance and geometry intact.

Note that since I worked on this, open-source video diffusion models have improved significantly, which will directly improve the results of this method as well.

🧵⬇️
March 28, 2025 at 8:35 AM
While we can now transfer motion into 3D, we still have to deal with a fundamental problem: generated videos lack 3D consistency.
With limited resources, we can't fine-tune or retrain a video diffusion model (VDM) to be pose-conditioned. Thus, we propose a zero-shot technique to generate more 3D-consistent videos!
🧵⬇️
March 28, 2025 at 8:35 AM
Standard practices like SDS (Score Distillation Sampling) fail for this task, as VDMs provide a guidance signal that is too noisy, resulting in "exploding" scenes.

Instead, we propose to employ several pre-trained 2D models to directly lift motion from tracked points in the generated videos to 3D Gaussians.
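
In spirit, the lifting step works roughly like this: track points in the generated video, get per-point depth (e.g., from a monocular depth model or the rendered scene geometry), unproject with the known camera, and move nearby Gaussians along the resulting 3D displacements. A heavily simplified, hypothetical sketch, not the actual implementation:

```python
import torch

def unproject(uv, depth, K_inv):
    """Lift 2D pixel coordinates (N, 2) with per-point depth (N,) to 3D camera space."""
    ones = torch.ones(uv.shape[0], 1)
    rays = (K_inv @ torch.cat([uv, ones], dim=1).T).T   # (N, 3) camera rays
    return rays * depth[:, None]

def lift_track_motion(tracks_2d, depths, K_inv, gaussian_means, radius=0.05):
    """Move Gaussians along the 3D motion of nearby tracked points.

    tracks_2d: (T, N, 2) 2D point tracks over T frames (from a pre-trained tracker)
    depths:    (T, N) per-point depth for those tracks
    """
    p0 = unproject(tracks_2d[0], depths[0], K_inv)       # tracked points, first frame
    p1 = unproject(tracks_2d[-1], depths[-1], K_inv)     # same points, last frame
    flow_3d = p1 - p0                                    # per-point 3D displacement

    # Assign each Gaussian the displacement of its nearest tracked point,
    # but only if that point is close enough to the Gaussian center.
    dists = torch.cdist(gaussian_means, p0)              # (G, N)
    nearest = dists.argmin(dim=1)
    moved = gaussian_means + flow_3d[nearest]
    close_enough = dists.gather(1, nearest[:, None]).squeeze(1) < radius
    return torch.where(close_enough[:, None], moved, gaussian_means)
```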

🧵⬇️
March 28, 2025 at 8:35 AM
I wonder to what degree one could artificially make real images (with GT depth) more abstract during training, in order to make depth models learn the priors we humans have (like green = field, blue = sky), and whether that would actually give us any benefit, like increased robustness...
February 14, 2025 at 2:30 PM
Ah, thanks, I overlooked that :)
February 14, 2025 at 2:19 PM