Stefan Baumann
stefanabaumann.bsky.social
PhD Student at @compvis.bsky.social & @ellis.eu working on generative computer vision. Interested in extracting world understanding from models and more controlled generation. 🌐 https://stefan-baumann.eu/
Pinned
Ever wondered if diffusion features could do better without all the noise? 🤔

Turns out they can! We show how adapting the backbone unlocks clean, powerful features for better results across the board. 🚀🧹

Check it out! ⬇️
🤔 Why do we extract diffusion features from noisy images? Isn’t that destroying information?

Yes, it is - but we found a way to do better. 🚀

Here’s how we unlock better features, no noise, no hassle.

📝 Project Page: compvis.github.io/cleandift
💻 Code: github.com/CompVis/clea...

🧵👇
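To see why extracting features from noised inputs is lossy, here's a minimal numpy sketch of the standard DDPM forward process, x_t = √ᾱ·x_0 + √(1−ᾱ)·ε, on a toy 1-D signal. (This is a generic illustration of the noising step the thread refers to, not CleanDIFT's actual feature-extraction code.)

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "image": a 1-D signal standing in for clean pixel data x_0.
x0 = np.sin(np.linspace(0, 4 * np.pi, 256))

def noise(x0, alpha_bar, rng):
    """DDPM forward process: x_t = sqrt(a_bar)*x_0 + sqrt(1 - a_bar)*eps."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

# How much of x_0 survives in x_t at different noise levels?
corrs = {}
for alpha_bar in (0.99, 0.5, 0.01):
    xt = noise(x0, alpha_bar, rng)
    corrs[alpha_bar] = np.corrcoef(x0, xt)[0, 1]
    print(f"alpha_bar={alpha_bar:.2f}  corr(x0, xt)={corrs[alpha_bar]:.2f}")
```

The correlation with the clean signal drops toward zero as ᾱ shrinks: any feature computed from a heavily noised x_t necessarily sees a degraded version of the image, which is exactly the information loss the thread is about.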
The work I linked relates to pretraining, too. Doing this for multiple rewards at once is indeed an aspect I haven't seen before; I was just curious whether I was missing something about the general method

Lovely work!
Let's make everything generative! There's no reason to forgo having an (at least implicit) distribution for every prediction we make, if we can get it at least as accurate and similarly efficient as discriminative baselines in the long run
Reposted by Stefan Baumann
Excited to share that we'll be presenting four papers at the main conference at ICCV 2025 this week!

Come say hi in Honolulu!

👋 Pingchuan, Ming, Felix, Stefan, Timy, and Björn Ommer will be attending.
Reposted by Stefan Baumann
🤔 What if you could generate an entire image using just one continuous token?

💡 It works if we leverage a self-supervised representation!

Meet RepTok🦎: A generative model that encodes an image into a single continuous latent while keeping realism and semantics. 🧵 👇
Thank you! I think that might be possible, although I'd likely consider incorporating more information in that case
All of this wouldn't have been possible without the support of my amazing collaborators
@rmsnorm.bsky.social, @timyphan.bsky.social, and Björn Ommer at @compvis.bsky.social. A giant thank you to them! ❤️
⚡️ FPT generalizes from open-set training. Applications:
• Articulated motion (Drag-A-Move): fine-tuned FPT outperforms specialized models for motion prediction
• Face motion: zero-shot, beats specialized baselines
• Moving part segmentation: emerges from formulation
⚙️ Unlike other methods, we don't regress or sample one trajectory.
FPT 𝘳𝘦𝘱𝘳𝘦𝘴𝘦𝘯𝘵𝘴 𝘵𝘩𝘦 𝘧𝘶𝘭𝘭 𝘮𝘰𝘵𝘪𝘰𝘯 𝘥𝘪𝘴𝘵𝘳𝘪𝘣𝘶𝘵𝘪𝘰𝘯, enabling:
• interpretable uncertainty
• controllable interaction effects
• efficient prediction (>100k predictions/s)
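A minimal sketch of what "representing the full motion distribution" can buy you, using a 2-component Gaussian mixture over 2-D flow vectors as a stand-in (the thread doesn't specify FPT's actual parameterization, so the mixture form, weights, and means here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical per-point prediction: a 2-component Gaussian mixture over
# 2-D flow vectors (dx, dy). One poke, two plausible outcomes — e.g. a
# door swinging open vs. staying shut.
weights = np.array([0.7, 0.3])
means = np.array([[1.0, 0.0],   # mode 1: move right
                  [0.0, 0.0]])  # mode 2: stay put
stds = np.array([0.1, 0.05])

def sample_flow(n, rng):
    """Draw n flow vectors from the mixture (ancestral sampling)."""
    comp = rng.choice(len(weights), size=n, p=weights)
    return means[comp] + stds[comp, None] * rng.standard_normal((n, 2))

samples = sample_flow(10_000, rng)
mean_dx = samples.mean(axis=0)[0]  # ≈ 0.7*1.0 + 0.3*0.0

# The entropy over mixture weights is one cheap, interpretable
# uncertainty measure a distributional prediction gives you for free.
entropy = -np.sum(weights * np.log(weights))
print(mean_dx, entropy)
```

A regressed point estimate would collapse to the mean flow (≈0.7 to the right), a motion that neither mode actually exhibits; sampling from the distribution keeps both outcomes.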
💡 Our idea:
Predict 𝗱𝗶𝘀𝘁𝗿𝗶𝗯𝘂𝘁𝗶𝗼𝗻𝘀 of motion, not just one flow field instance.

Given a few pokes, our model outputs the probability 𝘥𝘪𝘴𝘵𝘳𝘪𝘣𝘶𝘵𝘪𝘰𝘯 of how parts of the scene might move.

→ This directly captures 𝘶𝘯𝘤𝘦𝘳𝘵𝘢𝘪𝘯𝘵𝘺 and interactions.
🧠 Understanding how the world 𝘤𝘰𝘶𝘭𝘥 change is core to physical intelligence.

But most models predict 𝗼𝗻𝗲 𝗳𝘂𝘁𝘂𝗿𝗲, a single deterministic motion.

The reality is 𝘶𝘯𝘤𝘦𝘳𝘵𝘢𝘪𝘯 and 𝘮𝘶𝘭𝘵𝘪-𝘮𝘰𝘥𝘢𝘭: one poke can lead to many outcomes.
🤔 What happens when you poke a scene — and your model has to predict how the world moves in response?

We built the Flow Poke Transformer (FPT) to model multi-modal scene dynamics from sparse interactions.

It learns to predict the 𝘥𝘪𝘴𝘵𝘳𝘪𝘣𝘶𝘵𝘪𝘰𝘯 of motion itself 🧵👇
Oh yeah, sorry, I should've made it more clear that I was talking in the more general case
Take, for example, (zero-shot) semantic correspondence working quite well based on activations of image diffusion models.

The model has never been trained for it, and, while it's obvious that related capabilities might be useful for denoising, I'd still consider this an emergent capability
Not in the sense of, e.g., generating new kinds of videos when the model was trained for video generation, but capabilities w.r.t. other tasks could still be considered emergent, right?
First time I've ever heard someone from the 3D CV community actually say this out loud! This has been bugging me for a long time
Why are you not on a current stable version?
The bugs I ran into reproduce across 2.7, 2.8 and current nightlies
Welcome to the club! I've somehow managed to find two bugs with torch.compile() in the last few days 🥲
Reposted by Stefan Baumann
“Everyone knows” what an autoencoder is… but there's an important complementary picture missing from most introductory material.

In short: we emphasize how autoencoders are implemented—but not always what they represent (and some of the implications of that representation).🧵
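One concrete instance of the "what it represents" view: a linear autoencoder trained with MSE doesn't learn an arbitrary compression; its optimal bottleneck reconstructs via projection onto the top principal subspace (up to an invertible map in code space). A numpy sketch with synthetic data (the low-rank data setup here is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

# Data whose variance lives mostly in a 2-D subspace of R^5.
Z = rng.standard_normal((500, 2)) * np.array([3.0, 2.0])
A = rng.standard_normal((2, 5))
X = Z @ A + 0.05 * rng.standard_normal((500, 5))
X -= X.mean(axis=0)

# The MSE-optimal linear autoencoder with a k-dim bottleneck is
# equivalent to projecting onto the top-k principal components.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
k = 2

def encode(x):
    return x @ Vt[:k].T  # "encoder": project onto top-k PCs

def decode(z):
    return z @ Vt[:k]    # "decoder": lift back to R^5

X_hat = decode(encode(X))
rel_err = np.linalg.norm(X - X_hat) / np.linalg.norm(X)
print(f"relative reconstruction error: {rel_err:.3f}")
```

The implementation is just two matrix multiplies; the interesting part is the representation — here, the learned code is (a linear transform of) the data's dominant directions of variation, which is the kind of "what does it represent" framing the post is pointing at.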
That process really sounds like a labor of love! Penrose looks really interesting, I'll play around with it! Thanks!