Johannes Schusterbauer
@joh-schb.bsky.social
PhD Student @ CompVis group, LMU Munich
Working on diffusion & flow models🫶
This work was co-led by @mgui7.bsky.social and me and wouldn't have been possible without the help of all the other collaborators: @timyphan.bsky.social, Felix Krause, @kindsuss.bsky.social, @itsbautistam.bsky.social, and Björn Ommer.

A big thank you to all of them🙏
October 17, 2025 at 10:21 AM
RepTok merges representation learning & generation

A self-supervised token becomes the latent of a generative model.
It’s efficient, continuous, and geometry-preserving - no quantization, no attention overhead.

Check it out
💻 github.com/CompVis/RepTok
📄 arxiv.org/abs/2510.14630
October 17, 2025 at 10:21 AM
Ablations show:
• Works across SSL encoders (DINOv2 best, CLIP & MAE close)
• Cosine-similarity loss balances fidelity vs generativity
• Without SSL priors → reconstructions good, generations collapse
October 17, 2025 at 10:21 AM
RepTok’s geometry stays smooth: linear interpolations in latent space yield natural transitions in both shape and semantics.

This shows that the single-token latent preserves structured continuity - not just abstract semantics.
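The interpolation trick above can be sketched in a few lines. This is an illustrative stand-in, not the paper's code: the latents and dimensions are made up, and in practice each interpolated token would be passed through RepTok's decoder to render an image.

```python
import numpy as np

def lerp(z_a, z_b, alpha):
    """Linear interpolation between two single-token latents."""
    return (1.0 - alpha) * z_a + alpha * z_b

# Hypothetical 768-dim single-token latents for two images.
rng = np.random.default_rng(0)
z_a, z_b = rng.normal(size=768), rng.normal(size=768)

# Five evenly spaced points from z_a to z_b; each point would be decoded.
path = [lerp(z_a, z_b, a) for a in np.linspace(0.0, 1.0, 5)]
```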
October 17, 2025 at 10:21 AM
Even with a limited training budget, we still reach a competitive zero-shot FID on MS-COCO - rivaling much larger diffusion models.
October 17, 2025 at 10:21 AM
We also extend RepTok to text-to-image generation using cross-attention to embeddings of frozen language models.

Training: <20 h on 4×A100 GPUs.
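A minimal sketch of that conditioning mechanism, with made-up dimensions and hypothetical weight names: the single image token acts as the query and cross-attends to the frozen language model's caption embeddings (keys and values).

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def cross_attend(z, text_emb, Wq, Wk, Wv):
    q = Wq @ z                                # query from the image token
    k = text_emb @ Wk.T                       # keys from caption embeddings
    v = text_emb @ Wv.T                       # values from caption embeddings
    w = softmax(k @ q / np.sqrt(q.shape[0]))  # attention over caption tokens
    return w @ v                              # text-conditioned update

d, seq = 768, 16
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.normal(scale=0.02, size=(d, d)) for _ in range(3))
out = cross_attend(rng.normal(size=d), rng.normal(size=(seq, d)), Wq, Wk, Wv)
```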
October 17, 2025 at 10:21 AM
For generation, we model the latent space directly with an MLP-Mixer (no attention at all!).

Since there’s only one token, token-to-token attention isn’t needed - drastically reducing compute.

Training cost drops by >90% vs. transformer-based diffusion while maintaining a competitive FID on ImageNet.
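To make the "no attention needed" point concrete, here is a toy sketch (not the paper's architecture): with a single latent token there are no token pairs to attend over, so the flow model's velocity field can be a plain MLP over the token plus a scalar timestep.

```python
import numpy as np

def mlp_velocity(z, t, params):
    """Toy attention-free velocity field for one latent token."""
    h = np.concatenate([z, [t]])                          # token + time
    h = np.maximum(params["W1"] @ h + params["b1"], 0.0)  # hidden layer, ReLU
    return params["W2"] @ h + params["b2"]                # predicted velocity

d, hdim = 768, 1024                                        # made-up sizes
rng = np.random.default_rng(0)
params = {
    "W1": rng.normal(scale=0.02, size=(hdim, d + 1)), "b1": np.zeros(hdim),
    "W2": rng.normal(scale=0.02, size=(d, hdim)),     "b2": np.zeros(d),
}
v = mlp_velocity(rng.normal(size=d), 0.5, params)
```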
October 17, 2025 at 10:21 AM
Despite using just one token (dim ~768), RepTok reconstructs images faithfully and achieves:

📉 rFID = 1.85 on ImageNet-256
📈 PSNR = 14.9

That’s better than or comparable to multi-token methods like TiTok or FlexTok - with just a single continuous token.
October 17, 2025 at 10:21 AM
🔑 The key to making it all work:

We introduce an additional loss term to keep the tuned [CLS] token close to the original representation!

❗️This keeps it semantically structured yet reconstruction-aware.
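A hedged sketch of that regularizer (the function name is ours, not the paper's): a cosine-similarity penalty that discourages the fine-tuned [CLS] token from drifting away from the frozen encoder's original representation.

```python
import numpy as np

def cls_alignment_loss(z_tuned, z_frozen, eps=1e-8):
    """Penalty that grows as the tuned [CLS] token drifts from the original."""
    cos = np.dot(z_tuned, z_frozen) / (
        np.linalg.norm(z_tuned) * np.linalg.norm(z_frozen) + eps)
    return 1.0 - cos  # 0 when perfectly aligned, 2 when opposite

z = np.ones(768)
loss_same = cls_alignment_loss(z, z)   # no drift -> near-zero penalty
loss_flip = cls_alignment_loss(z, -z)  # maximal drift -> penalty near 2
```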
October 17, 2025 at 10:21 AM
💡 The idea

We start from a frozen self-supervised encoder (DINOv2, MAE, or CLIP) and combine it with a generative decoder.

Then we fine-tune only the [CLS] token embedding - injecting low-level info while keeping the rest frozen.
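The "fine-tune only the [CLS] pathway" recipe can be sketched framework-agnostically (toy 4-dim parameters, not the real model): only parameter groups marked trainable receive gradient updates, and the SSL encoder stays frozen.

```python
import numpy as np

# Only the [CLS]-token parameters are marked trainable.
trainable = {"encoder": False, "cls_token": True}

def sgd_step(params, grads, lr=0.1):
    """Apply an SGD update only to trainable parameter groups."""
    return {name: p - lr * grads[name] if trainable[name] else p
            for name, p in params.items()}

params = {"encoder": np.ones(4), "cls_token": np.zeros(4)}
grads = {"encoder": np.ones(4), "cls_token": np.ones(4)}
updated = sgd_step(params, grads)
# "encoder" is untouched; "cls_token" moves by -lr * grad.
```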
October 17, 2025 at 10:21 AM
💸 Current diffusion or flow models operate on redundant and expensive 2D latent grids...

VAEs, diffusion AEs, and tokenizers use a large number of latent tokens / patches.

But images often share structure that could be represented compactly!
October 17, 2025 at 10:21 AM
If you are interested, feel free to check out the paper (arxiv.org/abs/2506.02221) or stop by our poster at CVPR:

📌 Poster Session 6, Sunday 4:00 to 6:00 PM, Poster #208
Diff2Flow: Training Flow Matching Models via Diffusion Model Alignment
June 6, 2025 at 3:48 PM
It's a framework that bridges Diffusion and Flow Matching paradigms by rescaling timesteps, aligning interpolants, and deriving FM-compatible velocity fields. This enables efficient FM finetuning of diffusion priors, retaining their knowledge while giving us the benefits of Flow Matching 🚀
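As a generic illustration of the eps-to-velocity idea (not the exact Diff2Flow derivation, which also rescales diffusion timesteps): under the linear interpolant x_t = (1 - t) * x0 + t * eps, the flow-matching target velocity is v = eps - x0, which a diffusion model's eps-prediction yields in closed form.

```python
import numpy as np

def eps_to_velocity(x_t, eps_hat, t):
    """Convert an eps-prediction into a linear-interpolant FM velocity."""
    # From x0 = (x_t - t * eps_hat) / (1 - t):
    # v = eps_hat - x0 = (eps_hat - x_t) / (1 - t)
    return (eps_hat - x_t) / (1.0 - t)

rng = np.random.default_rng(0)
x0, eps = rng.normal(size=8), rng.normal(size=8)
t = 0.3
x_t = (1 - t) * x0 + t * eps
v = eps_to_velocity(x_t, eps, t)  # equals eps - x0 when eps_hat is exact
```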
June 6, 2025 at 3:48 PM
Hi, I would be happy to be on that list as well. Working on diffusion & flow matching at @compvis.bsky.social under the supervision of Björn Ommer.
November 25, 2024 at 2:14 PM