Johannes Schusterbauer
@joh-schb.bsky.social
PhD Student @ CompVis group, LMU Munich
Working on diffusion & flow models🫶
RepTok’s geometry stays smooth: linear interpolations in latent space yield natural transitions in both shape and semantics.

This shows that the single-token latent preserves structured continuity - not just abstract semantics.
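
To make that concrete, here's a minimal sketch of such an interpolation, assuming hypothetical encode/decode helpers that map between an image and the single RepTok latent:

```python
import torch

def interpolate_latents(z_a, z_b, num_steps=8):
    """Linearly interpolate between two single-token latents of shape (1, d)."""
    alphas = torch.linspace(0.0, 1.0, num_steps).view(-1, 1)
    return (1 - alphas) * z_a + alphas * z_b  # broadcasts to (num_steps, d)

# Hypothetical usage (encode/decode are assumed helpers, not from the paper):
# z_a, z_b = encode(img_a), encode(img_b)          # each (1, 768)
# frames = [decode(z.unsqueeze(0)) for z in interpolate_latents(z_a, z_b)]
```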
October 17, 2025 at 10:21 AM
Even with a limited training budget, we still reach a competitive zero-shot FID on MS-COCO - rivaling much larger diffusion models.
October 17, 2025 at 10:21 AM
We also extend RepTok to text-to-image generation using cross-attention to embeddings of frozen language models.

Training: <20 h on 4×A100 GPUs.
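
A rough sketch of how such conditioning could look, using standard PyTorch cross-attention (dimensions and names are illustrative assumptions, not the exact architecture):

```python
import torch
import torch.nn as nn

class TextCrossAttention(nn.Module):
    """The single latent token (query) attends to frozen LM embeddings (keys/values)."""
    def __init__(self, latent_dim=768, text_dim=768, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(latent_dim, num_heads,
                                          kdim=text_dim, vdim=text_dim,
                                          batch_first=True)

    def forward(self, z, text_emb):
        # z: (B, 1, latent_dim); text_emb: (B, T, text_dim) from a frozen LM
        out, _ = self.attn(query=z, key=text_emb, value=text_emb)
        return z + out  # residual update of the latent token
```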
October 17, 2025 at 10:21 AM
For generation, we model the latent space directly with an MLP-Mixer (no attention at all!).

Since there’s only one token, token-to-token attention isn’t needed - drastically reducing compute.

Training cost drops by >90% vs. transformer-based diffusion while maintaining a competitive FID on ImageNet.
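
With a single token there is nothing to mix along the sequence axis, so the denoising network reduces to channel-mixing MLP blocks. A minimal sketch of such a latent-space velocity net (layer sizes and time conditioning are illustrative):

```python
import torch
import torch.nn as nn

class ChannelMLPBlock(nn.Module):
    """Channel-mixing block; with one token, token mixing is a no-op."""
    def __init__(self, dim=768, hidden=3072):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(),
                                 nn.Linear(hidden, dim))

    def forward(self, x):
        return x + self.mlp(self.norm(x))

class LatentVelocityNet(nn.Module):
    """Predicts a velocity/denoising target for the noisy single-token latent."""
    def __init__(self, dim=768, depth=12):
        super().__init__()
        self.time_mlp = nn.Sequential(nn.Linear(1, dim), nn.SiLU(),
                                      nn.Linear(dim, dim))
        self.blocks = nn.ModuleList([ChannelMLPBlock(dim) for _ in range(depth)])
        self.out = nn.Linear(dim, dim)

    def forward(self, z_t, t):
        # z_t: (B, dim) noisy latent; t: (B, 1) timestep in [0, 1]
        h = z_t + self.time_mlp(t)
        for blk in self.blocks:
            h = blk(h)
        return self.out(h)
```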
October 17, 2025 at 10:21 AM
Despite using just one token (dim ~768), RepTok reconstructs images faithfully and achieves:

📉 rFID = 1.85 on ImageNet-256
📈 PSNR = 14.9 dB

That’s better than or comparable to multi-token methods like TiTok or FlexTok - with a single continuous token.
October 17, 2025 at 10:21 AM
🔑 The key to making it all work:

We introduce an additional loss term to keep the tuned [CLS] token close to the original representation!

❗️This keeps it semantically structured yet reconstruction-aware.
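
In code, the objective could look roughly like this (λ and the MSE reconstruction term are illustrative placeholders; the paper's exact losses and weighting may differ):

```python
import torch.nn.functional as F

def reptok_loss(decoded, image, z_tuned, z_frozen, lam=0.1):
    recon = F.mse_loss(decoded, image)     # reconstruction-aware term (placeholder)
    reg = F.mse_loss(z_tuned, z_frozen)    # keep the tuned [CLS] near the frozen one
    return recon + lam * reg
```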
October 17, 2025 at 10:21 AM
💡 The idea

We start from a frozen self-supervised encoder (DINOv2, MAE, or CLIP) and combine it with a generative decoder.

Then we fine-tune only the [CLS] token embedding - injecting low-level info while keeping the rest frozen.
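
Conceptually the trainable surface is tiny. A sketch of the parameter split, with `encoder`, `decoder`, and `cls_adapter` as hypothetical module names:

```python
# Freeze the self-supervised backbone (DINOv2 / MAE / CLIP).
for p in encoder.parameters():
    p.requires_grad = False

# Train only the [CLS]-token pathway and the generative decoder.
for p in cls_adapter.parameters():   # hypothetical module tuning the [CLS] embedding
    p.requires_grad = True
for p in decoder.parameters():
    p.requires_grad = True
```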
October 17, 2025 at 10:21 AM
🤔 What if you could generate an entire image using just one continuous token?

💡 It works if we leverage a self-supervised representation!

Meet RepTok🦎: A generative model that encodes an image into a single continuous latent while keeping realism and semantics. 🧵 👇
October 17, 2025 at 10:21 AM
Looking forward to attending #CVPR2025 in Nashville next week 🎸🎶 @mgui7.bsky.social and I will be presenting our latest work:

🌊 Diff2Flow: Training Flow Matching Models via Diffusion Model Alignment
June 6, 2025 at 3:48 PM
Sunrise in the office after the #ICCV deadline night with @mgui7.bsky.social 🚀
March 8, 2025 at 5:46 AM