This shows that the single-token latent preserves structured continuity - not just abstract semantics.
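To make "structured continuity" concrete, here is a minimal sketch (my own illustration; `encoder` and `decoder` are placeholders, not RepTok's actual API): walk a straight line between the single-token latents of two images and decode each point - a smooth latent space gives a smooth morph.

```python
# Illustrative sketch only: `encoder` / `decoder` are placeholders,
# not RepTok's actual modules or API.
import torch

@torch.no_grad()
def interpolate(encoder, decoder, img_a, img_b, steps=8):
    """Interpolate between the single-token latents of two images and
    decode every intermediate point."""
    z_a = encoder(img_a)  # one continuous token per image, e.g. shape (1, d)
    z_b = encoder(img_b)
    return [decoder((1 - t) * z_a + t * z_b) for t in torch.linspace(0, 1, steps)]
```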
Training: <20 h on 4×A100 GPUs.
Since there’s only one token, token-to-token attention isn’t needed - drastically reducing compute.
Training cost drops by >90% vs transformer-based diffusion while maintaining a competitive FID on ImageNet.
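Rough sketch of why (my own illustration, not the paper's code): with one latent token, the generator can be a time-conditioned MLP over a single vector, so there is nothing for token-to-token self-attention to attend over.

```python
# Illustrative sketch, not the paper's architecture: a denoiser over a
# single continuous token is just a time-conditioned MLP on one vector,
# with no token-to-token self-attention at all.
import torch
import torch.nn as nn

class SingleTokenDenoiser(nn.Module):
    def __init__(self, dim: int = 768, hidden: int = 2048):
        super().__init__()
        self.time_embed = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
        self.net = nn.Sequential(
            nn.Linear(2 * dim, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, z_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # z_t: (B, dim) noisy latent token, t: (B,) time in [0, 1]
        return self.net(torch.cat([z_t, self.time_embed(t[:, None])], dim=-1))

# Per-sample cost scales with `dim` and `hidden` only, independent of image
# resolution, versus quadratic attention over N patch tokens in a standard DiT.
```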
📉 rFID = 1.85 on ImageNet-256
📈 PSNR = 14.9
That’s better than or comparable to multi-token methods like TiTok or FlexTok - with a single continuous token.
We introduce an additional loss term to keep the tuned [CLS] token close to the original representation!
❗️This keeps it semantically structured yet reconstruction-aware.
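In pseudocode, the idea looks roughly like this (the exact distance, weight, and reconstruction loss are assumptions, not the paper's choices):

```python
# Sketch of the idea; the paper's exact formulation may differ.
import torch.nn.functional as F

def regularized_reconstruction_loss(decoded, target, cls_tuned, cls_frozen, reg_weight=0.1):
    rec = F.mse_loss(decoded, target)                        # reconstruction term
    stay_close = F.mse_loss(cls_tuned, cls_frozen.detach())  # keep [CLS] near its SSL origin
    return rec + reg_weight * stay_close
```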
We start from a frozen self-supervised encoder (DINOv2, MAE, or CLIP) and combine it with a generative decoder.
Then we fine-tune only the [CLS] token embedding - injecting low-level info while keeping the rest frozen.
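A plain-PyTorch sketch of that setup (generic ViT-style encoder; the `cls_token` attribute and optimizer settings are assumptions, not RepTok's actual training code):

```python
# Sketch with a generic ViT-style encoder: `cls_token` and the optimizer
# hyperparameters are placeholders, not the paper's actual code.
import torch
import torch.nn as nn

def prepare_single_token_finetuning(encoder: nn.Module, decoder: nn.Module):
    # Freeze the entire self-supervised encoder (DINOv2 / MAE / CLIP)...
    for p in encoder.parameters():
        p.requires_grad = False
    # ...then re-enable gradients only for the learnable [CLS] token
    # embedding (assumed here to live at `encoder.cls_token`), which is
    # free to absorb the low-level detail needed for reconstruction.
    encoder.cls_token.requires_grad = True
    # The generative decoder is trained as usual.
    trainable = [encoder.cls_token, *decoder.parameters()]
    return torch.optim.AdamW(trainable, lr=1e-4)
```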
💡 It works if we leverage a self-supervised representation!
Meet RepTok🦎: A generative model that encodes an image into a single continuous latent while keeping realism and semantics. 🧵 👇
🌊 Diff2Flow: Training Flow Matching Models via Diffusion Model Alignment