Working on diffusion & flow models🫶
A big thank you to all of them🙏
A self-supervised token becomes the latent of a generative model.
It’s efficient, continuous, and geometry-preserving - no quantization, no attention overhead.
Check it out
💻 github.com/CompVis/RepTok
📄 arxiv.org/abs/2510.14630
• Works across SSL encoders (DINOv2 best, CLIP & MAE close)
• Cosine-similarity loss balances fidelity vs generativity
• Without SSL priors → reconstructions good, generations collapse
This shows that the single-token latent preserves structured continuity - not just abstract semantics.
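One way to probe that continuity is to walk between the single-token latents of two images and decode along the path. A minimal sketch below; `encode_cls` and `decode` are placeholder names for whatever the released code exposes, not the repo's actual API.

```python
import torch

@torch.no_grad()
def interpolate_latents(z_a: torch.Tensor, z_b: torch.Tensor, steps: int = 8) -> torch.Tensor:
    """Linearly interpolate between two single-token latents [D] -> [steps, D]."""
    alphas = torch.linspace(0.0, 1.0, steps).unsqueeze(1)
    return (1.0 - alphas) * z_a.unsqueeze(0) + alphas * z_b.unsqueeze(0)

# Usage sketch with hypothetical helpers:
# z_a, z_b = encode_cls(img_a), encode_cls(img_b)      # single continuous tokens
# frames = [decode(z) for z in interpolate_latents(z_a, z_b)]
```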
Training: <20 h on 4×A100 GPUs.
Since there’s only one token, token-to-token attention isn’t needed - drastically reducing compute.
Training cost drops by >90% vs transformer-based diffusion, while maintaining a competitive FID on ImageNet.
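To see why compute drops: with a single latent token there is no sequence for self-attention to mix, so the generative model can be a plain conditioned MLP over one vector. A toy sketch below, assuming a flow/diffusion-style network in the single-token latent space; the architecture and names are illustrative, not the paper's.

```python
import torch
import torch.nn as nn

class SingleTokenVelocityNet(nn.Module):
    """Toy generator over a single latent token: no token-to-token attention needed.

    With N tokens, self-attention costs O(N^2 * D); with N = 1 the model reduces
    to per-vector MLP blocks conditioned on the timestep (and optionally a class).
    """
    def __init__(self, dim: int = 768, hidden: int = 2048):
        super().__init__()
        self.time_embed = nn.Sequential(nn.Linear(1, hidden), nn.SiLU(), nn.Linear(hidden, dim))
        self.net = nn.Sequential(
            nn.Linear(2 * dim, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, z_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # z_t: [B, D] noisy single-token latent, t: [B, 1] timestep in [0, 1]
        return self.net(torch.cat([z_t, self.time_embed(t)], dim=-1))
```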
📉 rFID = 1.85 on ImageNet-256
📈 PSNR = 14.9
That’s better than or comparable to multi-token methods like TiTok or FlexTok - with a single continuous token.
We introduce an additional loss term to keep the tuned [CLS] token close to the original representation!
❗️This keeps it semantically structured yet reconstruction-aware.
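Roughly, this can be written as a cosine-similarity penalty between the tuned token and the frozen encoder's original [CLS] output. A minimal sketch, with `z_tuned`/`z_orig` and the loss weighting as placeholders rather than the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def cls_alignment_loss(z_tuned: torch.Tensor, z_orig: torch.Tensor) -> torch.Tensor:
    """Penalize the tuned [CLS] token for drifting away from the original SSL representation.

    z_tuned: [B, D] tokens from the fine-tuned [CLS] pathway
    z_orig:  [B, D] tokens from the frozen, unmodified encoder (no gradient)
    """
    return (1.0 - F.cosine_similarity(z_tuned, z_orig.detach(), dim=-1)).mean()

# Total objective (weights illustrative, not the paper's values):
# loss = reconstruction_loss + lambda_align * cls_alignment_loss(z_tuned, z_orig)
```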
We start from a frozen self-supervised encoder (DINOv2, MAE, or CLIP) and combine it with a generative decoder.
Then we fine-tune only the [CLS] token embedding - injecting low-level info while keeping the rest frozen.
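A minimal PyTorch sketch of that setup, assuming a timm ViT-style SSL backbone that exposes a `cls_token` parameter; the model name and setup are illustrative, not the repo's actual code.

```python
import timm  # assumption: any ViT-style SSL backbone with a `cls_token` parameter works

# Illustrative: a DINOv2-style ViT as the frozen self-supervised encoder
encoder = timm.create_model("vit_base_patch14_dinov2", pretrained=True)

# Freeze everything, then unfreeze only the [CLS] token embedding
for name, p in encoder.named_parameters():
    p.requires_grad = "cls_token" in name

trainable = [n for n, p in encoder.named_parameters() if p.requires_grad]
print(trainable)  # -> ['cls_token']: the single learnable entry point for low-level info
# The generative decoder is trained on top of this single token (not shown here).
```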
VAEs, diffusion AEs, and tokenizers use a large number of latent tokens / patches.
But images often share structure that could be represented compactly!
📌 Poster Session 6, Sunday 4:00 to 6:00 PM, Poster #208