Working on diffusion & flow models🫶
A big thank you to all of them🙏
A self-supervised token becomes the latent of a generative model.
It’s efficient, continuous, and geometry-preserving - no quantization, no attention overhead.
Check it out
💻 github.com/CompVis/RepTok
📄 arxiv.org/abs/2510.14630
• Works across SSL encoders (DINOv2 best, CLIP & MAE close)
• Cosine-similarity loss balances fidelity vs generativity
• Without SSL priors → reconstructions good, generations collapse
This shows that the single-token latent preserves structured continuity - not just abstract semantics.
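One way to probe that continuity is to walk between the single-token latents of two images and decode along the path. A minimal sketch below; `encode_cls` and `decode` are placeholder names for whatever the released code exposes, not the repo's actual API.

```python
import torch

@torch.no_grad()
def interpolate_latents(z_a: torch.Tensor, z_b: torch.Tensor, steps: int = 8) -> torch.Tensor:
    """Linearly interpolate between two single-token latents [D] -> [steps, D]."""
    alphas = torch.linspace(0.0, 1.0, steps).unsqueeze(1)
    return (1.0 - alphas) * z_a.unsqueeze(0) + alphas * z_b.unsqueeze(0)

# Usage sketch with hypothetical helpers:
# z_a, z_b = encode_cls(img_a), encode_cls(img_b)      # single continuous tokens
# frames = [decode(z) for z in interpolate_latents(z_a, z_b)]
```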
Training: <20 h on 4×A100 GPUs.
Since there’s only one token, token-to-token attention isn’t needed - drastically reducing compute.
Training cost drops by >90% vs transformer-based diffusion, while maintaining a competitive FID on ImageNet.
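To see why compute drops: with a single latent token there is no sequence for self-attention to mix, so the generative model can be a plain conditioned MLP over one vector. A toy sketch below, assuming a flow/diffusion-style network in the single-token latent space; the architecture and names are illustrative, not the paper's.

```python
import torch
import torch.nn as nn

class SingleTokenVelocityNet(nn.Module):
    """Toy generator over a single latent token: no token-to-token attention needed.

    With N tokens, self-attention costs O(N^2 * D); with N = 1 the model reduces
    to per-vector MLP blocks conditioned on the timestep (and optionally a class).
    """
    def __init__(self, dim: int = 768, hidden: int = 2048):
        super().__init__()
        self.time_embed = nn.Sequential(nn.Linear(1, hidden), nn.SiLU(), nn.Linear(hidden, dim))
        self.net = nn.Sequential(
            nn.Linear(2 * dim, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, z_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # z_t: [B, D] noisy single-token latent, t: [B, 1] timestep in [0, 1]
        return self.net(torch.cat([z_t, self.time_embed(t)], dim=-1))
```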
📉 rFID = 1.85 on ImageNet-256
📈 PSNR = 14.9
That’s better than or comparable to multi-token methods like TiTok or FlexTok - with a single continuous token.
We introduce an additional loss term to keep the tuned [CLS] token close to the original representation!
❗️This keeps it semantically structured yet reconstruction-aware.
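Roughly, this can be written as a cosine-similarity penalty between the tuned token and the frozen encoder's original [CLS] output. A minimal sketch, with `z_tuned`/`z_orig` and the loss weighting as placeholders rather than the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def cls_alignment_loss(z_tuned: torch.Tensor, z_orig: torch.Tensor) -> torch.Tensor:
    """Penalize the tuned [CLS] token for drifting away from the original SSL representation.

    z_tuned: [B, D] tokens from the fine-tuned [CLS] pathway
    z_orig:  [B, D] tokens from the frozen, unmodified encoder (no gradient)
    """
    return (1.0 - F.cosine_similarity(z_tuned, z_orig.detach(), dim=-1)).mean()

# Total objective (weights illustrative, not the paper's values):
# loss = reconstruction_loss + lambda_align * cls_alignment_loss(z_tuned, z_orig)
```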
We start from a frozen self-supervised encoder (DINOv2, MAE, or CLIP) and combine it with a generative decoder.
Then we fine-tune only the [CLS] token embedding - injecting low-level info while keeping the rest frozen.
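A minimal PyTorch sketch of that setup, assuming a timm ViT-style SSL backbone that exposes a `cls_token` parameter; the model name and setup are illustrative, not the repo's actual code.

```python
import timm  # assumption: any ViT-style SSL backbone with a `cls_token` parameter works

# Illustrative: a DINOv2-style ViT as the frozen self-supervised encoder
encoder = timm.create_model("vit_base_patch14_dinov2", pretrained=True)

# Freeze everything, then unfreeze only the [CLS] token embedding
for name, p in encoder.named_parameters():
    p.requires_grad = "cls_token" in name

trainable = [n for n, p in encoder.named_parameters() if p.requires_grad]
print(trainable)  # -> ['cls_token']: the single learnable entry point for low-level info
# The generative decoder is trained on top of this single token (not shown here).
```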
VAEs, diffusion AEs, and tokenizers use a large number of latent tokens / patches.
But images often share structure that could be represented compactly!
📌 Poster Session 6, Sunday 4:00 to 6:00 PM, Poster #208