This shows that the single-token latent preserves structured continuity - not just abstract semantics.
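To make "structured continuity" concrete, here is a minimal sketch (my own illustration; `encoder` and `decoder` are placeholders, not RepTok's actual API): walk a straight line between the single-token latents of two images and decode each point - a smooth latent space gives a smooth morph.

```python
# Illustrative sketch only: `encoder` / `decoder` are placeholders,
# not RepTok's actual modules or API.
import torch

@torch.no_grad()
def interpolate(encoder, decoder, img_a, img_b, steps=8):
    """Interpolate between the single-token latents of two images and
    decode every intermediate point."""
    z_a = encoder(img_a)  # one continuous token per image, e.g. shape (1, d)
    z_b = encoder(img_b)
    return [decoder((1 - t) * z_a + t * z_b) for t in torch.linspace(0, 1, steps)]
```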
Training: <20 h on 4×A100 GPUs.
Since there’s only one token, token-to-token attention isn’t needed - drastically reducing compute.
Training cost drops by >90% vs transformer-based diffusion while maintaining a competitive FID on ImageNet.
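Rough sketch of why (my own illustration, not the paper's code): with one latent token, the generator can be a time-conditioned MLP over a single vector, so there is nothing for token-to-token self-attention to attend over.

```python
# Illustrative sketch, not the paper's architecture: a denoiser over a
# single continuous token is just a time-conditioned MLP on one vector,
# with no token-to-token self-attention at all.
import torch
import torch.nn as nn

class SingleTokenDenoiser(nn.Module):
    def __init__(self, dim: int = 768, hidden: int = 2048):
        super().__init__()
        self.time_embed = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
        self.net = nn.Sequential(
            nn.Linear(2 * dim, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, z_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # z_t: (B, dim) noisy latent token, t: (B,) time in [0, 1]
        return self.net(torch.cat([z_t, self.time_embed(t[:, None])], dim=-1))

# Per-sample cost scales with `dim` and `hidden` only, independent of image
# resolution, versus quadratic attention over N patch tokens in a standard DiT.
```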
📉 rFID = 1.85 on ImageNet-256
📈 PSNR = 14.9
That’s better than or comparable to multi-token methods like TiTok or FlexTok - with a single continuous token.
We introduce an additional loss term to keep the tuned [CLS] token close to the original representation!
❗️This keeps it semantically structured yet reconstruction-aware.
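In pseudocode, the idea looks roughly like this (the exact distance, weight, and reconstruction loss are assumptions, not the paper's choices):

```python
# Sketch of the idea; the paper's exact formulation may differ.
import torch.nn.functional as F

def regularized_reconstruction_loss(decoded, target, cls_tuned, cls_frozen, reg_weight=0.1):
    rec = F.mse_loss(decoded, target)                        # reconstruction term
    stay_close = F.mse_loss(cls_tuned, cls_frozen.detach())  # keep [CLS] near its SSL origin
    return rec + reg_weight * stay_close
```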
We start from a frozen self-supervised encoder (DINOv2, MAE, or CLIP) and combine it with a generative decoder.
Then we fine-tune only the [CLS] token embedding - injecting low-level info while keeping the rest frozen.
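A plain-PyTorch sketch of that setup (generic ViT-style encoder; the `cls_token` attribute and optimizer settings are assumptions, not RepTok's actual training code):

```python
# Sketch with a generic ViT-style encoder: `cls_token` and the optimizer
# hyperparameters are placeholders, not the paper's actual code.
import torch
import torch.nn as nn

def prepare_single_token_finetuning(encoder: nn.Module, decoder: nn.Module):
    # Freeze the entire self-supervised encoder (DINOv2 / MAE / CLIP)...
    for p in encoder.parameters():
        p.requires_grad = False
    # ...then re-enable gradients only for the learnable [CLS] token
    # embedding (assumed here to live at `encoder.cls_token`), which is
    # free to absorb the low-level detail needed for reconstruction.
    encoder.cls_token.requires_grad = True
    # The generative decoder is trained as usual.
    trainable = [encoder.cls_token, *decoder.parameters()]
    return torch.optim.AdamW(trainable, lr=1e-4)
```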
💡 It works if we leverage a self-supervised representation!
Meet RepTok🦎: A generative model that encodes an image into a single continuous latent while keeping realism and semantics. 🧵 👇
🌊 Diff2Flow: Training Flow Matching Models via Diffusion Model Alignment