Paper: arxiv.org/abs/2504.16064
Code: github.com/zelaki/ReDi
~23x faster convergence than baseline DiT/SiT.
~6x faster than REPA.🚀
- Merged Tokens (MR): Efficient, keeps token count constant
- Separate Tokens (SP): More expressive, ~2x compute
Both boost performance, but MR hits the sweet spot for speed vs. quality.
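For intuition, here is a rough PyTorch sketch of how the two fusion variants differ. The shapes and the projection layer are made up for illustration, not taken from the repo:

```python
import torch
import torch.nn as nn

# Hypothetical shapes: N image-latent tokens and N semantic-feature tokens
# (e.g. DINOv2 patch features), both already projected to model width d.
N, d = 256, 768
img_tokens = torch.randn(1, N, d)   # noised VAE-latent tokens
sem_tokens = torch.randn(1, N, d)   # noised semantic-feature tokens

# Merged tokens (MR): fuse channel-wise, so the sequence length stays N
# and the transformer cost is unchanged.
merge_proj = nn.Linear(2 * d, d)
mr_seq = merge_proj(torch.cat([img_tokens, sem_tokens], dim=-1))  # (1, N, d)

# Separate tokens (SP): concatenate along the sequence axis, giving 2N
# tokens -- more expressive, but roughly 2x the attention compute.
sp_seq = torch.cat([img_tokens, sem_tokens], dim=1)               # (1, 2N, d)
```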
- Apply noise to both image latents and semantic features
- Fuse them into one token sequence
- Denoise both with standard DiT/SiT
That’s it.
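To make the recipe concrete, here is a minimal sketch of one training step. The noise schedule handling, the sequence-wise fusion (SP-style), the output split, and the plain noise-prediction loss are simplifying assumptions, not the repo's actual code:

```python
import torch
import torch.nn.functional as F

def redi_training_step(model, alphas_bar, vae_latents, sem_features, t):
    """Sketch of one joint-denoising step (not the repo's implementation).

    Assumes `model` is an unmodified DiT/SiT backbone that returns a
    prediction per input token, `alphas_bar` is a cumulative noise
    schedule, and a simple noise-prediction objective stands in for the
    paper's actual loss.
    """
    a = alphas_bar[t].view(-1, 1, 1)            # per-sample noise level

    # 1) Apply forward-diffusion noise to both image latents and features.
    eps_x = torch.randn_like(vae_latents)
    eps_z = torch.randn_like(sem_features)
    noisy_x = a.sqrt() * vae_latents + (1 - a).sqrt() * eps_x
    noisy_z = a.sqrt() * sem_features + (1 - a).sqrt() * eps_z

    # 2) Fuse the two modalities into one token sequence
    #    (sequence-wise concat here, i.e. SP; MR would merge channels instead).
    tokens = torch.cat([noisy_x, noisy_z], dim=1)

    # 3) Denoise both streams with the standard backbone and supervise each.
    pred_x, pred_z = model(tokens, t).chunk(2, dim=1)
    return F.mse_loss(pred_x, eps_x) + F.mse_loss(pred_z, eps_z)
```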
🔗 A powerful new method for generative image modeling that bridges generation and representation learning.
⚡️Brings massive gains in performance and training efficiency, and introduces a new paradigm for representation-aware generative modeling.
Joint work with @ikakogeorgiou.bsky.social, @spyrosgidaris.bsky.social and Nikos Komodakis
Paper: arxiv.org/abs/2502.09509
Code: github.com/zelaki/eqvae
HuggingFace Model: huggingface.co/zelaki/eq-va...
We trained DiT-B/2 on the resulting latents at each fine-tuning epoch. Even after just a few epochs, gFID drops significantly—showing how quickly EQ-VAE improves the latent space.
We find a strong correlation between latent space complexity and generative performance.
🔹 EQ-VAE reduces the intrinsic dimension (ID) of the latent manifold.
🔹 This makes the latent space simpler and easier to model.
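For context, intrinsic dimension can be estimated directly from a batch of flattened latents. The TwoNN estimator below is one standard choice, shown purely for illustration; it is not necessarily the estimator used in the paper:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def two_nn_intrinsic_dim(latents):
    """Intrinsic-dimension estimate via TwoNN (Facco et al., 2017).

    `latents` is an (n_samples, n_features) array of flattened latent
    vectors sampled from the autoencoder.
    """
    # Distances to the two nearest neighbours (column 0 is the point itself).
    dists, _ = NearestNeighbors(n_neighbors=3).fit(latents).kneighbors(latents)
    mu = dists[:, 2] / dists[:, 1]        # ratio of 2nd to 1st NN distance
    return len(mu) / np.log(mu).sum()     # maximum-likelihood estimate
```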
✅ DiT-XL/2: gFID drops from 19.5 → 14.5 at 400K iterations
✅ REPA: Training time 4M → 1M iterations (4× speedup)
✅ MaskGIT: Training time 300 → 130 epochs (2× speedup)
✅ Continuous autoencoders (SD-VAE, SDXL-VAE, SD3-VAE)
✅ Discrete autoencoders (VQ-GAN)
👉 It aligns reconstructions of transformed latents with the corresponding transformed inputs.
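In pseudocode, the regularization looks roughly like this. This is a minimal sketch assuming a VAE that exposes `encode`/`decode`, using downscaling as one example transformation; see the paper for the exact objective and transformation set:

```python
import torch
import torch.nn.functional as F

def equivariance_loss(vae, x, scale=0.5):
    """Sketch of the alignment term, not the paper's exact objective.

    Uses spatial downscaling as one example transformation applied to
    both the latent and the input image.
    """
    z = vae.encode(x)                                             # clean latent
    z_t = F.interpolate(z, scale_factor=scale, mode="bilinear")   # transform the latent
    x_t = F.interpolate(x, scale_factor=scale, mode="bilinear")   # same transform on the input
    # Reconstruction of the transformed latent should match the transformed input.
    return F.mse_loss(vae.decode(z_t), x_t)
```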
SOTA autoencoders reconstruct images well but fail to maintain equivariance in latent space.
✅ If you scale an input image, its reconstruction is fine
❌ But if you scale the latent representation directly, the reconstruction degrades significantly.
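A quick way to see this gap yourself (a sketch assuming a VAE exposing `encode`/`decode`; both reconstructions are scored against the downscaled input):

```python
import torch.nn.functional as F

def equivariance_gap(vae, x, scale=0.5):
    """Compare the two paths described above.

    (a) encode/decode the downscaled image: usually fine.
    (b) encode the full image, downscale the latent, then decode: degrades.
    """
    x_s = F.interpolate(x, scale_factor=scale, mode="bilinear")
    recon_a = vae.decode(vae.encode(x_s))                                    # path (a)
    z_s = F.interpolate(vae.encode(x), scale_factor=scale, mode="bilinear")
    recon_b = vae.decode(z_s)                                                # path (b)
    return F.mse_loss(recon_a, x_s), F.mse_loss(recon_b, x_s)
```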
✅ 7× faster training convergence on DiT-XL/2
✅ 4× faster training on REPA
🔹Smoother latent space = easier to model & better generative performance.
🔹No trade-off in reconstruction quality—rFID improves too!
🔹Works as a plug-and-play enhancement—no architectural changes needed!