Samuel Lavoie
@lavoiems.bsky.social
PhD candidate @Mila_quebec, @UMontreal. Ex: FAIR @AIatMeta.
Learning representations, minimizing free energy, running.
This work wouldn’t exist without my amazing co-authors:
@mnoukhov.bsky.social & @AaronCourville🙏
July 22, 2025 at 2:41 PM
Code & Models are open source:
💾 github.com/lavoiems/Dis...
📜 https://arxiv.org/pdf/2507.12318

Reproduce, remix, build your own DLC-powered models.
July 22, 2025 at 2:41 PM
Example: There are no “teapots on mountains” in ImageNet.

We verify this via nearest-neighbor search in DINOv2 feature space.
But our model can still create them by composing concepts it learned separately.
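For concreteness, a hedged sketch of that check (the DINOv2 variant, preprocessing, and similarity metric are my assumptions; the paper's exact protocol may differ): embed a generated image and a bank of ImageNet images, then look up the closest neighbor by cosine similarity.

```python
# Hedged sketch of a nearest-neighbor check in DINOv2 feature space.
# Model variant, preprocessing, and metric are assumptions, not the paper's
# exact protocol. Downloads pretrained weights on first run.
import torch
import torch.nn.functional as F

dino = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()

@torch.no_grad()
def embed(images: torch.Tensor) -> torch.Tensor:
    """images: (N, 3, 224, 224); returns unit-norm DINOv2 features."""
    return F.normalize(dino(images), dim=-1)

query = embed(torch.randn(1, 3, 224, 224))      # stand-in for a generated image
bank = embed(torch.randn(16, 3, 224, 224))      # stand-in for dataset images
nearest = (bank @ query.T).squeeze(-1).argmax() # index of the closest neighbor
```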
July 22, 2025 at 2:41 PM
LLMs can speak in DLC!

We fine-tune a language model to sample DLC tokens from text, giving us a pipeline:
Text → DLC → Image
This also enables generation beyond ImageNet.
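A minimal sketch of that data flow, with stand-ins where the fine-tuned LLM and the DLC-conditioned decoder would go (function names, the DLC shape K×V, and the image size are illustrative assumptions, not the repo's API):

```python
# Sketch of the Text → DLC → Image pipeline with stand-in components.
# Only the wiring is meant to be accurate; names and shapes are assumptions.
import torch

K, V = 32, 4096  # illustrative: K DLC tokens from a vocabulary of size V

def text_to_dlc(prompt: str) -> torch.Tensor:
    """Stand-in for the fine-tuned language model that samples DLC tokens from text."""
    return torch.randint(V, (K,))

def dlc_to_image(dlc: torch.Tensor) -> torch.Tensor:
    """Stand-in for the DLC-conditioned diffusion decoder."""
    return torch.randn(3, 256, 256)

image = dlc_to_image(text_to_dlc("a teapot on a mountain"))
```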
July 22, 2025 at 2:41 PM
DLCs are compositional.
Swap tokens between two images (🐕 Komondor + 🍝 Carbonara) → the model produces coherent hybrids never seen during training.
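Roughly what the swap looks like in code (illustrative only; the DLC shape and the decoding step are assumptions):

```python
# Illustrative token swap between two DLCs; shapes are assumptions.
import torch

K, V = 32, 4096
dlc_dog = torch.randint(V, (K,))      # e.g., DLC of the Komondor image
dlc_pasta = torch.randint(V, (K,))    # e.g., DLC of the carbonara image

hybrid = dlc_dog.clone()
swap = torch.rand(K) < 0.5            # choose roughly half of the token positions
hybrid[swap] = dlc_pasta[swap]        # splice those tokens in from the other image
# Decoding `hybrid` with the DLC-conditioned diffusion model yields the composition.
```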
July 22, 2025 at 2:41 PM
🚀 Results:

DiT-XL/2 + DLC → FID 1.59 on unconditional ImageNet

Works well with and without classifier-free guidance

Learns faster and performs better than prior works that use pre-trained encoders

🤯
July 22, 2025 at 2:41 PM
Unconditional generation pipeline:
1. Sample a DLC (e.g., with SEDD)
2. Decode it into an image (e.g., with DiT)

This ancestral sampling approach is simple but powerful.
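In pseudocode, with stand-ins where SEDD and DiT would go (only the two-step wiring is meant to be faithful; shapes are assumptions):

```python
# Ancestral sampling: c ~ p(c) with a discrete prior, then x ~ p(x | c) with a
# conditional image model. Both samplers below are stand-ins, not SEDD or DiT.
import torch

K, V = 32, 4096  # illustrative DLC shape

def sample_dlc(batch: int) -> torch.Tensor:
    """Stand-in for the discrete prior (SEDD in the paper)."""
    return torch.randint(V, (batch, K))

def decode_dlc(dlc: torch.Tensor) -> torch.Tensor:
    """Stand-in for the DLC-conditioned diffusion decoder (DiT in the paper)."""
    return torch.randn(dlc.shape[0], 3, 256, 256)

c = sample_dlc(4)      # step 1: sample discrete latent codes
x = decode_dlc(c)      # step 2: decode each code into an image
```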
July 22, 2025 at 2:41 PM
DLCs enable exactly this.
Images → sequences of discrete tokens via a Simplicial Embedding (SEM) encoder

We take the argmax over token distributions → get the DLC sequence

Think of it as “tokenizing” images—like words for LLMs.
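A toy sketch of that read-out (shapes and names are illustrative, not the actual SEM encoder): split the encoder features into K groups of V logits, softmax each group onto the simplex, then take the per-group argmax as the DLC.

```python
# Toy read-out of a DLC from a Simplicial Embedding; shapes are illustrative.
import torch

K, V = 32, 4096                      # number of tokens / per-token vocabulary
feat = torch.randn(1, K * V)         # stand-in for image encoder features

logits = feat.view(1, K, V)          # K groups of V logits
sem = torch.softmax(logits, dim=-1)  # simplicial embedding: K points on the simplex
dlc = sem.argmax(dim=-1)             # discrete latent code: K token ids, shape (1, K)
```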
July 22, 2025 at 2:41 PM
Text models don’t have this problem! LLMs can model internet-scale corpora.

So… can we improve image generation of highly multimodal distributions by decomposing it into:

1. Generating discrete tokens - p(c)
2. Decoding tokens into images - p(x|c)
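In symbols (my notation, not necessarily the paper's), this is the usual latent-variable factorization, sampled ancestrally: draw the code first, then the image.

```latex
p(x) \;=\; \sum_{c} p_\phi(c)\, p_\theta(x \mid c),
\qquad c = (c_1, \dots, c_K), \; c_k \in \{1, \dots, V\}
```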
July 22, 2025 at 2:41 PM
Modeling highly multimodal distributions in continuous space is hard.
Even a simple 2D Gaussian mixture with a large number of modes may be tricky to model directly. Good conditioning solves this!
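Toy illustration of the claim (my example, not from the paper): a 2D mixture with many modes is hard to fit as one unconditional density, but conditioned on the mode index each component is a single Gaussian.

```python
# Toy 2D Gaussian mixture: p(x) has many modes, p(x | c) is a single Gaussian.
import numpy as np

rng = np.random.default_rng(0)
n_modes, dim, sigma = 64, 2, 0.05
means = rng.uniform(-1, 1, size=(n_modes, dim))   # one mean per mode

def sample(n: int, c: int | None = None) -> np.ndarray:
    """Draw n samples; if c is given, sample only from that mode (conditional)."""
    idx = rng.integers(n_modes, size=n) if c is None else np.full(n, c)
    return means[idx] + sigma * rng.normal(size=(n, dim))

x_uncond = sample(1000)      # highly multimodal: hard to model directly
x_cond = sample(1000, c=3)   # unimodal: easy to model
```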

Could this be why large image generative models are almost always conditional? 🤔
July 22, 2025 at 2:41 PM