@mnoukhov.bsky.social & @AaronCourville🙏
💾 github.com/lavoiems/Dis...
📜 https://arxiv.org/pdf/2507.12318
Reproduce, remix, build your own DLC-powered models.
These hybrid images never appear in the training set; we verify this with a nearest-neighbor search in DINOv2 space.
But our model can still create them, by composing concepts it learned separately.
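A minimal sketch of that check, assuming preprocessed batches `train_images` (the reference set) and `generated` (samples to verify) are already loaded; the DINOv2 backbone is pulled from torch.hub as a frozen feature extractor:

```python
import torch

# DINOv2 ViT-B/14 from torch.hub, used only to embed images.
dinov2 = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")
dinov2.eval()

@torch.no_grad()
def embed(x):
    # x: (N, 3, 224, 224) preprocessed images -> (N, D) unit-norm CLS features.
    feats = dinov2(x)
    return torch.nn.functional.normalize(feats, dim=-1)

bank = embed(train_images)                 # embeddings of training images
query = embed(generated)                   # embeddings of generated samples
sims = query @ bank.T                      # cosine similarity (both sides unit-norm)
nearest = sims.topk(k=5, dim=-1).indices   # 5 closest training images per sample
# Visually inspect train_images[nearest]: if no neighbor resembles the generated
# hybrid, the sample was composed rather than copied from the training set.
```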
We fine-tune a language model to sample DLC tokens from text, giving us a pipeline:
Text → DLC → Image
This also enables generation beyond ImageNet.
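Roughly, the pipeline looks like the sketch below; `lm.sample_dlc` and `dit_decoder.sample` are hypothetical handles for the fine-tuned language model and the DLC-conditioned image decoder, not a released API:

```python
# Text -> DLC -> Image, with placeholder model objects and illustrative method names.
def generate_from_text(prompt: str, lm, dit_decoder, dlc_len: int = 32):
    # 1) The fine-tuned LM samples a DLC token sequence for the prompt.
    dlc_tokens = lm.sample_dlc(prompt, length=dlc_len)   # sequence of discrete codes
    # 2) The diffusion decoder generates an image conditioned on that code.
    return dit_decoder.sample(condition=dlc_tokens)
```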
Swap tokens between two images (🐕 Komondor + 🍝 Carbonara) → the model produces coherent hybrids never seen during training.
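A sketch of that token-swapping experiment, with placeholder handles `encode_dlc` (the SEM encoder) and `decode` (the DLC-conditioned decoder); a random position mask is one simple way to mix the two codes:

```python
import torch

def hybrid(img_a, img_b, encode_dlc, decode, frac_from_b=0.5):
    dlc_a = encode_dlc(img_a)                        # (L,) tokens of image A
    dlc_b = encode_dlc(img_b)                        # (L,) tokens of image B
    mask = torch.rand(dlc_a.shape[0]) < frac_from_b  # positions to take from image B
    mixed = torch.where(mask, dlc_b, dlc_a)          # swapped DLC sequence
    return decode(mixed)                             # decode the hybrid code into an image
```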
DiT-XL/2 + DLC → FID 1.59 on unconditional ImageNet
Works well with and without classifier-free guidance
Learns faster and performs better than prior works that use pre-trained encoders
🤯
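For reference, classifier-free guidance with a DLC condition is the usual extrapolation from the unconditional prediction toward the conditional one; the epsilon-parameterized `model` and the `null_dlc` drop-token in this sketch are assumptions, not the paper's exact interface:

```python
def guided_eps(model, x_t, t, dlc, null_dlc, scale=1.5):
    eps_cond = model(x_t, t, cond=dlc)         # prediction conditioned on the image's DLC
    eps_uncond = model(x_t, t, cond=null_dlc)  # unconditional prediction (conditioning dropped)
    return eps_uncond + scale * (eps_cond - eps_uncond)  # guided noise estimate
```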
To generate an image:
1. Sample a DLC (e.g., with SEDD)
2. Decode it into an image (e.g., with DiT)
This ancestral sampling approach is simple but powerful.
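In sketch form, with `sedd_prior` and `dit_decoder` as placeholders for the trained SEDD and DiT models:

```python
def sample_image(sedd_prior, dit_decoder):
    dlc = sedd_prior.sample()                 # c ~ p(c): draw a discrete latent code
    return dit_decoder.sample(condition=dlc)  # x ~ p(x | c): decode it into an image
```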
Images → sequences of discrete tokens via a Simplicial Embedding (SEM) encoder
We take the argmax over token distributions → get the DLC sequence
Think of it as “tokenizing” images—like words for LLMs.
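A sketch of that encoding step, with `sem_encoder` as a placeholder for the trained SEM encoder and illustrative shapes (L token positions, each a distribution over a V-way vocabulary):

```python
import torch

@torch.no_grad()
def image_to_dlc(image, sem_encoder):
    logits = sem_encoder(image)        # (L, V): one V-way distribution per token position
    probs = logits.softmax(dim=-1)     # simplicial embedding: a point on each of L simplices
    return probs.argmax(dim=-1)        # (L,) DLC: the most likely symbol at each position
```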
So… can we improve image generation for highly multi-modal distributions by decomposing it into:
1. Generating discrete tokens - p(c)
2. Decoding tokens into images - p(x|c)
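That factorization is just p(x) = Σ_c p(c)·p(x|c), sampled ancestrally (c first, then x). A toy numeric version, with made-up values:

```python
import torch

p_c = torch.tensor([0.2, 0.5, 0.3])      # prior over discrete codes, p(c)
means = torch.tensor([-4.0, 0.0, 4.0])   # each code c indexes a simple conditional p(x|c)

def sample():
    c = torch.multinomial(p_c, 1).item() # c ~ p(c)
    x = means[c] + torch.randn(())       # x ~ p(x|c) = N(mean_c, 1)
    return c, x
```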
Even a simple 2D Gaussian mixture with a large number of modes may be tricky to model directly. Good conditioning solves this!
Could this be why large image generative models are almost always conditional? 🤔
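A toy version of that argument, with illustrative values: a 2D mixture over a 5×5 grid of modes has a complicated marginal, but each conditional p(x|c) is a single Gaussian, which is trivial to model:

```python
import torch

# 5x5 grid of mode centers, spaced far apart so the marginal has 25 clear modes.
centers = 4.0 * torch.cartesian_prod(torch.arange(5.0), torch.arange(5.0))  # (25, 2)

def sample(n=1000):
    c = torch.randint(0, 25, (n,))            # c ~ p(c): uniform over the 25 modes
    x = centers[c] + 0.3 * torch.randn(n, 2)  # x ~ p(x|c): one Gaussian around center c
    return c, x

# A model that is given c only has to place mass around one center;
# a model of p(x) alone has to carve out all 25 modes at once.
```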