Oussama Zekri
@ozekri.bsky.social
ENS Saclay maths dpt + UW Research Intern.

Website : https://oussamazekri.fr
Blog : https://logb-research.github.io/
We fine-tuned a discrete diffusion model to respond to user prompts. In just 7k iterations (GPU poverty is real, haha), it outperforms the vanilla model ~75% of the time! 🚀
February 4, 2025 at 3:42 PM
Building on this, we can correct the gradient direction to better **follow the flow**, using the implicit function theorem (cf. @mblondel.bsky.social et al., arxiv.org/abs/2105.15183) ✨

The cool part? We only need to solve a linear system whose matrix has a closed-form inverse! 🔥
February 4, 2025 at 3:42 PM
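For intuition, here is the generic implicit-differentiation identity behind this kind of correction (a sketch only; the exact fixed-point map and linear system used in the paper may differ). If the sampler's limiting distribution $\pi^\star(\theta)$ is characterized by a stationarity condition $F(\pi^\star(\theta), \theta) = 0$, the implicit function theorem gives

$$
\frac{\partial \pi^\star}{\partial \theta}
\;=\;
-\left(\frac{\partial F}{\partial \pi}\right)^{-1}\frac{\partial F}{\partial \theta},
$$

so correcting the gradient of a reward evaluated at $\pi^\star(\theta)$ boils down to solving one linear system in $\partial F / \partial \pi$.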
Inspired by Implicit Diffusion (@pierremarion.bsky.social @akorba.bsky.social @qberthet.bsky.social🤓, arxiv.org/abs/2402.05468), we sample using a specific CTMC, reaching the limiting distribution in an infinite time horizon. This effectively implements a gradient flow w.r.t. a Wasserstein metric!🔥
February 4, 2025 at 3:42 PM
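A toy illustration of the sampling side (my own sketch, not the paper's actual sampler): simulate a small CTMC with rate matrix `Q` and check that the fraction of time spent in each state approaches the stationary distribution solving $\pi Q = 0$.

```python
import numpy as np

# Toy sketch, not the paper's sampler: simulate a 3-state CTMC with rate
# matrix Q and check that occupation times approach the stationary
# distribution pi solving pi @ Q = 0.
rng = np.random.default_rng(0)
Q = np.array([[-1.0,  0.6,  0.4],
              [ 0.3, -0.8,  0.5],
              [ 0.2,  0.7, -0.9]])  # each row sums to 0

def simulate_ctmc(Q, T=5_000.0, x0=0):
    """Gillespie-style simulation; returns the fraction of time in each state."""
    d = Q.shape[0]
    occupancy = np.zeros(d)
    x, t = x0, 0.0
    while t < T:
        rate = -Q[x, x]                      # total exit rate from state x
        hold = rng.exponential(1.0 / rate)   # exponential holding time
        occupancy[x] += min(hold, T - t)
        t += hold
        jump_probs = Q[x].clip(min=0.0) / rate
        x = rng.choice(d, p=jump_probs)
    return occupancy / occupancy.sum()

empirical = simulate_ctmc(Q)
# Stationary distribution: solve pi @ Q = 0 with sum(pi) = 1 (least squares).
A = np.vstack([Q.T, np.ones(3)])
pi = np.linalg.lstsq(A, np.array([0.0, 0.0, 0.0, 1.0]), rcond=None)[0]
print(empirical)  # close to pi for large T
print(pi)
```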
SEPO, like most policy optimization algorithms, alternates between sampling and optimization. But what if sampling itself were seen as an optimization procedure in distribution space? 🚀
February 4, 2025 at 3:42 PM
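For context, that alternation usually looks something like the skeleton below (hypothetical names, not SEPO's actual API): sample from the current model, score the samples, take a gradient step, repeat.

```python
# Generic policy-optimization skeleton (hypothetical names, not SEPO's API).
def finetune(model, reward_fn, optimizer, n_iters=1_000, batch_size=16):
    for _ in range(n_iters):
        samples = model.sample(batch_size)             # sampling phase
        rewards = [reward_fn(s) for s in samples]      # rewards are only evaluated
        loss = model.surrogate_loss(samples, rewards)  # optimization phase
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return model
```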
If you have a discrete diffusion model (naturally designed for discrete data, e.g. language or DNA sequence modeling), you can finetune it with non-differentiable reward functions! 🎯

For example, this enables RLHF for discrete diffusion models, making alignment more flexible and powerful. ✅
February 4, 2025 at 3:42 PM
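To make "non-differentiable reward" concrete, here are two hypothetical rewards (my own illustration, not examples from the paper). They only ever need to be evaluated on sampled sequences, since the gradient flows through the model's scores rather than through the reward.

```python
# Hypothetical non-differentiable rewards (illustration only). They are
# evaluated on samples, never backpropagated through.
def dna_reward(seq: str, target_gc: float = 0.5) -> float:
    """Reward a DNA sequence for hitting a target GC content (hard threshold)."""
    gc = sum(base in "GC" for base in seq) / max(len(seq), 1)
    return 1.0 if abs(gc - target_gc) < 0.05 else 0.0

def format_reward(text: str) -> float:
    """Reward a text completion for ending with a question mark."""
    return float(text.strip().endswith("?"))
```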
The main gradient takes the form of a weighted log concrete score, echoing DeepSeek’s unified paradigm with the weighted log policy!🔥

From this, we can reconstruct any policy gradient method for discrete diffusion models (e.g. PPO, GRPO, etc.). 🚀
February 4, 2025 at 3:42 PM
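Schematically (my paraphrase, not the exact expression from the paper), such a gradient is a score-function estimator of the form

$$
\nabla_\theta J(\theta) \;\approx\; \mathbb{E}\big[\, w \,\nabla_\theta \log s_\theta \,\big],
$$

where $s_\theta$ is the concrete score and $w$ is a reward-derived weight. Picking $w$ to be a clipped or group-normalized advantage would give PPO- or GRPO-flavoured updates, mirroring the autoregressive case where $\log s_\theta$ is replaced by the log policy.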
The main bottleneck of Energy-Based Models is computing the normalizing constant Z.

Instead, recent discrete diffusion models skip Z by learning ratios of probabilities. This forms the concrete score, which a neural network models efficiently!⚡

The challenge? Using this score network as a policy.
February 4, 2025 at 3:42 PM
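A one-line illustration of why ratios dodge $Z$: for an energy-based model $p_\theta(x) = e^{-E_\theta(x)}/Z_\theta$,

$$
\frac{p_\theta(y)}{p_\theta(x)}
\;=\;
\frac{e^{-E_\theta(y)}/Z_\theta}{e^{-E_\theta(x)}/Z_\theta}
\;=\;
e^{\,E_\theta(x)-E_\theta(y)},
$$

so the normalizing constant cancels, and the network only has to model these ratios (the concrete score) for sequences $y$ close to $x$.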
🚀 Policy gradient methods like DeepSeek’s GRPO are great for finetuning LLMs via RLHF.

But what happens when we swap autoregressive generation for discrete diffusion, a rising architecture promising faster & more controllable LLMs?

Introducing SEPO!

📑 arxiv.org/pdf/2502.01384

🧵👇
February 4, 2025 at 3:42 PM
💡 For a Markov chain with d states, the LLM-based method achieves an error rate of O(log(d)/N).

The frequentist approach, which is minimax optimal, achieves O(d/N) (see Wolfer et al., 2019, arxiv.org/pdf/1902.00080).

This makes it particularly efficient for Markov chains with a large number of states! 🌟
November 26, 2024 at 2:52 PM
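Ignoring constants, the ratio between the two bounds above is log(d)/d, which shrinks quickly as the state space grows; a quick back-of-the-envelope check:

```python
import math

# Ratio of the two rates quoted above, (log(d)/N) / (d/N) = log(d)/d,
# ignoring constants: the larger the state space, the bigger the gap.
for d in (10, 1_000, 100_000):
    print(f"d={d:>6}  log(d)/d = {math.log(d) / d:.2e}")
```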
‼️What’s even better is that you can derive bounds on the estimation error based on the number of samples N provided and specific properties of the Markov chain.

Tested and validated on recent LLMs!
November 26, 2024 at 2:52 PM
🚀 Did you know you can use the in-context learning abilities of an LLM to estimate the transition probabilities of a Markov chain?

The results are pretty exciting! 😄
November 26, 2024 at 2:52 PM
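A sketch of the setup as I understand it: the frequentist baseline below is real code, while the LLM part is only indicated with a hypothetical helper, since the actual prompting protocol is described in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Ground-truth 3-state chain and a trajectory sampled from it.
P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.6, 0.3],
              [0.2, 0.3, 0.5]])
states = [0]
for _ in range(2_000):
    states.append(rng.choice(3, p=P[states[-1]]))

# Frequentist baseline: normalized transition counts.
counts = np.zeros((3, 3))
for a, b in zip(states[:-1], states[1:]):
    counts[a, b] += 1
P_freq = counts / counts.sum(axis=1, keepdims=True)

# LLM-based idea (sketch only): serialize the trajectory as a prompt,
# e.g. "0 2 1 1 0 ...", and read the model's next-token distribution
# after each state symbol as an estimate of that state's transition row.
# `llm_next_token_probs` is a hypothetical helper, not a real API:
# P_icl[a] = llm_next_token_probs(prompt_ending_in_state=a)

print(np.abs(P_freq - P).max())  # frequentist estimation error
```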