Oussama Zekri
@ozekri.bsky.social
ENS Saclay maths dpt + UW Research Intern.

Website : https://oussamazekri.fr
Blog : https://logb-research.github.io/
We fine-tuned a discrete diffusion model to respond to user prompts. In just 7k iterations (GPU poverty is real, haha), it outperforms the vanilla model ~75% of the time! 🚀
February 4, 2025 at 3:42 PM
Building on this, we can correct the gradient direction to better **follow the flow**, using the implicit function theorem (cf. @mblondel.bsky.social et al., arxiv.org/abs/2105.15183) ✨

The cool part? We only need to solve a linear system whose matrix has a closed-form inverse! 🔥
February 4, 2025 at 3:42 PM
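For intuition, here is the generic implicit-differentiation identity behind this kind of correction (a sketch only; the exact fixed-point map and linear system used in the paper may differ). If the sampler's limiting distribution $\pi^\star(\theta)$ is characterized by a stationarity condition $F(\pi^\star(\theta), \theta) = 0$, the implicit function theorem gives

$$
\frac{\partial \pi^\star}{\partial \theta}
\;=\;
-\left(\frac{\partial F}{\partial \pi}\right)^{-1}\frac{\partial F}{\partial \theta},
$$

so correcting the gradient of a reward evaluated at $\pi^\star(\theta)$ boils down to solving one linear system in $\partial F / \partial \pi$.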
Inspired by Implicit Diffusion (@pierremarion.bsky.social @akorba.bsky.social @qberthet.bsky.social🤓, arxiv.org/abs/2402.05468), we sample using a specific CTMC, reaching the limiting distribution in an infinite time horizon. This effectively implements a gradient flow w.r.t. a Wasserstein metric!🔥
February 4, 2025 at 3:42 PM
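A toy illustration of the sampling side (my own sketch, not the paper's actual sampler): simulate a small CTMC with rate matrix `Q` and check that the fraction of time spent in each state approaches the stationary distribution solving $\pi Q = 0$.

```python
import numpy as np

# Toy sketch, not the paper's sampler: simulate a 3-state CTMC with rate
# matrix Q and check that occupation times approach the stationary
# distribution pi solving pi @ Q = 0.
rng = np.random.default_rng(0)
Q = np.array([[-1.0,  0.6,  0.4],
              [ 0.3, -0.8,  0.5],
              [ 0.2,  0.7, -0.9]])  # each row sums to 0

def simulate_ctmc(Q, T=5_000.0, x0=0):
    """Gillespie-style simulation; returns the fraction of time in each state."""
    d = Q.shape[0]
    occupancy = np.zeros(d)
    x, t = x0, 0.0
    while t < T:
        rate = -Q[x, x]                      # total exit rate from state x
        hold = rng.exponential(1.0 / rate)   # exponential holding time
        occupancy[x] += min(hold, T - t)
        t += hold
        jump_probs = Q[x].clip(min=0.0) / rate
        x = rng.choice(d, p=jump_probs)
    return occupancy / occupancy.sum()

empirical = simulate_ctmc(Q)
# Stationary distribution: solve pi @ Q = 0 with sum(pi) = 1 (least squares).
A = np.vstack([Q.T, np.ones(3)])
pi = np.linalg.lstsq(A, np.array([0.0, 0.0, 0.0, 1.0]), rcond=None)[0]
print(empirical)  # close to pi for large T
print(pi)
```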
SEPO, like most policy optimization algorithms, alternates between sampling and optimization. But what if sampling itself were seen as an optimization procedure in distribution space? 🚀
February 4, 2025 at 3:42 PM
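For context, that alternation usually looks something like the skeleton below (hypothetical names, not SEPO's actual API): sample from the current model, score the samples, take a gradient step, repeat.

```python
# Generic policy-optimization skeleton (hypothetical names, not SEPO's API).
def finetune(model, reward_fn, optimizer, n_iters=1_000, batch_size=16):
    for _ in range(n_iters):
        samples = model.sample(batch_size)             # sampling phase
        rewards = [reward_fn(s) for s in samples]      # rewards are only evaluated
        loss = model.surrogate_loss(samples, rewards)  # optimization phase
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return model
```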
If you have a discrete diffusion model (naturally designed for discrete data, e.g. language or DNA sequence modeling), you can finetune it with non-differentiable reward functions! 🎯

For example, this enables RLHF for discrete diffusion models, making alignment more flexible and powerful. ✅
February 4, 2025 at 3:42 PM
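To make "non-differentiable reward" concrete, here are two hypothetical rewards (my own illustration, not examples from the paper). They only ever need to be evaluated on sampled sequences, since the gradient flows through the model's scores rather than through the reward.

```python
# Hypothetical non-differentiable rewards (illustration only). They are
# evaluated on samples, never backpropagated through.
def dna_reward(seq: str, target_gc: float = 0.5) -> float:
    """Reward a DNA sequence for hitting a target GC content (hard threshold)."""
    gc = sum(base in "GC" for base in seq) / max(len(seq), 1)
    return 1.0 if abs(gc - target_gc) < 0.05 else 0.0

def format_reward(text: str) -> float:
    """Reward a text completion for ending with a question mark."""
    return float(text.strip().endswith("?"))
```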
The main gradient takes the form of a weighted log concrete score, echoing DeepSeek’s unified paradigm with the weighted log policy!🔥

From this, we can reconstruct any policy gradient method for discrete diffusion models (e.g. PPO, GRPO, etc.). 🚀
February 4, 2025 at 3:42 PM
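Schematically (my paraphrase, not the exact expression from the paper), such a gradient is a score-function estimator of the form

$$
\nabla_\theta J(\theta) \;\approx\; \mathbb{E}\big[\, w \,\nabla_\theta \log s_\theta \,\big],
$$

where $s_\theta$ is the concrete score and $w$ is a reward-derived weight. Picking $w$ to be a clipped or group-normalized advantage would give PPO- or GRPO-flavoured updates, mirroring the autoregressive case where $\log s_\theta$ is replaced by the log policy.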
The main bottleneck of Energy-Based Models is computing the normalizing constant Z.

Instead, recent discrete diffusion models skip Z by learning ratios of probabilities. This forms the concrete score, which a neural network models efficiently!⚡

The challenge? Using this score network as a policy.
February 4, 2025 at 3:42 PM
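A one-line illustration of why ratios dodge $Z$: for an energy-based model $p_\theta(x) = e^{-E_\theta(x)}/Z_\theta$,

$$
\frac{p_\theta(y)}{p_\theta(x)}
\;=\;
\frac{e^{-E_\theta(y)}/Z_\theta}{e^{-E_\theta(x)}/Z_\theta}
\;=\;
e^{\,E_\theta(x)-E_\theta(y)},
$$

so the normalizing constant cancels, and the network only has to model these ratios (the concrete score) for sequences $y$ close to $x$.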
🚀 Policy gradient methods like DeepSeek’s GRPO are great for finetuning LLMs via RLHF.

But what happens when we swap autoregressive generation for discrete diffusion, a rising architecture promising faster & more controllable LLMs?

Introducing SEPO!

📑 arxiv.org/pdf/2502.01384

🧵👇
February 4, 2025 at 3:42 PM
💡 For a Markov chain with d states, the LLM-based method achieves an error rate of O(log(d)/N).

The frequentist approach, which is minimax optimal, achieves O(d/N) (see Wolfer et al., 2019, arxiv.org/pdf/1902.00080).

This makes it particularly efficient for Markov chains with a large number of states! 🌟
November 26, 2024 at 2:52 PM
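Ignoring constants, the ratio between the two bounds above is log(d)/d, which shrinks quickly as the state space grows; a quick back-of-the-envelope check:

```python
import math

# Ratio of the two rates quoted above, (log(d)/N) / (d/N) = log(d)/d,
# ignoring constants: the larger the state space, the bigger the gap.
for d in (10, 1_000, 100_000):
    print(f"d={d:>6}  log(d)/d = {math.log(d) / d:.2e}")
```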
‼️What’s even better is that you can derive bounds on the estimation error based on the number of samples N provided and specific properties of the Markov chain.

Tested and validated on recent LLMs!
November 26, 2024 at 2:52 PM
🚀 Did you know you can use the in-context learning abilities of an LLM to estimate the transition probabilities of a Markov chain?

The results are pretty exciting! 😄
November 26, 2024 at 2:52 PM
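A sketch of the setup as I understand it: the frequentist baseline below is real code, while the LLM part is only indicated with a hypothetical helper, since the actual prompting protocol is described in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Ground-truth 3-state chain and a trajectory sampled from it.
P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.6, 0.3],
              [0.2, 0.3, 0.5]])
states = [0]
for _ in range(2_000):
    states.append(rng.choice(3, p=P[states[-1]]))

# Frequentist baseline: normalized transition counts.
counts = np.zeros((3, 3))
for a, b in zip(states[:-1], states[1:]):
    counts[a, b] += 1
P_freq = counts / counts.sum(axis=1, keepdims=True)

# LLM-based idea (sketch only): serialize the trajectory as a prompt,
# e.g. "0 2 1 1 0 ...", and read the model's next-token distribution
# after each state symbol as an estimate of that state's transition row.
# `llm_next_token_probs` is a hypothetical helper, not a real API:
# P_icl[a] = llm_next_token_probs(prompt_ending_in_state=a)

print(np.abs(P_freq - P).max())  # frequentist estimation error
```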