Ben Lipkin
@benlipkin.bsky.social
phd @ mit, research @ genlm, intern @ apple

https://benlipkin.github.io/
Great question! We have not directly compared the two. SMC offers a test-time approach to steer off-the-shelf models without additional training, whereas diffusion forcing trains autoregressive models to sample more effectively from the global target. The two strategies could likely even be combined.
May 13, 2025 at 5:28 PM
And check out the following papers, which set the technical landscape for this work to build on:

Lew et al (2023): arxiv.org/abs/2306.03081
Loula et al (2025): arxiv.org/abs/2504.13139
May 13, 2025 at 2:22 PM
Thanks to Ben LeBrun, @vigly.bsky.social, @joaoloula.bsky.social, @drmaciver.bsky.social, Li Du, Jason Eisner, Ryan Cotterell, Vikash Mansinghka, Tim O'Donnell, @alexlew.bsky.social, @xtimv.bsky.social

Paper Link: arxiv.org/abs/2504.05410
May 13, 2025 at 2:22 PM
Want to use AWRS SMC?

Check out the GenLM control library: github.com/genlm/genlm-...

GenLM supports not only grammars, but arbitrary programmable constraints from type systems to simulators.

If you can write a Python function, you can control your language model!
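For instance, any Python predicate over strings can serve as a constraint. The sketch below is plain Python with a hypothetical validity check, not the genlm-control API (see the library docs for the actual interface):

```python
import ast

def is_valid_python(text: str) -> bool:
    """Hypothetical constraint: accept `text` iff it parses as a
    Python program. Any boolean predicate like this can play the
    role of the constraint function C."""
    try:
        ast.parse(text)
        return True
    except SyntaxError:
        return False
```

A predicate of this shape is the whole interface: if it returns True/False on a string, it can drive controlled generation.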
May 13, 2025 at 2:22 PM
Why does AWRS work?

Formal and empirical runtime analyses tell a fascinating story.

AWRS scales adaptively with the KL divergence between the conditional and base token-level models.

As your LM better understands the constraint, AWRS gets faster.

As the LM struggles, AWRS closes the gap.
May 13, 2025 at 2:22 PM
We tested AWRS SMC on several controlled generation tasks, from text-to-SQL to PDDL goal inference to molecular synthesis.

AWRS SMC outperforms baselines by large margins, e.g., see the jump from 3% -> 53% in the goal inference domain with only ~2.5x wall-clock overhead.
May 13, 2025 at 2:22 PM
Next, SMC uses the proposed extensions and corresponding weights from AWRS to update importance weights associated with partial sequences (particles).

These particles are resampled proportional to their weights, re-allocating computation towards the most promising sequences.
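The resampling step above can be sketched as follows (multinomial resampling for simplicity; real SMC implementations often prefer systematic or stratified schemes):

```python
import random

def resample(particles, weights):
    """Multinomial resampling: draw len(particles) particles with
    probability proportional to their importance weights, then reset
    the weights to uniform. High-weight partial sequences are
    duplicated; low-weight ones tend to be dropped."""
    total = sum(weights)
    probs = [w / total for w in weights]
    new = random.choices(particles, weights=probs, k=len(particles))
    return new, [1.0] * len(new)
```

This is how computation gets re-allocated: a particle with most of the weight mass will usually occupy several slots after resampling.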
May 13, 2025 at 2:22 PM
First, AWRS reformulates the token-level inference problem from exact enumeration to adaptive rejection sampling.

This process yields equivalently distributed samples at a fraction of the cost.

AWRS then estimates and propagates an importance weight alongside these samples.
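An intuition-level sketch of the token-level idea (not the paper's exact AWRS algorithm, which additionally turns the rejection trace into an unbiased importance-weight estimate instead of ever enumerating the vocabulary):

```python
import random

def rejection_sample_token(probs, allowed):
    """Adaptive rejection sampling over a token distribution:
    a rejected token is never proposed again, so the sampler
    renormalizes over the shrinking live set. The accepted token
    is exactly distributed as the proposal conditioned on the
    allowed set. Returns the token and the number of rejections
    (in AWRS, the rejection trace informs the weight estimate)."""
    rejected = set()
    while True:
        live = {t: p for t, p in probs.items() if t not in rejected}
        if not live:
            raise ValueError("no token satisfies the constraint")
        toks = list(live)
        t = random.choices(toks, weights=[live[k] for k in toks], k=1)[0]
        if allowed(t):
            return t, len(rejected)
        rejected.add(t)
```

The key contrast with token masking: nothing here touches the full vocabulary unless the constraint is hard enough to force many rejections.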
May 13, 2025 at 2:22 PM
So, what can we do?

AWRS SMC is a hierarchical inference framework that combines sequential Monte Carlo with a novel stochastic proposal algorithm.

By jointly considering local and global signals, AWRS SMC is both probabilistically sound and sample efficient.

How does it work?
May 13, 2025 at 2:22 PM
Problem B: LCD distorts the distribution.

Consider a simple LM over the tokens `a` and `b`, with the constraint that “strings must end with `a`”.

While the constrained distribution over complete strings favors `ba`, the LM’s local preference for the prefix `a` (as in its favorite string, `ab`) means locally constrained autoregressive sampling will mostly emit `aa`.

We don’t want this.
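The distortion is easy to check by hand. Here is a toy two-token LM with illustrative numbers (assumed for this sketch, not taken from the post):

```python
# Toy autoregressive LM over {a, b}, two tokens per string.
first = {"a": 0.8, "b": 0.2}
second = {"a": {"a": 0.1, "b": 0.9},   # after `a`, the LM prefers `b`
          "b": {"a": 0.9, "b": 0.1}}   # after `b`, the LM prefers `a`

def q(s):
    """Base LM probability of the two-token string s."""
    return first[s[0]] * second[s[0]][s[1]]

# Target posterior: condition on "ends with `a`", then renormalize.
valid = ["aa", "ba"]
Z = sum(q(s) for s in valid)
posterior = {s: q(s) / Z for s in valid}

# LCD: at step 1 both tokens still admit a valid completion, so the
# mask removes nothing; at step 2 the mask forces `a`. The step-1
# choice is never revisited.
lcd = {"aa": first["a"], "ba": first["b"]}
```

The numbers work out to posterior `ba` ≈ 0.69 vs `aa` ≈ 0.31, while LCD emits `aa` 80% of the time: the local mask cannot undo the greedy first step.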
May 13, 2025 at 2:22 PM
Problem A: Token masking is often slow.

Must classify all 100,000+ tokens in the vocab at each step.

While regular and context-free grammars support low-overhead solutions using tools like Outlines (dottxtai.bsky.social), open-ended constraint enforcement has been harder.
May 13, 2025 at 2:22 PM
Approach 2: Locally constrained decoding (LCD).

At each step, mask the next-token distribution to prevent violations.

Pros: All samples are constraint-satisfying.
Cons: A) Masking a large vocabulary is slow. B) LCD distorts the sampled distribution.

Example:
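A minimal sketch of the masking step, assuming a hypothetical `can_continue` oracle that decides whether a token still admits a valid completion:

```python
def lcd_step(probs, can_continue):
    """One locally constrained decoding step: zero out tokens that
    would make the constraint unsatisfiable, then renormalize the
    survivors. Sampling proceeds from the returned distribution."""
    masked = {t: p for t, p in probs.items() if can_continue(t)}
    Z = sum(masked.values())
    return {t: p / Z for t, p in masked.items()}
```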
May 13, 2025 at 2:22 PM
Approach 1: Sample-verify/Best-of-N.

Draw 𝑁 strings from the LM and use the constraint to rank/filter.

Pros: Samples converge to 𝑃 as 𝑁 grows.
Cons: The 𝑁 required to obtain a target sample scales as exp(KL[𝑃||𝑄]). For difficult constraints, this becomes infeasible.

Example:
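A minimal sketch of sample-verify with a toy sampler (illustrative only; the filter variant is shown, ranking is analogous):

```python
import random

def best_of_n(sample, satisfies, n):
    """Sample-verify: draw n strings from the LM and keep the
    constraint-satisfying ones. Every rejected draw is wasted
    compute, which is what makes hard constraints infeasible."""
    draws = [sample() for _ in range(n)]
    return [s for s in draws if satisfies(s)]
```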
May 13, 2025 at 2:22 PM
Consider a prompted language model 𝑄 (a prior) and a constraint function 𝐶 (a likelihood).

Our goal is to sample a string 𝑥 from the conditional distribution 𝑃 = 𝑄(·|𝐶(𝑥)=1) (the target posterior).

How do people do this now, and why do current approaches fall short?
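In toy form, this conditioning is just Bayes' rule by enumeration (the prior below and its numbers are illustrative assumptions):

```python
# Prior Q over complete strings; constraint C acts as a 0/1 likelihood.
Q = {"ab": 0.72, "aa": 0.08, "ba": 0.18, "bb": 0.02}
C = lambda x: x.endswith("a")

# Posterior P = Q(. | C(x) = 1): restrict to satisfying strings,
# then renormalize by the total satisfying mass Z.
Z = sum(p for x, p in Q.items() if C(x))
P = {x: p / Z for x, p in Q.items() if C(x)}
```

Enumeration is only possible in a toy like this; the whole difficulty is doing the same conditioning over the exponentially large space of real LM outputs.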
May 13, 2025 at 2:22 PM
Reposted by Ben Lipkin
Jason Eisner & Li Du’s “Syntactic and semantic control of large language models via sequential Monte Carlo” with @joaoloula.bsky.social, @benlipkin.bsky.social, @yahyaemara.bsky.social, @alexlew.bsky.social, @xtimv.bsky.social, & more presents an architecture for controlled LM generation: (11/12)
Syntactic and Semantic Control of Large Language Models via...
A wide range of LM applications require generating text that conforms to syntactic or semantic constraints. Imposing such constraints can be naturally framed as _probabilistic conditioning_, but...
openreview.net
April 21, 2025 at 4:44 PM
Big thanks to this awesome team: Ben LeBrun, @postylem.bsky.social, @joaoloula.bsky.social, @drmaciver.bsky.social, Li Du, Jason Eisner, Ryan Cotterell, Vikash Mansinghka, Tim O'Donnell, @alexlew.bsky.social, @xtimv.bsky.social
April 10, 2025 at 7:19 PM