Ben Lipkin
@benlipkin.bsky.social
phd @ mit, research @ genlm, intern @ apple

https://benlipkin.github.io/
Great question! We have not directly compared the two. SMC offers a test-time approach to steer off-the-shelf models without additional training, whereas diffusion forcing trains autoregressive models to sample more effectively from the global target. The two strategies could likely even be combined.
May 13, 2025 at 5:28 PM
And check out the following papers, which set the technical landscape for this work to build on:

Lew et al (2023): arxiv.org/abs/2306.03081
Loula et al (2025): arxiv.org/abs/2504.13139
May 13, 2025 at 2:22 PM
Thanks to Ben LeBrun, @vigly.bsky.social, @joaoloula.bsky.social, @drmaciver.bsky.social, Li Du, Jason Eisner, Ryan Cotterell, Vikash Mansinghka, Tim O'Donnell, @alexlew.bsky.social, @xtimv.bsky.social

Paper Link: arxiv.org/abs/2504.05410
May 13, 2025 at 2:22 PM
Want to use AWRS SMC?

Check out the GenLM control library: github.com/genlm/genlm-...

GenLM supports not only grammars, but arbitrary programmable constraints from type systems to simulators.

If you can write a Python function, you can control your language model!
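For instance, any Python predicate over strings can serve as a constraint. The sketch below is plain Python with a hypothetical validity check, not the genlm-control API (see the library docs for the actual interface):

```python
import ast

def is_valid_python(text: str) -> bool:
    """Hypothetical constraint: accept `text` iff it parses as a
    Python program. Any boolean predicate like this can play the
    role of the constraint function C."""
    try:
        ast.parse(text)
        return True
    except SyntaxError:
        return False
```

A predicate of this shape is the whole interface: if it returns True/False on a string, it can drive controlled generation.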
May 13, 2025 at 2:22 PM
Why does AWRS work?

Formal and empirical runtime analyses tell a fascinating story.

AWRS scales adaptively with the KL divergence between the conditional and base token-level models.

As your LM better understands the constraint, AWRS gets faster.

As the LM struggles, AWRS closes the gap.
May 13, 2025 at 2:22 PM
We tested AWRS SMC on several controlled generation tasks, from text-to-SQL to PDDL goal inference to molecular synthesis.

AWRS SMC outperforms baselines by large margins, e.g., see the jump from 3% -> 53% in the goal inference domain with only ~2.5x wall-clock overhead.
May 13, 2025 at 2:22 PM
Next, SMC uses the proposed extensions and corresponding weights from AWRS to update importance weights associated with partial sequences (particles).

These particles are resampled proportional to their weights, re-allocating computation towards the most promising sequences.
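The resampling step above can be sketched as follows (multinomial resampling for simplicity; real SMC implementations often prefer systematic or stratified schemes):

```python
import random

def resample(particles, weights):
    """Multinomial resampling: draw len(particles) particles with
    probability proportional to their importance weights, then reset
    the weights to uniform. High-weight partial sequences are
    duplicated; low-weight ones tend to be dropped."""
    total = sum(weights)
    probs = [w / total for w in weights]
    new = random.choices(particles, weights=probs, k=len(particles))
    return new, [1.0] * len(new)
```

This is how computation gets re-allocated: a particle with most of the weight mass will usually occupy several slots after resampling.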
May 13, 2025 at 2:22 PM
First, AWRS reformulates the token-level inference problem from exact enumeration to adaptive rejection sampling.

This process yields equivalently distributed samples at a fraction of the cost.

AWRS then estimates and propagates an importance weight alongside these samples.
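An intuition-level sketch of the token-level idea (not the paper's exact AWRS algorithm, which additionally turns the rejection trace into an unbiased importance-weight estimate instead of ever enumerating the vocabulary):

```python
import random

def rejection_sample_token(probs, allowed):
    """Adaptive rejection sampling over a token distribution:
    a rejected token is never proposed again, so the sampler
    renormalizes over the shrinking live set. The accepted token
    is exactly distributed as the proposal conditioned on the
    allowed set. Returns the token and the number of rejections
    (in AWRS, the rejection trace informs the weight estimate)."""
    rejected = set()
    while True:
        live = {t: p for t, p in probs.items() if t not in rejected}
        if not live:
            raise ValueError("no token satisfies the constraint")
        toks = list(live)
        t = random.choices(toks, weights=[live[k] for k in toks], k=1)[0]
        if allowed(t):
            return t, len(rejected)
        rejected.add(t)
```

The key contrast with token masking: nothing here touches the full vocabulary unless the constraint is hard enough to force many rejections.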
May 13, 2025 at 2:22 PM
So, what can we do?

AWRS SMC is a hierarchical inference framework that combines sequential Monte Carlo with a novel stochastic proposal algorithm.

By jointly considering local and global signals, AWRS SMC is both probabilistically sound and sample efficient.

How does it work?
May 13, 2025 at 2:22 PM
Problem B: LCD distorts the distribution.

Consider a simple LM over the tokens `a` and `b`, with the constraint that “strings must end with `a`”.

While the constrained distribution over complete strings favors `ba`, the LM’s local preference for the prefix `a` (as in its favorite string, `ab`) means locally constrained autoregressive sampling will mostly emit `aa`.

We don’t want this.
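The distortion is easy to check by hand. Here is a toy two-token LM with illustrative numbers (assumed for this sketch, not taken from the post):

```python
# Toy autoregressive LM over {a, b}, two tokens per string.
first = {"a": 0.8, "b": 0.2}
second = {"a": {"a": 0.1, "b": 0.9},   # after `a`, the LM prefers `b`
          "b": {"a": 0.9, "b": 0.1}}   # after `b`, the LM prefers `a`

def q(s):
    """Base LM probability of the two-token string s."""
    return first[s[0]] * second[s[0]][s[1]]

# Target posterior: condition on "ends with `a`", then renormalize.
valid = ["aa", "ba"]
Z = sum(q(s) for s in valid)
posterior = {s: q(s) / Z for s in valid}

# LCD: at step 1 both tokens still admit a valid completion, so the
# mask removes nothing; at step 2 the mask forces `a`. The step-1
# choice is never revisited.
lcd = {"aa": first["a"], "ba": first["b"]}
```

The numbers work out to posterior `ba` ≈ 0.69 vs `aa` ≈ 0.31, while LCD emits `aa` 80% of the time: the local mask cannot undo the greedy first step.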
May 13, 2025 at 2:22 PM
Problem A: Token masking is often slow.

Must classify all 100,000+ tokens in the vocab at each step.

While regular and context-free grammars support low-overhead solutions using tools like Outlines (dottxtai.bsky.social), open-ended constraint enforcement has been harder.
May 13, 2025 at 2:22 PM
Approach 2: Locally constrained decoding (LCD).

At each step, mask the next-token distribution to prevent violations.

Pros: All samples are constraint-satisfying.
Cons: A) Masking a large vocabulary is slow. B) LCD distorts the sampled distribution.

Example:
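A minimal sketch of the masking step, assuming a hypothetical `can_continue` oracle that decides whether a token still admits a valid completion:

```python
def lcd_step(probs, can_continue):
    """One locally constrained decoding step: zero out tokens that
    would make the constraint unsatisfiable, then renormalize the
    survivors. Sampling proceeds from the returned distribution."""
    masked = {t: p for t, p in probs.items() if can_continue(t)}
    Z = sum(masked.values())
    return {t: p / Z for t, p in masked.items()}
```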
May 13, 2025 at 2:22 PM
Approach 1: Sample-verify/Best-of-N.

Draw 𝑁 strings from the LM and use the constraint to rank/filter.

Pros: Samples converge to 𝑃 as 𝑁 grows.
Cons: The 𝑁 required to obtain a target sample scales as exp(KL[𝑃||𝑄]). For difficult constraints, this becomes infeasible.

Example:
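A minimal sketch of sample-verify with a toy sampler (illustrative only; the filter variant is shown, ranking is analogous):

```python
import random

def best_of_n(sample, satisfies, n):
    """Sample-verify: draw n strings from the LM and keep the
    constraint-satisfying ones. Every rejected draw is wasted
    compute, which is what makes hard constraints infeasible."""
    draws = [sample() for _ in range(n)]
    return [s for s in draws if satisfies(s)]
```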
May 13, 2025 at 2:22 PM
Consider a prompted language model 𝑄 (a prior) and a constraint function 𝐶 (a likelihood).

Our goal is to sample a string 𝑥 from the conditional distribution 𝑃 = 𝑄(·|𝐶(𝑥)=1) (the target posterior).

How do people do this now, and why do current approaches fall short?
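In toy form, this conditioning is just Bayes' rule by enumeration (the prior below and its numbers are illustrative assumptions):

```python
# Prior Q over complete strings; constraint C acts as a 0/1 likelihood.
Q = {"ab": 0.72, "aa": 0.08, "ba": 0.18, "bb": 0.02}
C = lambda x: x.endswith("a")

# Posterior P = Q(. | C(x) = 1): restrict to satisfying strings,
# then renormalize by the total satisfying mass Z.
Z = sum(p for x, p in Q.items() if C(x))
P = {x: p / Z for x, p in Q.items() if C(x)}
```

Enumeration is only possible in a toy like this; the whole difficulty is doing the same conditioning over the exponentially large space of real LM outputs.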
May 13, 2025 at 2:22 PM
Reposted by Ben Lipkin
Jason Eisner & Li Du’s “Syntactic and semantic control of large language models via sequential Monte Carlo” with @joaoloula.bsky.social, @benlipkin.bsky.social, @yahyaemara.bsky.social, @alexlew.bsky.social, @xtimv.bsky.social, & more presents an architecture for controlled LM generation: (11/12)
Syntactic and Semantic Control of Large Language Models via...
A wide range of LM applications require generating text that conforms to syntactic or semantic constraints. Imposing such constraints can be naturally framed as _probabilistic conditioning_, but...
openreview.net
April 21, 2025 at 4:44 PM
Big thanks to this awesome team: Ben LeBrun, @postylem.bsky.social, @joaoloula.bsky.social, @drmaciver.bsky.social, Li Du, Jason Eisner, Ryan Cotterell, Vikash Mansinghka, Tim O'Donnell, @alexlew.bsky.social, @xtimv.bsky.social
April 10, 2025 at 7:19 PM