Keshav Ramji
@keshavramji.bsky.social
Post-training Alignment at IBM Research AI | Prev: Penn CS + Wharton
(7/7) Our work highlights the potential of both principle-driven post-training approaches and self-improvement strategies that require minimal human supervision.

Joint work with @tahiranaseem.bsky.social and @ramon-astudillo.bsky.social!
May 23, 2025 at 9:39 PM
(6/n) This mechanism largely retains or even improves performance, while facilitating convergence to a fixed constitution across iterations. The resulting model also improves at self-correction, successfully revising a larger fraction of samples with each iteration.
May 23, 2025 at 9:36 PM
(5/n) We show that clustering the set of principles after the principle discovery phase compresses it into a human-readable constitution; this acts like a posterior regularization step in latent variable modeling.
May 23, 2025 at 9:36 PM
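
As a rough illustration of this compression step (not the paper's code), one could cluster embeddings of the discovered principles and keep one representative per cluster. The compress_to_constitution helper below, the KMeans choice, and the assumption of precomputed sentence embeddings are all illustrative.

import numpy as np
from sklearn.cluster import KMeans

def compress_to_constitution(principles, embeddings, k=10):
    # principles: list of principle strings
    # embeddings: array of shape (len(principles), d) from any sentence encoder
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0)
    labels = kmeans.fit_predict(embeddings)

    constitution = []
    for c in range(k):
        members = np.where(labels == c)[0]
        # Representative principle = the one closest to the cluster centroid.
        dists = np.linalg.norm(embeddings[members] - kmeans.cluster_centers_[c], axis=1)
        constitution.append(principles[members[np.argmin(dists)]])
    return constitution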
(4/n) The result is a model trained to follow its own constitution, improving both on instruction-following benchmarks and in principle-following ability over 3-4 iterations!
May 23, 2025 at 9:36 PM
(3/n) We introduce a Monte Carlo EM algorithm that alternates between *principle discovery* and *principle learning* phases, enabling the LM to self-improve over multiple iterations by bootstrapping its learned distribution of principles to discover new ones.
May 23, 2025 at 9:34 PM
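
As a rough sketch (not code from the paper) of how such an alternating loop could be structured: generate, discover_principle, revise, and finetune below are hypothetical callables standing in for the model-specific steps.

from typing import Callable, List, Tuple

def self_improve(
    generate: Callable[[str], str],                        # prompt -> initial response
    discover_principle: Callable[[str, str, str], str],    # (prompt, initial, reference) -> principle
    revise: Callable[[str, str, str], str],                # (prompt, initial, principle) -> revision
    finetune: Callable[[List[Tuple[str, str, str]]], None],  # update the model on (prompt, principle, revision) triples
    data: List[Tuple[str, str]],                           # (prompt, human-written reference) pairs
    iterations: int = 4,
) -> None:
    # Sketch of a Monte Carlo EM-style self-improvement loop; all callables are hypothetical.
    for _ in range(iterations):
        examples: List[Tuple[str, str, str]] = []
        # Principle discovery (E-step-like): find a principle that bridges the
        # model's initial response and the human-written reference.
        for prompt, reference in data:
            initial = generate(prompt)
            principle = discover_principle(prompt, initial, reference)
            revision = revise(prompt, initial, principle)
            examples.append((prompt, principle, revision))
        # Principle learning (M-step-like): fine-tune on principle-conditioned
        # revisions, so the next iteration samples from the updated distribution
        # over principles and can discover new ones.
        finetune(examples)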
(2/n) This combines interpretability for human readers (users can see the dimension along which the model is revising) with utility for the model (the principles it finds most useful for improving response quality).
May 23, 2025 at 9:33 PM
(1/n) These principles operate like a latent reasoning trace, bridging an initial response and a revision that improves it along a model-determined dimension, “stepping” towards a human-written reference.
May 23, 2025 at 9:33 PM
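
One hypothetical way to picture a principle as the latent step between an initial response and its revision; the field names and contents below are illustrative only, and the paper's actual prompt/data templates may differ.

# Hypothetical training-example layout; all fields are illustrative.
example = {
    "prompt": "Explain how vaccines work to a ten-year-old.",
    "initial_response": "Vaccines expose the immune system to weakened antigens ...",
    "principle": "Use concrete, age-appropriate analogies instead of technical jargon.",
    "revision": "A vaccine is like a practice drill for your body's defenders ...",
    "reference": "(human-written answer the revision should step towards)",
}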
DM me to chat about language model alignment/post-training, reasoning, self-improvement algorithms and data-centric methods, etc.
April 23, 2025 at 11:52 PM