Keshav Ramji
@keshavramji.bsky.social
Post-training Alignment at IBM Research AI | Prev: Penn CS + Wharton
(7/7) Our work highlights the potential of both principle-driven post-training approaches and self-improvement strategies that require minimal human supervision.

Joint work with @tahiranaseem.bsky.social and @ramon-astudillo.bsky.social!
May 23, 2025 at 9:39 PM
(6/n) This mechanism largely retains or even improves performance, while facilitating convergence to a fixed constitution across iterations. The resulting model also improves at self-correction, successfully revising a larger fraction of samples with each iteration.
May 23, 2025 at 9:36 PM
(5/n) We show that clustering the set of principles after the principle discovery phase compresses it into a human-readable constitution; this acts like a posterior regularization step in latent variable modeling.
May 23, 2025 at 9:36 PM
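
As a rough illustration of this compression step (not the paper's code), one could cluster embeddings of the discovered principles and keep one representative per cluster. The compress_to_constitution helper below, the KMeans choice, and the assumption of precomputed sentence embeddings are all illustrative.

import numpy as np
from sklearn.cluster import KMeans

def compress_to_constitution(principles, embeddings, k=10):
    # principles: list of principle strings
    # embeddings: array of shape (len(principles), d) from any sentence encoder
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0)
    labels = kmeans.fit_predict(embeddings)

    constitution = []
    for c in range(k):
        members = np.where(labels == c)[0]
        # Representative principle = the one closest to the cluster centroid.
        dists = np.linalg.norm(embeddings[members] - kmeans.cluster_centers_[c], axis=1)
        constitution.append(principles[members[np.argmin(dists)]])
    return constitution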
(4/n) The result is a model trained to follow its own constitution, improving both on instruction-following benchmarks and in principle-following ability over 3-4 iterations!
May 23, 2025 at 9:36 PM
(3/n) We introduce a Monte Carlo EM algorithm that alternates between *principle discovery* and *principle learning* phases, enabling the LM to self-improve over multiple iterations by bootstrapping its learned distribution of principles to discover new ones.
May 23, 2025 at 9:34 PM
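
As a rough sketch (not code from the paper) of how such an alternating loop could be structured: generate, discover_principle, revise, and finetune below are hypothetical callables standing in for the model-specific steps.

from typing import Callable, List, Tuple

def self_improve(
    generate: Callable[[str], str],                        # prompt -> initial response
    discover_principle: Callable[[str, str, str], str],    # (prompt, initial, reference) -> principle
    revise: Callable[[str, str, str], str],                # (prompt, initial, principle) -> revision
    finetune: Callable[[List[Tuple[str, str, str]]], None],  # update the model on (prompt, principle, revision) triples
    data: List[Tuple[str, str]],                           # (prompt, human-written reference) pairs
    iterations: int = 4,
) -> None:
    # Sketch of a Monte Carlo EM-style self-improvement loop; all callables are hypothetical.
    for _ in range(iterations):
        examples: List[Tuple[str, str, str]] = []
        # Principle discovery (E-step-like): find a principle that bridges the
        # model's initial response and the human-written reference.
        for prompt, reference in data:
            initial = generate(prompt)
            principle = discover_principle(prompt, initial, reference)
            revision = revise(prompt, initial, principle)
            examples.append((prompt, principle, revision))
        # Principle learning (M-step-like): fine-tune on principle-conditioned
        # revisions, so the next iteration samples from the updated distribution
        # over principles and can discover new ones.
        finetune(examples)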
(2/n) This combines interpretability for human readers (users can see the dimension along which the model is revising) with utility for the model (the principles it finds most useful for improving response quality).
May 23, 2025 at 9:33 PM
(1/n) These principles operate like a latent reasoning trace, bridging an initial response and a revision that improves it along a model-determined dimension, “stepping” towards a human-written reference.
May 23, 2025 at 9:33 PM
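
One hypothetical way to picture a principle as the latent step between an initial response and its revision; the field names and contents below are illustrative only, and the paper's actual prompt/data templates may differ.

# Hypothetical training-example layout; all fields are illustrative.
example = {
    "prompt": "Explain how vaccines work to a ten-year-old.",
    "initial_response": "Vaccines expose the immune system to weakened antigens ...",
    "principle": "Use concrete, age-appropriate analogies instead of technical jargon.",
    "revision": "A vaccine is like a practice drill for your body's defenders ...",
    "reference": "(human-written answer the revision should step towards)",
}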
DM me to chat about language model alignment/post-training, reasoning, self-improvement algorithms and data-centric methods, etc.
April 23, 2025 at 11:52 PM