Brad Knox
@bradknox.bsky.social
Research Associate Professor in CS at UT Austin. I research how humans can specify aligned reward functions.
Giving such shorter-horizon feedback does tend to result in more varied rewards. And this variation bears resemblance to the meaning of the word dense, which I suspect is the origin of this misnomer. (4/n)
February 24, 2025 at 5:27 PM
I find that what people really mean by "dense" is that so-called denser reward functions are giving feedback on *recent* state-action pairs, thus reducing the credit assignment problem (at some risk of misalignment). (3/n)
February 24, 2025 at 5:27 PM
In standard RL, all reward functions give reward at every time step. A reward of 0 is informative, as is a reward of -1. So all reward functions are dense. (2/n)
February 24, 2025 at 5:27 PM
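A minimal sketch of the distinction drawn in the two posts above (an illustration, not from the thread; the gridworld, goal position, and shaping term are assumptions). Both reward functions below return a value on every time step, so both are dense in the standard sense; the shaped one simply gives feedback tied to the most recent transition, which eases credit assignment.

```python
# Hypothetical 1-D gridworld: states are integer positions, goal at position 10.
GOAL = 10

def goal_only_reward(state, action, next_state):
    """Often called 'sparse', yet it is defined and emitted on every step;
    the zeros between goal visits are informative, just rarely varied."""
    return 1.0 if next_state == GOAL else 0.0

def shaped_reward(state, action, next_state):
    """Often called 'dense': potential-based shaping (gamma = 1) adds a term
    that reflects the most recent transition's progress toward the goal,
    easing credit assignment, at some risk of misalignment if chosen poorly."""
    shaping = abs(GOAL - state) - abs(GOAL - next_state)
    return goal_only_reward(state, action, next_state) + shaping

# Both functions return a reward for every transition, goal reached or not.
for s, a, s_next in [(0, +1, 1), (1, +1, 2), (9, +1, 10)]:
    print(goal_only_reward(s, a, s_next), shaped_reward(s, a, s_next))
```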
Work led by Stephane Hatgis-Kessell, in collaboration with @reniebird.bsky.social, @scottniekum.bsky.social, Peter Stone, and me. Full paper: arxiv.org/pdf/2501.06416
January 14, 2025 at 11:51 PM
Our findings suggest that human training and preference elicitation interfaces are essential tools for improving alignment in RLHF. The interventions of studies 2 and 3 can be applied in real-world settings and suggest fundamentally new methods for model alignment. (8/n)
January 14, 2025 at 11:51 PM
Study 3: Simply changing the question asked during preference elicitation. (7/n)
January 14, 2025 at 11:51 PM
Study 2: Training people to follow a specific preference model. (6/n)
January 14, 2025 at 11:51 PM
Study 1 intervention: Show humans the quantities that underlie a preference model---normally unobservable information derived from the reward function. (5/n)
January 14, 2025 at 11:51 PM
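To make the Study 1 intervention above concrete, here is a hedged sketch (not the paper's interface) of the kind of normally hidden quantity such an interface could surface, assuming for illustration the common partial-return preference model; the segments and reward values are made up.

```python
def partial_return(segment_rewards):
    """Sum of the reward function's output over a trajectory segment;
    normally this quantity is hidden from the annotator."""
    return sum(segment_rewards)

# Made-up per-step rewards for two segments shown to an annotator.
segment_a = [0.0, 0.5, 1.0]
segment_b = [0.2, 0.2, 0.2]

# The intervention's idea: display the underlying quantity next to each
# segment before asking for a preference.
print(f"Segment A partial return: {partial_return(segment_a):.2f}")
print(f"Segment B partial return: {partial_return(segment_b):.2f}")
```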
Every study involved a control condition and two intervention conditions that attempted to influence humans toward a preference model. In all 3 studies, each intervention (1) significantly influenced humans toward the preference model and (2) led to learning more aligned reward functions. (4/n)
January 14, 2025 at 11:51 PM
We answer this question with 3 human studies. Without trying to alter the human's unobserved reward function, we change how humans use this reward function to generate preferences so that their preferences better match the preference model assumed by an RLHF algorithm. (3/n)
January 14, 2025 at 11:51 PM
RLHF algorithms typically require a model of how humans generate preferences. But a poor model of humans risks learning a poor approximation of the human’s unobservable reward function. We ask: *Can we influence humans to more closely conform to a desired preference model?* (2/n)
January 14, 2025 at 11:51 PM
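For context on "preference model" in the post above: a sketch of a model commonly assumed in the RLHF literature, the Bradley-Terry (logistic) model over segment partial returns. This specific form is an illustrative assumption, not a claim about the paper's exact model.

```python
import math

def partial_return(segment_rewards):
    """Sum of the (unobservable) reward over a trajectory segment."""
    return sum(segment_rewards)

def bradley_terry_prob(rewards_1, rewards_2, beta=1.0):
    """Assumed probability that the human prefers segment 1 over segment 2:
    P(1 > 2) = sigmoid(beta * (G1 - G2)), where Gi is segment i's partial
    return. RLHF then fits a reward function by maximizing the likelihood
    of the human's observed preferences under this assumed model."""
    diff = beta * (partial_return(rewards_1) - partial_return(rewards_2))
    return 1.0 / (1.0 + math.exp(-diff))

# Example: segment 1 accumulates more reward, so the assumed model predicts
# it is preferred with probability above 0.5.
print(bradley_terry_prob([1.0, 0.0, 1.0], [0.0, 0.5, 0.0]))  # ~0.82
```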