Brad Knox
@bradknox.bsky.social
Research Associate Professor in CS at UT Austin. I research how humans can specify aligned reward functions.
Giving such shorter-horizon feedback does tend to result in more varied rewards. And this variation bears resemblance to the meaning of the word dense, which I suspect is the origin of this misnomer. (4/n)
February 24, 2025 at 5:27 PM
I find that what people really mean by "dense" is that so-called denser reward functions are giving feedback on *recent* state-action pairs, thus reducing the credit assignment problem (at some risk of misalignment). (3/n)
February 24, 2025 at 5:27 PM
In standard RL, all reward functions give reward at every time step. A reward of 0 is informative, as is a reward of -1. So all reward functions are dense. (2/n)
February 24, 2025 at 5:27 PM
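A minimal sketch of the distinction drawn in the two posts above (an illustration, not from the thread; the gridworld, goal position, and shaping term are assumptions). Both reward functions below return a value on every time step, so both are dense in the standard sense; the shaped one simply gives feedback tied to the most recent transition, which eases credit assignment.

```python
# Hypothetical 1-D gridworld: states are integer positions, goal at position 10.
GOAL = 10

def goal_only_reward(state, action, next_state):
    """Often called 'sparse', yet it is defined and emitted on every step;
    the zeros between goal visits are informative, just rarely varied."""
    return 1.0 if next_state == GOAL else 0.0

def shaped_reward(state, action, next_state):
    """Often called 'dense': potential-based shaping (gamma = 1) adds a term
    that reflects the most recent transition's progress toward the goal,
    easing credit assignment, at some risk of misalignment if chosen poorly."""
    shaping = abs(GOAL - state) - abs(GOAL - next_state)
    return goal_only_reward(state, action, next_state) + shaping

# Both functions return a reward for every transition, goal reached or not.
for s, a, s_next in [(0, +1, 1), (1, +1, 2), (9, +1, 10)]:
    print(goal_only_reward(s, a, s_next), shaped_reward(s, a, s_next))
```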
Work led by Stephane Hatgis-Kessell, in collaboration with @reniebird.bsky.social, @scottniekum.bsky.social, Peter Stone, and me. Full paper: arxiv.org/pdf/2501.06416
January 14, 2025 at 11:51 PM
Our findings suggest that human training and preference elicitation interfaces are essential tools for improving alignment in RLHF. The interventions of studies 2 and 3 can be applied in real-world settings and suggest fundamentally new methods for model alignment. (8/n)
January 14, 2025 at 11:51 PM
Study 3: Simply changing the question asked during preference elicitation. (7/n)
January 14, 2025 at 11:51 PM
Study 2: Training people to follow a specific preference model. (6/n)
January 14, 2025 at 11:51 PM
Study 1 intervention: Show humans the quantities that underlie a preference model---normally unobservable information derived from the reward function. (5/n)
January 14, 2025 at 11:51 PM
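To make the Study 1 intervention above concrete, here is a hedged sketch (not the paper's interface) of the kind of normally hidden quantity such an interface could surface, assuming for illustration the common partial-return preference model; the segments and reward values are made up.

```python
def partial_return(segment_rewards):
    """Sum of the reward function's output over a trajectory segment;
    normally this quantity is hidden from the annotator."""
    return sum(segment_rewards)

# Made-up per-step rewards for two segments shown to an annotator.
segment_a = [0.0, 0.5, 1.0]
segment_b = [0.2, 0.2, 0.2]

# The intervention's idea: display the underlying quantity next to each
# segment before asking for a preference.
print(f"Segment A partial return: {partial_return(segment_a):.2f}")
print(f"Segment B partial return: {partial_return(segment_b):.2f}")
```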
Every study involved a control condition and two intervention conditions that attempted to influence humans toward a preference model. In all 3 studies, each intervention (1) significantly influenced humans toward the preference model and (2) led to learning more aligned reward functions. (4/n)
January 14, 2025 at 11:51 PM
We answer this question with 3 human studies. Without trying to alter the human's unobserved reward function, we change how humans use this reward function to generate preferences so that their preferences better match the preference model assumed by an RLHF algorithm. (3/n)
January 14, 2025 at 11:51 PM
RLHF algorithms typically require a model of how humans generate preferences. But a poor model of humans risks learning a poor approximation of the human’s unobservable reward function. We ask: *Can we influence humans to more closely conform to a desired preference model?* (2/n)
January 14, 2025 at 11:51 PM
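For context on "preference model" in the post above: a sketch of a model commonly assumed in the RLHF literature, the Bradley-Terry (logistic) model over segment partial returns. This specific form is an illustrative assumption, not a claim about the paper's exact model.

```python
import math

def partial_return(segment_rewards):
    """Sum of the (unobservable) reward over a trajectory segment."""
    return sum(segment_rewards)

def bradley_terry_prob(rewards_1, rewards_2, beta=1.0):
    """Assumed probability that the human prefers segment 1 over segment 2:
    P(1 > 2) = sigmoid(beta * (G1 - G2)), where Gi is segment i's partial
    return. RLHF then fits a reward function by maximizing the likelihood
    of the human's observed preferences under this assumed model."""
    diff = beta * (partial_return(rewards_1) - partial_return(rewards_2))
    return 1.0 / (1.0 + math.exp(-diff))

# Example: segment 1 accumulates more reward, so the assumed model predicts
# it is preferred with probability above 0.5.
print(bradley_terry_prob([1.0, 0.0, 1.0], [0.0, 0.5, 0.0]))  # ~0.82
```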