Yuchen Zhu
@zhuyuchen.bsky.social
Machine Learning PhD student @UCL. Interested in Causality and AI Safety.

yuchen-zhu.github.io
The proxy reward arising from this setting satisfies our conditions; we include empirical results showing improvement when learning with this proxy reward in the upcoming camera-ready version.

n/n
May 1, 2025 at 3:33 PM
Apart from the example given, many natural frameworks satisfy our conditions. For example, an increased temperature in a tempered softmax introduces bias into the learned reward function.
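As a quick illustration (ours, not from the paper), here is how a raised temperature flattens the softmax probabilities, so preferences sampled from it give a biased picture of the underlying reward even though the ranking is unchanged; the reward values and temperatures are hypothetical:

```python
import numpy as np

def tempered_softmax(rewards, temperature=1.0):
    """Softmax over rewards with a temperature parameter."""
    z = np.asarray(rewards, dtype=float) / temperature
    z -= z.max()  # numerical stability
    p = np.exp(z)
    return p / p.sum()

# Hypothetical true rewards for three candidate responses.
true_rewards = np.array([1.0, 2.0, 4.0])

p_cold = tempered_softmax(true_rewards, temperature=1.0)  # roughly [0.04, 0.11, 0.84]
p_hot = tempered_softmax(true_rewards, temperature=3.0)   # roughly [0.20, 0.27, 0.53], flatter

# The hot distribution is a biased proxy for the cold one, yet both induce the
# same ranking (and the same level sets), which is the kind of structure the
# conditions exploit.
print(p_cold, p_hot)
```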

9/n
May 1, 2025 at 3:33 PM
4. If the expert judges two symptoms as similar, the trainee must also judge them as similar, up to some relaxation constant L; similarity is measured by a metric between the distributions the policies map to. Our sample-complexity improvement depends on L.
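To make this concrete, here is a toy sketch (our own illustration, with hypothetical distributions and total variation as the metric, and one plausible reading of the relaxation):

```python
import numpy as np

def tv_distance(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * np.abs(np.asarray(p) - np.asarray(q)).sum()

# Hypothetical distributions over diseases for two symptoms S1, S2.
expert_S1, expert_S2 = [0.7, 0.2, 0.1], [0.65, 0.25, 0.1]    # expert: S1 and S2 nearly identical
trainee_S1, trainee_S2 = [0.1, 0.6, 0.3], [0.15, 0.5, 0.35]  # trainee: also close, just shifted

L = 3.0  # hypothetical relaxation constant
expert_gap = tv_distance(expert_S1, expert_S2)    # 0.05
trainee_gap = tv_distance(trainee_S1, trainee_S2) # 0.10

# One reading of condition 4: the trainee may separate symptoms the expert
# considers similar, but only up to a factor L of the expert's separation.
assert trainee_gap <= L * expert_gap
```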

8/n
May 1, 2025 at 3:33 PM
3. There exists a low-dimensional encoding of the image of the proxy policy satisfying some smoothness conditions. Note that such low-dimensional structure is standard in the majority of machine learning tasks.
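For intuition, a toy sketch of such an encoding (the dimensions and the linear map are hypothetical; linearity is just an easy way to get smoothness):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical: the proxy policy outputs a high-dimensional distribution over
# responses; an encoder phi compresses it to a few features.
n_responses, n_features = 1000, 8
W = rng.normal(size=(n_responses, n_features)) / np.sqrt(n_responses)

def phi(policy_output):
    """Hypothetical smooth (linear, hence Lipschitz) low-dimensional encoding."""
    return policy_output @ W

p1 = rng.dirichlet(np.ones(n_responses))  # proxy policy's output for one prompt
p2 = rng.dirichlet(np.ones(n_responses))  # proxy policy's output for another prompt

# A linear map is Lipschitz: nearby policy outputs stay nearby after encoding.
print(np.linalg.norm(phi(p1) - phi(p2)) <= np.linalg.norm(W, 2) * np.linalg.norm(p1 - p2))
```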

7/n
May 1, 2025 at 3:33 PM
Crucially, it is not necessary that D1 = D2.
2. The proxy's image must contain that of the true policy. This essentially means that every disease D diagnosable by the expert can also be diagnosed by the trainee, though the trainee may map the wrong symptom to a given D.
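In the doctor analogy this is just a set-inclusion requirement on the diagnosable diseases; a toy check with hypothetical mappings:

```python
# Hypothetical symptom -> disease mappings for the doctor analogy.
expert = {"cough": "flu", "rash": "measles", "fever": "flu"}
trainee = {"cough": "measles", "rash": "flu", "fever": "flu", "fatigue": "anemia"}

# Condition 2: every disease the expert can diagnose is also in the trainee's
# range, even though the trainee may attach it to the wrong symptom.
assert set(expert.values()) <= set(trainee.values())
```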

6/n
May 1, 2025 at 3:33 PM
1. The proxy and true policies must share level sets: using trainee doctors as proxies for expert doctors, whenever the expert judges two distinct symptoms S1, S2 to indicate the same disease D1, the trainee also judges S1, S2 to indicate the same disease D2.
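A toy illustration of the level-set requirement (hypothetical mappings; note that the trainee's diseases differ from the expert's, which is allowed):

```python
# Hypothetical symptom -> disease mappings.
expert = {"cough": "flu", "sore throat": "flu", "rash": "measles"}
trainee = {"cough": "cold", "sore throat": "cold", "rash": "eczema"}

# Condition 1: whenever the expert maps two symptoms to the same disease,
# the trainee maps them to the same disease too (possibly a different one).
for s1 in expert:
    for s2 in expert:
        if expert[s1] == expert[s2]:
            assert trainee[s1] == trainee[s2]
```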

5/n
May 1, 2025 at 3:33 PM
Our work is the first to provide a theoretical foundation for using cheap but noisy rewards in preference learning of large generative models.
What do our technical conditions essentially say?

4/n
May 1, 2025 at 3:33 PM
Crucially, we prove that under our conditions the true policy is given by a low-dimensional adaptation of the proxy policy. This leads to a significant sample-complexity improvement, which we formally prove using PAC theory.
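One way to picture "low-dimensional adaptation" (a hypothetical PyTorch sketch, not the paper's construction): freeze the proxy policy and learn only a small head on top of it, so only a handful of parameters must be estimated from preference data.

```python
import torch.nn as nn

# Hypothetical sizes: a frozen proxy policy over n_actions, adapted through a
# small k-dimensional bottleneck.
d_in, k, n_actions = 512, 16, 1000

proxy_policy = nn.Linear(d_in, n_actions)  # stand-in for the fixed proxy policy
for p in proxy_policy.parameters():
    p.requires_grad_(False)                # proxy is frozen; no gradients flow here

# The only trainable parameters: a low-dimensional adapter on the proxy's output.
adapter = nn.Sequential(nn.Linear(n_actions, k), nn.ReLU(), nn.Linear(k, n_actions))

def adapted_logits(x):
    """Sketch: the 'true' policy as a small learned adjustment of the proxy."""
    proxy_logits = proxy_policy(x)
    return proxy_logits + adapter(proxy_logits)

# Far fewer parameters to learn than retraining the policy from scratch,
# which is the intuition behind the sample-complexity gain.
print(sum(p.numel() for p in adapter.parameters()))
```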

3/n
May 1, 2025 at 3:33 PM
Our work discusses sufficient conditions under which proxy rewards can be used to improve the learning of the underlying true policy in preference learning algorithms.

2/n
May 1, 2025 at 3:33 PM