Zhaofeng Wu
@zhaofengwu.bsky.social
PhD student @ MIT | Previously PYI @ AI2 | MS'21 BS'19 BA'19 @ UW | zhaofengwu.github.io
💡A simple method improves robustness: adding an auxiliary loss that encourages reward similarity between paraphrases ⚖️ This generalizes to improving RM performance on diverse reWordBench transformations. More surprisingly, during alignment, regularized RMs lead to better outputs too 📈
March 18, 2025 at 4:01 PM
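A minimal sketch of what such a paraphrase-consistency regularizer could look like for a classifier-style RM, in PyTorch. Everything here is illustrative rather than the paper's exact formulation: `rm` is assumed to map a tokenized (prompt, response) batch to scalar rewards, and `batch`, `chosen_paraphrase`, and `aux_weight` are hypothetical names.

```python
import torch.nn.functional as F

def regularized_rm_loss(rm, batch, aux_weight=1.0):
    """Bradley-Terry preference loss plus a paraphrase-consistency term.

    Assumes `rm(inputs)` returns one scalar reward per example and that the
    batch carries a paraphrase of each preferred (chosen) response.
    """
    r_chosen = rm(batch["chosen"])                 # rewards for preferred responses
    r_rejected = rm(batch["rejected"])             # rewards for rejected responses
    r_paraphrase = rm(batch["chosen_paraphrase"])  # rewards for paraphrased preferred responses

    # Standard preference loss: score chosen above rejected.
    bt_loss = -F.logsigmoid(r_chosen - r_rejected).mean()

    # Auxiliary loss: paraphrases of the same response should get similar rewards.
    aux_loss = F.mse_loss(r_paraphrase, r_chosen)

    return bt_loss + aux_weight * aux_loss
```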
We create a benchmark 🌟reWordBench🌟 of systematically transformed RewardBench instances that preserve their semantics and preference ranking 🎛 On it, all top RMs from RewardBench degrade in accuracy ⏬ regardless of their size and type (classifier vs. generative)
March 18, 2025 at 4:01 PM
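The evaluation such a benchmark implies fits in a few lines. A sketch, assuming a hypothetical `score(prompt, response) -> float` wrapper around an RM and an optional meaning-preserving `transform`:

```python
def rm_accuracy(score, triples, transform=None):
    """Fraction of (prompt, chosen, rejected) triples the RM ranks correctly.

    Passing a meaning-preserving `transform`, applied to both responses,
    lets you compare accuracy on original vs. reworded (reWordBench-style)
    versions of the same instances.
    """
    correct = 0
    for prompt, chosen, rejected in triples:
        if transform is not None:
            chosen, rejected = transform(chosen), transform(rejected)
        correct += score(prompt, chosen) > score(prompt, rejected)
    return correct / len(triples)
```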
E.g., all math instances in RewardBench share an artifact: the preferred responses put the result in \boxed{} while the rejected responses put it after a `# Answer` markdown header 💀 Flipping the formats 🔄 consistently degrades SOTA RM accuracy, by up to >22% 📉
March 18, 2025 at 4:01 PM
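A hedged sketch of that format flip as a transformation, using only the two surface formats named above; the regexes are illustrative, and real instances may need more careful handling:

```python
import re

def flip_math_format(response: str) -> str:
    """Swap the two answer formats: \\boxed{...} <-> a '# Answer' markdown header."""
    boxed = re.search(r"\\boxed\{([^{}]*)\}", response)
    if boxed:  # \boxed{} style -> '# Answer' header style
        body = re.sub(r"\\boxed\{([^{}]*)\}", r"\1", response)
        return body.rstrip() + "\n\n# Answer\n\n" + boxed.group(1)
    header = re.search(r"#\s*Answer\s*\n+(.+)", response)
    if header:  # '# Answer' header style -> \boxed{} style
        body = response[: header.start()].rstrip()
        return body + "\n\\boxed{" + header.group(1).strip() + "}"
    return response  # no recognized answer format; leave unchanged
```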
Robust reward models are critical for alignment and inference-time algorithms, automatic evaluation, etc. (e.g., to prevent reward hacking, which can render alignment ineffective). ⚠️ But we find that SOTA RMs are brittle 🫧 and easily flip their predictions when inputs are slightly transformed 🍃 🧵
March 18, 2025 at 4:01 PM
For English-centric models (and analogously for others) 📍1️⃣ semantically equivalent inputs from distinct data types (e.g., English-Chinese parallel sentences, or an image and its caption) have similar representations in intermediate transformer layers 🖇, which function as this transmodal "semantic hub"
December 2, 2024 at 6:08 PM
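One way to probe this claim yourself is to compare layer-wise representations of parallel inputs. A sketch with Hugging Face transformers; the model name is a placeholder for any English-centric LM, and mean pooling is just one reasonable choice of sentence representation:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-2-7b-hf"  # placeholder: any English-centric LM
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, output_hidden_states=True)
model.eval()

def layer_reprs(text):
    """Mean-pooled hidden state of `text` at every transformer layer."""
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids)
    return [h[0].mean(dim=0) for h in out.hidden_states]  # one vector per layer

en = layer_reprs("The weather is nice today.")
zh = layer_reprs("今天天气很好。")  # Chinese translation of the same sentence

# If the hypothesis holds, similarity should peak in intermediate layers.
for layer, (e, z) in enumerate(zip(en, zh)):
    sim = torch.cosine_similarity(e, z, dim=0).item()
    print(f"layer {layer:2d}: cosine similarity {sim:.3f}")
```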
Neuroscience studies posit that the human brain follows a "hub-and-spoke" model in which a transmodal semantic "hub" integrates information from modality-specific "spoke" regions 🕸 We hypothesize that LMs have a similar "semantic hub" that abstractly processes information (fig. from Ralph+17)
December 2, 2024 at 6:08 PM
💡We find that models "think" 💭 in English (or, more generally, their dominant language) when processing non-English or even non-linguistic data types 🤯 like text in other languages, arithmetic expressions, code, visual inputs, & audio inputs‼️ 🧵⬇️ arxiv.org/abs/2411.04986
December 2, 2024 at 6:08 PM
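Findings like this are typically probed with a logit-lens-style readout: decode intermediate hidden states through the model's own unembedding and look at which vocabulary tokens they are closest to. A sketch under that assumption (the attribute paths `model.model.norm` / `model.lm_head` are Llama-specific, and the model name is again a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-2-7b-hf"  # placeholder English-centric LM
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, output_hidden_states=True)
model.eval()

def logit_lens(text, layers=(8, 16, 24, 31)):
    """Decode the last token's intermediate hidden states through the
    final norm + unembedding and print the nearest vocabulary tokens."""
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids)
    for layer in layers:
        h = out.hidden_states[layer][0, -1]          # last-token hidden state
        logits = model.lm_head(model.model.norm(h))  # Llama-specific paths
        top = logits.topk(5).indices.tolist()
        print(f"layer {layer:2d}:", tok.convert_ids_to_tokens(top))

# If the hypothesis holds, mid-layer readouts of a non-English prompt often
# surface English tokens before the final layers return to the input language.
logit_lens("Le chat est assis sur le")
```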