Zhaofeng Wu
@zhaofengwu.bsky.social
PhD student @ MIT | Previously PYI @ AI2 | MS'21 BS'19 BA'19 @ UW | zhaofengwu.github.io
💡A simple method improves robustness: adding an auxiliary loss that encourages reward similarity between paraphrases ⚖️ This generalizes to improving RM performance on diverse reWordBench transformations. More surprisingly, during alignment, regularized RMs lead to better outputs too 📈
March 18, 2025 at 4:01 PM
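A minimal sketch of what such a paraphrase-consistency regularizer could look like for a classifier-style RM, in PyTorch. Everything here is illustrative rather than the paper's exact formulation: `rm` is assumed to map a tokenized (prompt, response) batch to scalar rewards, and `batch`, `chosen_paraphrase`, and `aux_weight` are hypothetical names.

```python
import torch.nn.functional as F

def regularized_rm_loss(rm, batch, aux_weight=1.0):
    """Bradley-Terry preference loss plus a paraphrase-consistency term.

    Assumes `rm(inputs)` returns one scalar reward per example and that the
    batch carries a paraphrase of each preferred (chosen) response.
    """
    r_chosen = rm(batch["chosen"])                 # rewards for preferred responses
    r_rejected = rm(batch["rejected"])             # rewards for rejected responses
    r_paraphrase = rm(batch["chosen_paraphrase"])  # rewards for paraphrased preferred responses

    # Standard preference loss: score chosen above rejected.
    bt_loss = -F.logsigmoid(r_chosen - r_rejected).mean()

    # Auxiliary loss: paraphrases of the same response should get similar rewards.
    aux_loss = F.mse_loss(r_paraphrase, r_chosen)

    return bt_loss + aux_weight * aux_loss
```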
We create a benchmark 🌟reWordBench🌟 of systematically transformed RewardBench instances that preserve their semantics and preference ranking 🎛 On it, all top RMs from RewardBench degrade in accuracy ⏬ regardless of their size and type (classifier vs. generative)
March 18, 2025 at 4:01 PM
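The evaluation such a benchmark implies fits in a few lines. A sketch, assuming a hypothetical `score(prompt, response) -> float` wrapper around an RM and an optional meaning-preserving `transform`:

```python
def rm_accuracy(score, triples, transform=None):
    """Fraction of (prompt, chosen, rejected) triples the RM ranks correctly.

    Passing a meaning-preserving `transform`, applied to both responses,
    lets you compare accuracy on original vs. reworded (reWordBench-style)
    versions of the same instances.
    """
    correct = 0
    for prompt, chosen, rejected in triples:
        if transform is not None:
            chosen, rejected = transform(chosen), transform(rejected)
        correct += score(prompt, chosen) > score(prompt, rejected)
    return correct / len(triples)
```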
E.g., all math instances in RewardBench share an artifact: the preferred responses put the result in \boxed{} while the rejected responses put it after a `# Answer` markdown header 💀 Flipping the formats 🔄 consistently degrades SOTA RM accuracy, by up to >22% 📉
March 18, 2025 at 4:01 PM
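A hedged sketch of that format flip as a transformation, using only the two surface formats named above; the regexes are illustrative, and real instances may need more careful handling:

```python
import re

def flip_math_format(response: str) -> str:
    """Swap the two answer formats: \\boxed{...} <-> a '# Answer' markdown header."""
    boxed = re.search(r"\\boxed\{([^{}]*)\}", response)
    if boxed:  # \boxed{} style -> '# Answer' header style
        body = re.sub(r"\\boxed\{([^{}]*)\}", r"\1", response)
        return body.rstrip() + "\n\n# Answer\n\n" + boxed.group(1)
    header = re.search(r"#\s*Answer\s*\n+(.+)", response)
    if header:  # '# Answer' header style -> \boxed{} style
        body = response[: header.start()].rstrip()
        return body + "\n\\boxed{" + header.group(1).strip() + "}"
    return response  # no recognized answer format; leave unchanged
```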
Robust reward models are critical for alignment and inference-time algorithms, automatic evaluation, etc. (e.g., to prevent reward hacking, which can render alignment ineffective). ⚠️ But we find that SOTA RMs are brittle 🫧 and easily flip their predictions when inputs are slightly transformed 🍃 🧵
March 18, 2025 at 4:01 PM
For English-centric models (and analogously for others) 📍1️⃣ semantically equivalent inputs from distinct data types (e.g., English-Chinese parallel sentences, or an image and its caption) have similar representations in intermediate transformer layers 🖇, which function as this transmodal "semantic hub"
December 2, 2024 at 6:08 PM
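One way to probe this claim yourself is to compare layer-wise representations of parallel inputs. A sketch with Hugging Face transformers; the model name is a placeholder for any English-centric LM, and mean pooling is just one reasonable choice of sentence representation:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-2-7b-hf"  # placeholder: any English-centric LM
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, output_hidden_states=True)
model.eval()

def layer_reprs(text):
    """Mean-pooled hidden state of `text` at every transformer layer."""
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids)
    return [h[0].mean(dim=0) for h in out.hidden_states]  # one vector per layer

en = layer_reprs("The weather is nice today.")
zh = layer_reprs("今天天气很好。")  # Chinese translation of the same sentence

# If the hypothesis holds, similarity should peak in intermediate layers.
for layer, (e, z) in enumerate(zip(en, zh)):
    sim = torch.cosine_similarity(e, z, dim=0).item()
    print(f"layer {layer:2d}: cosine similarity {sim:.3f}")
```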
Neuroscience studies posit that the human brain follows a "hub-and-spoke" model in which a transmodal semantic "hub" integrates information from modality-specific "spoke" regions 🕸 We hypothesize that LMs have a similar "semantic hub" that abstractly processes information (fig. from Ralph+17)
December 2, 2024 at 6:08 PM
💡We find that models "think" 💭 in English (or, more generally, their dominant language) when processing non-English or even non-linguistic data types 🤯 like text in other languages, arithmetic expressions, code, visual inputs, & audio inputs‼️ 🧵⬇️ arxiv.org/abs/2411.04986
December 2, 2024 at 6:08 PM
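Findings like this are typically probed with a logit-lens-style readout: decode intermediate hidden states through the model's own unembedding and look at which vocabulary tokens they are closest to. A sketch under that assumption (the attribute paths `model.model.norm` / `model.lm_head` are Llama-specific, and the model name is again a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-2-7b-hf"  # placeholder English-centric LM
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, output_hidden_states=True)
model.eval()

def logit_lens(text, layers=(8, 16, 24, 31)):
    """Decode the last token's intermediate hidden states through the
    final norm + unembedding and print the nearest vocabulary tokens."""
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids)
    for layer in layers:
        h = out.hidden_states[layer][0, -1]          # last-token hidden state
        logits = model.lm_head(model.model.norm(h))  # Llama-specific paths
        top = logits.topk(5).indices.tolist()
        print(f"layer {layer:2d}:", tok.convert_ids_to_tokens(top))

# If the hypothesis holds, mid-layer readouts of a non-English prompt often
# surface English tokens before the final layers return to the input language.
logit_lens("Le chat est assis sur le")
```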