Serina Chang
@serinachang5.bsky.social
Incoming Assistant Professor at UC Berkeley in CS and Computational Precision Health. Postdoc at Microsoft Research, PhD in CS at Stanford. Research in AI, graphs, public health, and computational social science.

https://serinachang5.github.io/
Check out ChatBench online and see our paper for analyses of the user-AI conversations! Thanks to my fantastic collaborators at
@msftresearch.bsky.social, @ashtonanderson.bsky.social and @jakehofman.bsky.social! 9/

ChatBench: huggingface.co/datasets/mic...
Paper: arxiv.org/abs/2504.07114
April 11, 2025 at 5:57 PM
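(A minimal sketch of pulling the ChatBench data with the Hugging Face `datasets` library. The dataset ID and column names below are placeholders, not confirmed; take the exact ID from the link in the post above.)

```python
from datasets import load_dataset

# Placeholder ID -- use the exact dataset name from the Hugging Face link above.
ds = load_dataset("microsoft/ChatBench")
print(ds)  # shows the available splits and columns

# Peek at a few rows; the column names here ("question", "conversation") are guesses.
split = next(iter(ds.values()))
for row in split.select(range(3)):
    print(row.get("question"), row.get("conversation"))
```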
Fine-tuning greatly improves the simulator’s ability to estimate real user-AI accuracy, increasing correlation on unseen questions by >20 pts. Our results demonstrate the promise of simulation for scaling interactive evaluation, but also the need to test simulators against real human behavior. 8/
April 11, 2025 at 5:57 PM
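(A toy sketch of the kind of check behind the post above: correlating the simulator's per-question user-AI accuracy estimates with the accuracies measured on real users. The numbers are invented, and Pearson correlation is an assumption, not necessarily the paper's exact metric.)

```python
import numpy as np

real_acc = [0.82, 0.55, 0.91, 0.40, 0.67]       # real user-AI accuracy per unseen question (made up)
simulated_acc = [0.78, 0.60, 0.88, 0.35, 0.70]  # simulator's estimate for the same questions (made up)

r = np.corrcoef(real_acc, simulated_acc)[0, 1]
print(f"correlation r = {r:.2f}")
```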
These results motivate incorporating human interaction into AI evaluation. But how do we do this at scale? We propose an LLM-based user simulator and transform user-AI conversations + answers from ChatBench into supervised fine-tuning data for the simulator. 7/
April 11, 2025 at 5:57 PM
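(A minimal sketch of one way to turn a user-AI conversation into supervised fine-tuning examples for a user simulator: at each user turn, the simulator is trained to produce the user's message given the benchmark question and the dialogue so far. The conversation schema and system prompt are assumptions, not the paper's exact recipe.)

```python
def conversation_to_sft_examples(question: str, turns: list[dict]) -> list[dict]:
    """turns: [{"role": "user" | "assistant", "text": ...}, ...] in chronological order."""
    system = (
        "You are simulating a human user trying to answer this question with an AI assistant:\n"
        + question
    )
    examples = []
    history: list[dict] = []
    for turn in turns:
        if turn["role"] == "user":
            # Train the simulator to say what the real user said next, given the context so far.
            examples.append({
                "messages": [{"role": "system", "content": system}, *history],
                "target": turn["text"],
            })
        history.append({"role": turn["role"], "content": turn["text"]})
    return examples

# Toy usage with an invented conversation
toy = [
    {"role": "user", "text": "Can you help me with this physics question? ..."},
    {"role": "assistant", "text": "Sure -- the key idea is conservation of momentum ..."},
    {"role": "user", "text": "Got it, so the answer is (B)?"},
]
for ex in conversation_to_sft_examples("A 2 kg cart collides with ...", toy):
    print(len(ex["messages"]), "context messages ->", ex["target"][:40])
```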
Across subjects, models, AI-alone methods, and user-AI conditions, AI-alone fails to predict user-AI accuracy. Letter-only is especially bad, with a mean gap of 21 pts. Free-text is better, with a mean gap of 10 pts, but still differs significantly from user-AI in many cases. 5/
April 11, 2025 at 5:57 PM
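(A toy illustration of the "mean gap" in the post above: the absolute difference between AI-alone accuracy and user-AI accuracy, averaged across evaluation cells. All numbers are invented.)

```python
ai_alone = {"physics": 0.84, "elementary math": 0.97, "moral reasoning": 0.90}  # made up
user_ai  = {"physics": 0.70, "elementary math": 0.85, "moral reasoning": 0.72}  # made up

gaps = [abs(ai_alone[k] - user_ai[k]) for k in ai_alone]
print(f"mean gap = {100 * sum(gaps) / len(gaps):.0f} pts")
```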
For AI-alone, we test (1) common letter-only methods that require the model to answer with a single letter, and (2) our free-text method, where the model responds without any constraints and GPT-4o extracts an answer, simulating a user copy-pasting the question and then looking for the answer. 4/
April 11, 2025 at 5:57 PM
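(A minimal sketch of the free-text AI-alone setup from the post above, using the OpenAI Python SDK: the model answers with no format constraints, then a second GPT-4o call extracts a single letter from the free-text response. The prompts and extraction wording are assumptions, not the paper's exact ones.)

```python
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment

def free_text_ai_alone(question: str, choices: str, answer_model: str = "gpt-4o") -> str:
    # 1) Unconstrained answer, as if a user pasted the question into a chat window.
    answer = client.chat.completions.create(
        model=answer_model,
        messages=[{"role": "user", "content": f"{question}\n{choices}"}],
    ).choices[0].message.content

    # 2) GPT-4o extracts the chosen option, mimicking a user scanning the reply for the answer.
    extraction = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": "Which option (A, B, C, or D) does this response choose? "
                       f"Reply with a single letter.\n\n{answer}",
        }],
    ).choices[0].message.content
    return extraction.strip()

# Example (hypothetical question):
# print(free_text_ai_alone("What is the capital of France?",
#                          "A. Paris\nB. Rome\nC. Madrid\nD. Berlin"))
```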
We conduct a large-scale user study on Prolific, collecting data across 5 MMLU datasets (physics, moral reasoning, three levels of math), 2 models (GPT-4o & Llama-3.1-8b), and 2 user-AI conditions (user answers first vs directly with AI) → 7.3K user-AI conversations. 3/
April 11, 2025 at 5:57 PM
Standard benchmarks test AI on its own (“AI-alone”) on static questions, missing human variability, interactivity, and writing style. We bring benchmarks to life by seeding human users with a benchmark question and having them interact with the LLM to answer it. 2/
April 11, 2025 at 5:57 PM
1st post on bsky!

What happens when a static benchmark comes to life? ✨ Introducing ChatBench, a large-scale user study where we *converted* MMLU questions into thousands of user-AI conversations. Then, we trained a user simulator on ChatBench to generate user-AI outcomes on unseen questions. 1/ 🧵
April 11, 2025 at 5:57 PM