Serina Chang
@serinachang5.bsky.social
Incoming Assistant Professor at UC Berkeley in CS and Computational Precision Health. Postdoc at Microsoft Research, PhD in CS at Stanford. Research in AI, graphs, public health, and computational social science.

https://serinachang5.github.io/
Check out ChatBench online and see our paper for analyses of the user-AI conversations! Thanks to my fantastic collaborators at
@msftresearch.bsky.social, @ashtonanderson.bsky.social and @jakehofman.bsky.social! 9/

ChatBench: huggingface.co/datasets/mic...
Paper: arxiv.org/abs/2504.07114
April 11, 2025 at 5:57 PM
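(A minimal sketch of pulling the ChatBench data with the Hugging Face `datasets` library. The dataset ID and column names below are placeholders, not confirmed; take the exact ID from the link in the post above.)

```python
from datasets import load_dataset

# Placeholder ID -- use the exact dataset name from the Hugging Face link above.
ds = load_dataset("microsoft/ChatBench")
print(ds)  # shows the available splits and columns

# Peek at a few rows; the column names here ("question", "conversation") are guesses.
split = next(iter(ds.values()))
for row in split.select(range(3)):
    print(row.get("question"), row.get("conversation"))
```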
Fine-tuning greatly improves the simulator’s ability to estimate real user-AI accuracy, increasing correlation on unseen questions by >20 pts. Our results demonstrate the promise of simulation for scaling interactive evaluation, but also the need to test simulators against real human behavior. 8/
April 11, 2025 at 5:57 PM
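(A toy sketch of the kind of check behind the post above: correlating the simulator's per-question user-AI accuracy estimates with the accuracies measured on real users. The numbers are invented, and Pearson correlation is an assumption, not necessarily the paper's exact metric.)

```python
import numpy as np

real_acc = [0.82, 0.55, 0.91, 0.40, 0.67]       # real user-AI accuracy per unseen question (made up)
simulated_acc = [0.78, 0.60, 0.88, 0.35, 0.70]  # simulator's estimate for the same questions (made up)

r = np.corrcoef(real_acc, simulated_acc)[0, 1]
print(f"correlation r = {r:.2f}")
```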
These results motivate incorporating human interaction into AI evaluation. But how do we do this at scale? We propose an LLM-based user simulator and transform user-AI conversations + answers from ChatBench into supervised fine-tuning data for the simulator. 7/
April 11, 2025 at 5:57 PM
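(A minimal sketch of one way to turn a user-AI conversation into supervised fine-tuning examples for a user simulator: at each user turn, the simulator is trained to produce the user's message given the benchmark question and the dialogue so far. The conversation schema and system prompt are assumptions, not the paper's exact recipe.)

```python
def conversation_to_sft_examples(question: str, turns: list[dict]) -> list[dict]:
    """turns: [{"role": "user" | "assistant", "text": ...}, ...] in chronological order."""
    system = (
        "You are simulating a human user trying to answer this question with an AI assistant:\n"
        + question
    )
    examples = []
    history: list[dict] = []
    for turn in turns:
        if turn["role"] == "user":
            # Train the simulator to say what the real user said next, given the context so far.
            examples.append({
                "messages": [{"role": "system", "content": system}, *history],
                "target": turn["text"],
            })
        history.append({"role": turn["role"], "content": turn["text"]})
    return examples

# Toy usage with an invented conversation
toy = [
    {"role": "user", "text": "Can you help me with this physics question? ..."},
    {"role": "assistant", "text": "Sure -- the key idea is conservation of momentum ..."},
    {"role": "user", "text": "Got it, so the answer is (B)?"},
]
for ex in conversation_to_sft_examples("A 2 kg cart collides with ...", toy):
    print(len(ex["messages"]), "context messages ->", ex["target"][:40])
```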
Across subjects, models, AI-alone methods, and user-AI conditions, AI-alone fails to predict user-AI accuracy. Letter-only is especially bad, with a mean gap of 21 pts. Free-text is better, with a mean gap of 10 pts, but still differs significantly from user-AI in many cases. 5/
April 11, 2025 at 5:57 PM
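(A toy illustration of the "mean gap" in the post above: the absolute difference between AI-alone accuracy and user-AI accuracy, averaged across evaluation cells. All numbers are invented.)

```python
ai_alone = {"physics": 0.84, "elementary math": 0.97, "moral reasoning": 0.90}  # made up
user_ai  = {"physics": 0.70, "elementary math": 0.85, "moral reasoning": 0.72}  # made up

gaps = [abs(ai_alone[k] - user_ai[k]) for k in ai_alone]
print(f"mean gap = {100 * sum(gaps) / len(gaps):.0f} pts")
```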
For AI-alone, we test (1) common letter-only methods that require the model to answer with a single letter, and (2) our free-text method, where the model responds without any constraints and GPT-4o extracts an answer, simulating a user copy-pasting the question and then looking for the answer. 4/
April 11, 2025 at 5:57 PM
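(A minimal sketch of the free-text AI-alone setup from the post above, using the OpenAI Python SDK: the model answers with no format constraints, then a second GPT-4o call extracts a single letter from the free-text response. The prompts and extraction wording are assumptions, not the paper's exact ones.)

```python
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment

def free_text_ai_alone(question: str, choices: str, answer_model: str = "gpt-4o") -> str:
    # 1) Unconstrained answer, as if a user pasted the question into a chat window.
    answer = client.chat.completions.create(
        model=answer_model,
        messages=[{"role": "user", "content": f"{question}\n{choices}"}],
    ).choices[0].message.content

    # 2) GPT-4o extracts the chosen option, mimicking a user scanning the reply for the answer.
    extraction = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": "Which option (A, B, C, or D) does this response choose? "
                       f"Reply with a single letter.\n\n{answer}",
        }],
    ).choices[0].message.content
    return extraction.strip()

# Example (hypothetical question):
# print(free_text_ai_alone("What is the capital of France?",
#                          "A. Paris\nB. Rome\nC. Madrid\nD. Berlin"))
```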
We conduct a large-scale user study on Prolific, collecting data across 5 MMLU datasets (physics, moral reasoning, three levels of math), 2 models (GPT-4o & Llama-3.1-8b), and 2 user-AI conditions (user answers first vs directly with AI) → 7.3K user-AI conversations. 3/
April 11, 2025 at 5:57 PM
Standard benchmarks test AI on its own (“AI-alone”) on static questions, missing human variability, interactivity, and writing style. We bring benchmarks to life by seeding human users with a benchmark question and having them interact with the LLM to answer it. 2/
April 11, 2025 at 5:57 PM
1st post on bsky!

What happens when a static benchmark comes to life? ✨ Introducing ChatBench, a large-scale user study where we *converted* MMLU questions into thousands of user-AI conversations. Then, we trained a user simulator on ChatBench to generate user-AI outcomes on unseen questions. 1/ 🧵
April 11, 2025 at 5:57 PM