Zihao Wang, Rui Sun, Wei Liu, Weijia Shi, @huaxiuyaoml.bsky.social, Linjun Zhang, Andrew Ng, @jameszou.bsky.social, @sanmikoyejo.bsky.social, @yejinchoinka.bsky.social, Percy Liang, @stanfordnlp.bsky.social, @stanfordhai.bsky.social
UQ is an exploratory effort at creating a new paradigm for AI evals:
🌐 Platform: uq.stanford.edu
📄 Paper: arxiv.org/abs/2508.17580
💻 Code: github.com/uq-project/UQ
🤗 Data: huggingface.co/datasets/uq-...
Thanks to my wonderful project co-leads Fan Nie (applying for PhD!) and Niklas Muennighoff!!
*UQ-Platform* (uq.stanford.edu) then continues where UQ-Validators leave off. It hosts the UQ-Dataset with AI answers and UQ-validation results, and experts can then rate AI answers, comment, and otherwise help resolve open questions -- just like Stack Exchange :). We need YOU to write reviews!
*UQ-Validators* are simply LLMs (and compound LLM scaffolds) trying to pre-screen candidate answers to unsolved questions *without ground-truth answers*.
The key intuition is that it may be easier for LLMs to *validate* answers to hard questions (e.g. spotting mistakes) than to *generate* them.
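To make the idea concrete, here is a minimal sketch of what such a validator loop could look like (illustrative only, not the exact UQ-Validator scaffold from the paper): an LLM is prompted to critique a candidate answer and return an accept/reject verdict, with no reference answer involved. The client setup, prompt wording, model name, and verdict parsing are all assumptions for this sketch.

```python
# Minimal sketch of an LLM-as-validator loop (illustrative; not the paper's scaffold).
# Assumes an OpenAI-compatible client and a placeholder model name.
from openai import OpenAI

client = OpenAI()

VALIDATOR_PROMPT = """You are reviewing a candidate answer to an unsolved question.
There is no ground-truth answer. Check the reasoning step by step, flag any
mistakes or unsupported claims, then end with a single verdict word: ACCEPT or REJECT.

Question:
{question}

Candidate answer:
{answer}
"""

def validate(question: str, answer: str, model: str = "gpt-4o") -> dict:
    """Ask an LLM to critique a candidate answer and return a rough verdict."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": VALIDATOR_PROMPT.format(
            question=question, answer=answer)}],
    )
    critique = resp.choices[0].message.content
    # Naive verdict extraction; a real scaffold would parse/aggregate more carefully.
    return {"critique": critique, "accept": critique.strip().endswith("ACCEPT")}
```

A compound scaffold would layer on top of this single call, e.g. sampling several critiques and only passing an answer to human experts if most of them accept.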
In contrast, we aim for UQ-Dataset to be difficult and realistic *by construction*: unsolved questions are often hard and naturally arise when humans seek answers, thus progress yields real-world value.
The trade-off: we have to figure out how to evaluate models without ground-truth answers...
UQ started with the observation that benchmark saturation has led to a *difficulty-realism tension*:
1. We contrive harder exams that begin to lose touch with real-world model usage
2. We build realistic evals (e.g. based on human preferences) that quickly become easy and/or hackable
Here are some sample questions from the UQ-Dataset, which spans math, physics, CS theory, history, puzzles, sci-fi, and more; see uq.stanford.edu for the full list!
Our main idea: rather than having static benchmarks scored once, can we evaluate LLMs *continuously and asynchronously* on real-world Qs with an actual need?
UQ-Dataset provides inputs → UQ-Validators screen outputs → UQ-Platform hosts live verification and model ranking.
The UQ project has 3 parts:
1. UQ-Dataset: 500 hard, popular, old, yet unanswered questions from the Stack Exchange network
2. UQ-Validators: LLM critics to pre-screen model answers
3. UQ-Platform (uq.stanford.edu): community verification (think AI-native Stack Exchange!)