Valentin Hofmann
@valentinhofmann.bsky.social
Postdoc @ai2.bsky.social & @uwnlp.bsky.social
Thanks, Jordan! Your ACL 2021 paper was a huge source of inspiration for us!
September 19, 2025 at 7:04 PM
We did not specifically analyze novel models as your paper did. While I am optimistic that Fluid Benchmarking improves over static IRT-based methods in this regime as well, there are definitely limitations, which we discuss in the paragraph below.
Would be exciting to run more experiments on this!
September 19, 2025 at 6:52 PM
In our experiments, we find that this dynamic approach consistently outperforms static IRT-based methods. The improvements are especially pronounced in terms of variance, which poses a major challenge for static IRT-based methods. We discuss this in more detail in the paragraph below.
September 19, 2025 at 6:52 PM
Great question! The key difference is that we use IRT to dynamically adapt the subset of items to a model's capability, rather than to determine a static, "globally optimal" subset of items as in prior work. With Fluid Benchmarking, each model is evaluated on a different subset of items.
September 19, 2025 at 6:52 PM
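To make the contrast concrete, here is a minimal sketch (my own illustration, not the code from our repo), assuming pre-fit 2PL item parameters a (discrimination) and b (difficulty). Static selection picks one subset for everyone; fluid-style selection targets each model's own ability estimate (and in the real method does so sequentially, updating the estimate after every question):

```python
import numpy as np

def item_information(theta, a, b):
    # Fisher information of 2PL items at ability theta:
    # I(theta) = a^2 * p * (1 - p), with p = sigmoid(a * (theta - b))
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a**2 * p * (1.0 - p)

# Toy item bank: discrimination (a) and difficulty (b) for six questions.
a = np.array([1.2, 0.8, 1.5, 1.0, 2.0, 0.5])
b = np.array([-2.0, -1.0, 0.0, 0.5, 1.0, 2.0])
k = 3

# Static IRT-based selection: one subset, e.g. most informative at an average ability of 0.
static_subset = np.argsort(-item_information(0.0, a, b))[:k]

# Fluid-style selection: the most informative items at EACH model's ability estimate.
weak_subset = np.argsort(-item_information(-1.5, a, b))[:k]
strong_subset = np.argsort(-item_information(1.5, a, b))[:k]

print("static:", static_subset, "weak model:", weak_subset, "strong model:", strong_subset)
```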
Last but not least, a huge shoutout to my incredible coauthors @davidheineman.com, @ianmagnusson.bsky.social, @kylelo.bsky.social, @jessedodge.bsky.social, @maartensap.bsky.social, Pang Wei Koh, Chun Wang, @hanna-nlp.bsky.social, and @nlpnoah.bsky.social! 🤗
September 16, 2025 at 5:16 PM
For details, check out our paper, blog, code, and data:
📄 arxiv.org/abs/2509.11106
✍️ allenai.org/blog/fluid-b...
💻 github.com/allenai/flui...
📊 huggingface.co/datasets/all...
Looking forward to chatting more at #COLM2025! 👋
September 16, 2025 at 5:16 PM
Overall, our work shows that LLM evaluations can be substantially improved by moving beyond the currently universal practice of static benchmarking, which assumes a single, globally optimal set of evaluation questions for all models.
September 16, 2025 at 5:16 PM
These advantages (and more) are achieved while simultaneously reducing evaluation cost.
Example: on MMLU, Fluid Benchmarking results in lower step-to-step variance and higher validity than standard methods while using 50 times fewer questions. ⚡
September 16, 2025 at 5:16 PM
Fluid Benchmarking substantially reduces step-to-step variance during pretraining.
It also increases validity: results generalize better to other benchmarks targeting the same capability. One reason: it automatically avoids mislabeled questions, cutting label errors by 99%! 🤯
September 16, 2025 at 5:16 PM
In our experiments, we apply Fluid Benchmarking to evaluation during pretraining, a setting where capabilities evolve rapidly.
We find that Fluid Benchmarking dynamically adapts to these changes, administering easier questions early in training and more difficult ones later.
September 16, 2025 at 5:16 PM
Fluid Benchmarking repeats this loop until the number of administered questions reaches the allotted budget.
Adaptive question selection means that LLMs face different sets of questions, but ability estimation aligns results in a common space.
September 16, 2025 at 5:16 PM
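For anyone who likes to read code, here is a rough end-to-end sketch of that loop (illustrative only; the function and variable names are mine, not from our released code). It assumes 2PL item parameters a (discrimination) and b (difficulty) have already been fit, and re-estimates ability after each question by a grid search:

```python
import numpy as np

def p_correct(theta, a, b):
    # 2PL item response function: P(correct | ability theta, item params a, b)
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def estimate_ability(responses, a, b, grid=np.linspace(-4, 4, 801)):
    # MAP ability estimate over a grid, with a standard-normal prior on theta
    # (the prior keeps the estimate stable when only a few items have been answered).
    idx = np.array(list(responses.keys()))
    y = np.array(list(responses.values()), dtype=float)
    p = p_correct(grid[:, None], a[idx], b[idx])             # shape (grid points, answered items)
    log_post = (y * np.log(p) + (1 - y) * np.log(1 - p)).sum(axis=1) - 0.5 * grid**2
    return grid[np.argmax(log_post)]

def fisher_information(theta, a, b):
    # I(theta) = a^2 * p * (1 - p): high for discriminative items with difficulty near theta
    p = p_correct(theta, a, b)
    return a**2 * p * (1.0 - p)

def fluid_benchmark(answer_fn, a, b, budget=100, theta0=0.0):
    # answer_fn(i) -> 0/1 is the scored answer of the evaluated LM on item i.
    responses, theta = {}, theta0
    for _ in range(budget):
        info = fisher_information(theta, a, b)
        if responses:
            info[list(responses)] = -np.inf                  # never repeat an item
        next_item = int(np.argmax(info))                     # most informative item
        responses[next_item] = answer_fn(next_item)          # administer it
        theta = estimate_ability(responses, a, b)            # update the ability estimate
    return theta, responses

# Toy usage: simulate a "model" with true ability 0.7 answering items from a 2PL bank.
rng = np.random.default_rng(0)
a = rng.uniform(0.5, 2.0, size=1000)
b = rng.normal(0.0, 1.0, size=1000)
simulated_lm = lambda i: int(rng.random() < p_correct(0.7, a[i], b[i]))
theta_hat, _ = fluid_benchmark(simulated_lm, a, b, budget=60)
print("estimated ability:", round(float(theta_hat), 2))
```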
In Fluid Benchmarking, we start with an initial ability estimate from one question.
To select the next question, we use Fisher information. Essentially, we pick a question whose difficulty (b) is close to the current ability estimate (θ) and whose discrimination (a) is high.
Then we update the estimate.
September 16, 2025 at 5:16 PM
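In formulas: for a 2PL question with discrimination a and difficulty b, the Fisher information at ability θ is I(θ) = a² · p(θ) · (1 − p(θ)) with p(θ) = σ(a(θ − b)), so it peaks when b is near θ and grows with a. A tiny sketch of one selection step (toy numbers, my own illustration):

```python
import numpy as np

def fisher_information(theta, a, b):
    # I(theta) = a^2 * p * (1 - p) for a 2PL item, p = sigmoid(a * (theta - b))
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a**2 * p * (1.0 - p)

theta_hat = 0.3                            # current ability estimate
a = np.array([0.6, 1.8, 1.8, 1.2])         # discrimination
b = np.array([0.3, -2.0, 0.4, 2.5])        # difficulty

info = fisher_information(theta_hat, a, b)
print(np.round(info, 3))                   # item 2 wins: high a AND difficulty near theta_hat
print("next question:", int(np.argmax(info)))
```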
In addition, IRT models each LLM's ability, which can be estimated from its responses to questions with known difficulty and discrimination.
The IRT ability estimate summarizes performance much like accuracy does, but unlike accuracy it accounts for question characteristics such as difficulty and discrimination.
September 16, 2025 at 5:16 PM
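As a rough sketch of what that estimation looks like (item parameters a and b assumed known; not our actual code): choose the θ that maximizes the 2PL likelihood of the observed right/wrong pattern.

```python
import numpy as np

def estimate_ability(y, a, b, grid=np.linspace(-4, 4, 801)):
    # Grid-search maximum-likelihood estimate of ability theta under a 2PL model,
    # given binary responses y and known item parameters a (discrimination), b (difficulty).
    p = 1.0 / (1.0 + np.exp(-a * (grid[:, None] - b)))      # shape (grid points, items)
    log_lik = (y * np.log(p) + (1 - y) * np.log(1 - p)).sum(axis=1)
    return grid[np.argmax(log_lik)]

a = np.array([1.0, 1.5, 0.8, 2.0, 1.2])       # known from the IRT fit
b = np.array([-1.0, 0.0, 0.5, 1.0, 2.0])      # known from the IRT fit
y = np.array([1, 1, 1, 0, 0], dtype=float)    # the LM solved the three easier items

print("ability estimate:", float(estimate_ability(y, a, b)))
```

Because θ lives on a shared scale, estimates remain comparable even when two LMs answered different questions.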
To get a question's difficulty, we use item response theory (IRT): we analyze responses of hundreds of LLMs to see how often a question is answered correctly.
IRT also measures the discrimination of a question, meaning how reliably it separates stronger from weaker LLMs.
September 16, 2025 at 5:16 PM
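For the curious, here is a toy sketch of how 2PL item parameters can be fit from a models × questions matrix of right/wrong responses, via gradient ascent on the joint log-likelihood (a simplification for illustration; it is not the fitting code we actually used):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a response matrix: 200 "LLMs" x 50 questions drawn from a 2PL model.
n_models, n_items = 200, 50
true_theta = rng.normal(0.0, 1.0, n_models)        # abilities
true_a = rng.uniform(0.5, 2.0, n_items)            # discriminations
true_b = rng.normal(0.0, 1.0, n_items)             # difficulties
P = 1.0 / (1.0 + np.exp(-true_a * (true_theta[:, None] - true_b)))
Y = (rng.random((n_models, n_items)) < P).astype(float)

# Joint maximum-likelihood fit of abilities and item parameters by gradient ascent.
theta = np.zeros(n_models)
a = np.ones(n_items)
b = np.zeros(n_items)
lr = 0.5
for _ in range(3000):
    p = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b)))
    resid = Y - p                                          # gradient of the log-lik wrt the logits
    theta += lr * (resid * a).mean(axis=1)                 # d logL / d theta_m
    a += lr * (resid * (theta[:, None] - b)).mean(axis=0)  # d logL / d a_i
    b += lr * (resid * -a).mean(axis=0)                    # d logL / d b_i
    theta = (theta - theta.mean()) / theta.std()           # IRT is only identified up to scale/shift

print("difficulty recovery (corr):", round(np.corrcoef(b, true_b)[0, 1], 2))
print("discrimination recovery (corr):", round(np.corrcoef(a, true_a)[0, 1], 2))
```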
Test theory says: questions are most informative when matched to a test taker's ability.
For LLMs, that means evaluating weaker models on easier questions and stronger models on harder ones.
But how do we know a question's difficulty, or an LLM's ability, before evaluation? 🤔
September 16, 2025 at 5:16 PM
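A quick numerical illustration of that matching principle under a 2PL model (toy numbers, illustrative only): each question carries the most information about the model whose ability is closest to its difficulty.

```python
import numpy as np

def item_information(theta, a, b):
    # Fisher information of a 2PL question at ability theta
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a**2 * p * (1.0 - p)

abilities = np.array([-2.0, 0.0, 2.0])        # weak, medium, strong model
for difficulty in (-2.0, 0.0, 2.0):           # easy, medium, hard question
    info = item_information(abilities, a=1.5, b=difficulty)
    print(f"difficulty {difficulty:+.1f}:", np.round(info, 3))
# Each row peaks for the model whose ability equals the question's difficulty.
```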
Huge congrats, Adam!!! 🎉
May 29, 2025 at 4:15 PM