Valentin Hofmann
@valentinhofmann.bsky.social
Postdoc @ai2.bsky.social & @uwnlp.bsky.social
We did not specifically analyze novel models as your paper did. While I am optimistic that Fluid Benchmarking improves over static IRT-based methods in this regime as well, there are definitely limitations, which we discuss in the paragraph below.

Would be exciting to run more experiments on this!
September 19, 2025 at 6:52 PM
In our experiments, we find that this dynamic approach consistently outperforms static IRT-based methods. The improvements are especially pronounced in terms of variance, which poses a major challenge for static IRT-based methods. We discuss this in more detail in the paragraph below.
September 19, 2025 at 6:52 PM
Fluid Benchmarking substantially reduces step-to-step variance during pretraining.

It also increases validity: results generalize better to other benchmarks targeting the same capability. One reason: it automatically avoids mislabeled questions, cutting label errors by 99%! 🤯
September 16, 2025 at 5:16 PM
In our experiments, we apply Fluid Benchmarking to evaluation during pretraining, a setting where capabilities evolve rapidly.

We find that Fluid Benchmarking dynamically adapts to these changes, administering easier questions early in training and more difficult ones later.
September 16, 2025 at 5:16 PM
Fluid Benchmarking repeats this loop until the number of administered questions reaches the allotted budget.

Adaptive question selection means that LLMs face different sets of questions, but ability estimation aligns results in a common space.
September 16, 2025 at 5:16 PM
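(A minimal end-to-end sketch of the adaptive loop described here, assuming a 2PL IRT model and a simple grid-based MAP ability estimate; the helper names, priors, and toy "model" are my own illustration, not the paper's implementation.)

```python
import numpy as np

def p_correct(theta, a, b):
    # 2PL item response function: P(correct | ability theta, discrimination a, difficulty b).
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def estimate_ability(responses, a, b, grid=np.linspace(-4, 4, 801)):
    """MAP estimate of ability on a grid with a standard-normal prior.
    responses: list of (item_index, 0/1) pairs observed so far."""
    log_post = -0.5 * grid**2  # log N(0, 1) prior, up to a constant
    for i, y in responses:
        p = p_correct(grid, a[i], b[i])
        log_post += y * np.log(p) + (1 - y) * np.log(1 - p)
    return grid[np.argmax(log_post)]

def fluid_eval(answer_fn, a, b, budget):
    """Adaptive loop: repeatedly administer the most informative
    remaining question at the current ability estimate."""
    theta, responses, administered = 0.0, [], set()
    for _ in range(budget):
        p = p_correct(theta, a, b)
        info = a**2 * p * (1 - p)              # Fisher information of each item at theta
        info[list(administered)] = -np.inf     # never re-administer a question
        i = int(np.argmax(info))
        administered.add(i)
        responses.append((i, answer_fn(i)))    # 1 if the LLM answers item i correctly
        theta = estimate_ability(responses, a, b)
    return theta  # final ability estimate on the shared IRT scale

# Toy usage: a simulated "model" of true ability 0.8 answering stochastically.
rng = np.random.default_rng(0)
a_params = rng.uniform(0.5, 2.5, size=200)
b_params = rng.normal(0.0, 1.0, size=200)
answer = lambda i: int(rng.random() < p_correct(0.8, a_params[i], b_params[i]))
print(round(fluid_eval(answer, a_params, b_params, budget=30), 2))
```

Because every model is scored on the same θ scale, estimates stay comparable even though each model answers a different subset of questions.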
In Fluid Benchmarking, we start with an initial ability estimate from one question.

To select the next question, we use Fisher information. Essentially, we pick a question whose difficulty (b) is close to the current ability estimate (θ) and whose discrimination (a) is high.

Then we update the estimate.
September 16, 2025 at 5:16 PM
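(A minimal sketch of this selection rule, assuming the 2PL model: there, the Fisher information of an item at ability θ is a²·P(θ)·(1−P(θ)), which peaks when the item's difficulty b is near θ and grows with its discrimination a. Function names and toy numbers are illustrative, not from the paper.)

```python
import numpy as np

def item_information(theta, a, b):
    """Fisher information of a 2PL item at ability theta:
    I(theta) = a^2 * P(theta) * (1 - P(theta))."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a**2 * p * (1.0 - p)

def select_next(theta, a, b, administered):
    # Pick the not-yet-administered item with maximal information at theta.
    info = item_information(theta, a, b)
    info[list(administered)] = -np.inf
    return int(np.argmax(info))

# Toy example: the second item has difficulty closest to theta = 0 and the
# highest discrimination, so it is selected (prints 1).
a = np.array([1.0, 2.5, 1.2])
b = np.array([-1.5, 0.1, 2.0])
print(select_next(theta=0.0, a=a, b=b, administered=set()))
```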
To get a question's difficulty, we use item response theory (IRT): we analyze responses of hundreds of LLMs to see how often a question is answered correctly.

IRT also measures the discrimination of a question, meaning how reliably it separates stronger from weaker LLMs.
September 16, 2025 at 5:16 PM
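(For the curious: a minimal sketch of the item response function, assuming the standard two-parameter logistic (2PL) IRT model that the difficulty (b) and discrimination (a) parameters above suggest; fitting a and b from the response matrix of many LLMs is a standard 2PL estimation problem. Names and values below are illustrative.)

```python
import numpy as np

def p_correct(theta, a, b):
    """2PL item response function: probability that a model with ability
    theta answers a question with discrimination a and difficulty b correctly."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Example: a question with difficulty b = 0.5 and discrimination a = 2.0
# separates models around theta = 0.5 fairly sharply.
for theta in [-1.0, 0.0, 0.5, 1.0, 2.0]:
    print(theta, round(p_correct(theta, a=2.0, b=0.5), 3))
```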
📢 New #COLM2025 paper 📢

Standard benchmarks give every LLM the same questions. This is like testing 5th graders and college seniors with *one* exam! 🥴

Meet Fluid Benchmarking, a capability-adaptive eval method delivering lower variance, higher validity, and reduced cost.

🧵
September 16, 2025 at 5:16 PM
Great to see the International AI Safety Report highlight research on dialect prejudice, including our work on covert racism in LLMs!

www.nature.com/articles/s41...
January 31, 2025 at 5:43 AM
We observe a frequency effect across all adjective classes that gradually gets stronger the more variable an adjective class is.

This is again exactly in line with analogical models, where rule-like behavior lies at one end of a gradient characterized by varying levels of regularity.
December 5, 2024 at 4:51 PM
As expected, rule-based and analogical models make the same predictions for regular adjective classes (-able, -ish) and thus explain GPT-J's behavior equally well.

However, for variable adjective classes (-ive, -ous), the analogical model results in a significantly better match.
December 5, 2024 at 4:51 PM
Here, we examine adjective nominalization with -ity and -ness.

While some adjective classes like adjectives with -ish have a clear preference for -ity or -ness (selfishness), others like adjectives with -ive sometimes prefer -ity (connectivity), sometimes -ness (effectiveness).
December 5, 2024 at 4:51 PM
📢 New paper 📢

What generalization mechanisms shape the language skills of LLMs?

Prior work has claimed that LLMs learn language via rules.

We revisit the question and find that superficially rule-like behavior of LLMs can be traced to underlying analogical processes.

🧵
December 5, 2024 at 4:51 PM