Would be exciting to run more experiments on this!
It also increases validity: results generalize better to other benchmarks targeting the same capability. One reason: it automatically avoids mislabeled questions, cutting label errors by 99%! 🤯
We find that Fluid Benchmarking dynamically adapts to these changes, administering easier questions early in training and more difficult ones later.
Adaptive question selection means that LLMs face different sets of questions, but ability estimation aligns results in a common space.
To select the next question, we use Fisher information. Essentially, we pick a question whose difficulty (b) is close to the current ability estimate (θ) and whose discrimination (a) is high.
Then we update the estimate.
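The select-then-update loop can be sketched in a few lines under a standard 2PL IRT model. This is a minimal illustration, not Fluid Benchmarking's actual implementation; the function names and the simple gradient-ascent ability update are my own choices for the sketch.

```python
import math

def p_correct(theta, a, b):
    # 2PL IRT: probability of a correct response given ability theta,
    # discrimination a, and difficulty b
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def fisher_info(theta, a, b):
    # Fisher information of an item at ability theta: a^2 * p * (1 - p).
    # Largest when difficulty b is near theta and discrimination a is high.
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

def next_question(theta, items, asked):
    # Pick the not-yet-asked item (a, b) with the highest Fisher information.
    candidates = [i for i in range(len(items)) if i not in asked]
    return max(candidates, key=lambda i: fisher_info(theta, *items[i]))

def update_theta(theta, responses, items, lr=0.5, steps=25):
    # Re-estimate ability by gradient ascent on the 2PL log-likelihood
    # of the observed responses (1 = correct, 0 = incorrect).
    for _ in range(steps):
        grad = sum(a * (y - p_correct(theta, a, b))
                   for y, (a, b) in zip(responses, items))
        theta += lr * grad
    return theta
```

For example, with items = [(1.5, -1.0), (1.0, 0.0), (2.0, 1.0)] and a current estimate theta = 0.0, `next_question` selects the third item: its high discrimination outweighs its difficulty being one unit away from theta.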
IRT also measures the discrimination of a question, meaning how reliably it separates stronger from weaker LLMs.
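Discrimination is easy to see numerically: under a 2PL response curve (a hypothetical sketch, not the paper's code), an item with high a produces a much larger gap in success probability between a weaker and a stronger model than an equally difficult item with low a.

```python
import math

def p_correct(theta, a, b):
    # 2PL item response curve: P(correct | ability theta)
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Two items of equal difficulty (b = 0) but different discrimination a.
weak, strong = -1.0, 1.0  # abilities of a weaker vs. a stronger LLM
gap_low_a = p_correct(strong, 0.5, 0.0) - p_correct(weak, 0.5, 0.0)
gap_high_a = p_correct(strong, 2.0, 0.0) - p_correct(weak, 2.0, 0.0)
# The high-discrimination item separates the two models far more sharply.
```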
Standard benchmarks give every LLM the same questions. This is like testing 5th graders and college seniors with *one* exam! 🥴
Meet Fluid Benchmarking, a capability-adaptive eval method delivering lower variance, higher validity, and reduced cost.
🧵
www.nature.com/articles/s41...
This is again exactly in line with analogical models, where rule-like behavior is the end of a gradient characterized by varying levels of regularity.
However, for variable adjective classes (-ive, -ous), the analogical model results in a significantly better match.
While some adjective classes have a clear preference for -ity or -ness (e.g., -ish adjectives take -ness: selfishness), others, like -ive adjectives, sometimes prefer -ity (connectivity) and sometimes -ness (effectiveness).
What generalization mechanisms shape the language skills of LLMs?
Prior work has claimed that LLMs learn language via rules.
We revisit the question and find that superficially rule-like behavior of LLMs can be traced to underlying analogical processes.
🧵