They use FlenQA (w/ @moshlevy.bsky.social) to show their model improves here massively.
arxiv.org/abs/2504.21318
Linguistic evaluations of LLMs often implicitly assume that language is generated by symbolic rules.
In a new position paper, @adelegoldberg.bsky.social, @kmahowald.bsky.social and I argue that languages are not Lego sets, and evaluations should reflect this!
arxiv.org/pdf/2502.13195
Also, there used to be a CSV of checkpoints in the OLMo repo, but it's gone (I'm guessing since OLMo 2)...
Help would be appreciated!
How can we ensure that models are accurate on samples from classes rarely seen in training?