Alon Jacoby
alon-j.bsky.social
PhD student @ Penn
alonj.github.io
The Phi-4 Reasoning technical report is a good reminder that current models still suffer massive performance degradation as reasoning tasks get longer - even at just 3K tokens!
They use FlenQA (w/ @moshlevy.bsky.social) to show that their model improves massively here.
arxiv.org/abs/2504.21318
May 7, 2025 at 2:07 PM
Reposted by Alon Jacoby
✨New paper✨

Linguistic evaluations of LLMs often implicitly assume that language is generated by symbolic rules.
In a new position paper, @adelegoldberg.bsky.social, @kmahowald.bsky.social and I argue that languages are not Lego sets, and evaluations should reflect this!

arxiv.org/pdf/2502.13195
February 20, 2025 at 3:06 PM
Sometimes I want to track small changes in code without too much hassle, so I made pydift: replace "python script.py" with "pydift script.py", and diffs from previous runs will be saved automatically.
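pydift's own internals aren't shown in the post; as a minimal sketch of the idea (assumed names and storage layout, not pydift's actual implementation), a wrapper can snapshot the script on each run, save a unified diff against the previous snapshot, and then execute the script as usual:

```python
# Hypothetical sketch of a pydift-style wrapper (assumed names; not pydift's
# actual implementation): save a unified diff of the script against the
# snapshot from the previous run, then execute the script.
import difflib
import pathlib
import subprocess
import sys
import time

def run_with_diff(script_path: str) -> None:
    script = pathlib.Path(script_path)
    store = script.parent / ".pydift"  # assumed storage location
    store.mkdir(exist_ok=True)
    snapshot = store / (script.name + ".last")

    current = script.read_text().splitlines(keepends=True)
    if snapshot.exists():
        previous = snapshot.read_text().splitlines(keepends=True)
        diff = "".join(difflib.unified_diff(previous, current,
                                            fromfile="previous",
                                            tofile="current"))
        if diff:  # only record runs where the script actually changed
            stamp = time.strftime("%Y%m%d-%H%M%S")
            (store / f"{script.name}.{stamp}.diff").write_text(diff)

    snapshot.write_text("".join(current))  # update snapshot for the next run
    subprocess.run([sys.executable, str(script)], check=False)
```

The same "diff before run" pattern works for any interpreter: swap `sys.executable` for the command you'd normally use.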
February 2, 2025 at 11:19 PM
How does one figure out exactly which samples were seen in training for a given OLMo checkpoint? Where is that information shared or stored?
Also, there used to be a CSV of checkpoints in the OLMo repo, but it's gone (my guess: since OLMo 2)...
Help would be appreciated
December 15, 2024 at 1:48 PM
We've come to expect LLMs to be generalist models that are accurate in zero-shot settings - such as QA across domains, reasoning types, or even low-resource languages.
How can we ensure that models are accurate on samples from classes rarely seen in training?
I'm on my way to #NeurIPS2024. On Friday I'm going to present my latest paper with Yuval Benjamini. The gist is in the comments, and come chat with me to hear more!
December 11, 2024 at 12:22 AM