o3-mini-high and DeepSeek-R1 overthink on a word frequency task! Also, incorrect answers often had longer reasoning chains than correct ones.
More reasoning ≠ better accuracy!
📉 🇺🇸 context + 🇰🇷 instructions: 91% → 71%
📈 🇰🇷 context + 🇺🇸 instructions: 67% → 75%
Instruction language matters more than expected for multilingual LLMs!
🚨 o3-mini-high underperforms on English at long contexts.
📊 Qwen2.5 > LLaMA 3.3 across all context lengths.
🚩 Non-Latin & non-Cyrillic scripts remain a challenge.
We added the option to answer "none" if the needle wasn’t in the context. 🚨 o3-mini-high especially struggled: its accuracy dropped 32% at 128K! It frequently answered "none" even when the needle was there.
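For readers unfamiliar with the setup, here is a minimal sketch of what a needle-in-a-haystack prompt with a "none" escape option and its scoring might look like. The needle wording, instruction text, and substring-match scoring below are illustrative assumptions, not the exact ONERULER templates.

```python
import random

# Hypothetical needle and instruction; ONERULER's actual templates differ.
NEEDLE = "The special magic number for the city of Amsterdam is 7421."
INSTRUCTION = (
    "A special magic number may be hidden in the text above. "
    "What is the special magic number for the city of Amsterdam? "
    'If no such number appears in the text, answer "none".'
)

def build_example(haystack_sentences, include_needle=True, seed=0):
    """Optionally insert the needle at a random position in the haystack."""
    rng = random.Random(seed)
    sentences = list(haystack_sentences)
    if include_needle:
        sentences.insert(rng.randrange(len(sentences) + 1), NEEDLE)
    gold = "7421" if include_needle else "none"
    prompt = " ".join(sentences) + "\n\n" + INSTRUCTION
    return prompt, gold

def score(model_answer, gold):
    """Count an answer as correct if the gold string appears in it."""
    return gold.lower() in model_answer.lower()
```

With `include_needle=False`, the gold answer is "none", so a model that hallucinates a needle (or, like o3-mini-high, answers "none" when the needle is actually present) gets penalized.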
At 8K tokens, the high- vs. low-resource language gap = 11%
At 128K tokens, the gap triples to 34%! 📉
LLMs struggle to generalize long-context skills across diverse languages.
🇵🇱 Polish takes the top spot, while 🇨🇳 Chinese ranks 4th from the bottom, despite forming a large proportion of pretraining data.
Slavic, Romance & Germanic languages dominate, suggesting long-context strength isn’t just about training data size!
We create ONERULER 💍, a multilingual long-context benchmark that allows for nonexistent needles. Turns out NIAH isn't so easy after all!
Our analysis across 26 languages 🧵👇