Yekyung Kim
@yekyung.bsky.social
PhD student @ UMass NLP
Reasoning models "overthink" simple tasks! 🤯

o3-mini-high and DeepSeek-R1 overthink on a simple word-frequency task! Incorrect answers also tended to have longer reasoning chains than correct ones.

More reasoning ≠ better accuracy!
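For a concrete picture of the task, here's a minimal sketch of how a word-frequency question can be scored. Everything here (function names, example context) is illustrative, not the paper's actual harness:

```python
from collections import Counter

def most_frequent_words(context: str, top_k: int = 1) -> list[str]:
    """Ground truth for a word-frequency task: the top-k most common words."""
    words = context.lower().split()
    return [word for word, _ in Counter(words).most_common(top_k)]

def is_correct(model_answer: str, context: str) -> bool:
    """Check the model's answer against the true most frequent word."""
    return model_answer.strip().lower() in most_frequent_words(context)

# A trivial context where counting should be easy -- yet reasoning
# models can still spend thousands of tokens deliberating over it.
context = "ring ring ring hobbit shire ring hobbit"
print(most_frequent_words(context))  # ['ring']
print(is_correct("ring", context))   # True
```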
March 5, 2025 at 5:06 PM
Instruction language shifts accuracy by up to 20%! 🏗️

📉 🇺🇸 context + 🇰🇷 instructions: 91% → 71%
📈 🇰🇷 context + 🇺🇸 instructions: 67% → 75%

Instruction language matters more than expected for multilingual LLMs!
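A minimal sketch of how mixed-language prompts like these can be assembled, assuming one instruction template per language (the templates and the build_prompt helper are illustrative, not ONERULER's actual prompts):

```python
# Hypothetical instruction templates, keyed by language code.
INSTRUCTIONS = {
    "en": "Answer the question using only the context above.",
    "ko": "위의 문맥만 사용하여 질문에 답하세요.",  # Korean version of the same instruction
}

def build_prompt(context: str, question: str, instruction_lang: str) -> str:
    """Pair a context in one language with instructions in another."""
    return f"{context}\n\n{INSTRUCTIONS[instruction_lang]}\n{question}"

# English context + Korean instructions: the combination that fell 91% -> 71%.
english_context = "Lots of English filler text. The special magic number is 7421."
question = "What is the special magic number?"
print(build_prompt(english_context, question, instruction_lang="ko"))
```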
March 5, 2025 at 5:06 PM
🏆 Gemini 1.5 Flash shines in Sesotho & Swahili, but struggles on non-Latin scripts like Chinese, Korean, and Hindi.
🚨 o3-mini-high underperforms on English at long contexts.
📊 Qwen2.5 > Llama 3.3 across all context lengths.
🚩 Non-Latin & non-Cyrillic scripts remain a challenge.
March 5, 2025 at 5:06 PM
The "nonexistent needle" problem 🪡

We added the option to answer "none" if the needle wasn't in the context. 🚨 o3-mini-high especially struggled: its accuracy dropped by 32% at 128K! It frequently answered "none" even when the needle was there.
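A minimal sketch of what scoring can look like once "none" is a legal answer (hypothetical names; the actual matching rules may differ):

```python
def score_answer(model_answer: str, true_needle: str | None) -> bool:
    """Score a NIAH answer when "none" is a permitted response.

    true_needle is None when no needle was inserted into the context.
    """
    answer = model_answer.strip().lower()
    if true_needle is None:
        return answer == "none"           # no needle: only "none" is correct
    return true_needle.lower() in answer  # needle present: it must be recovered

# The failure mode above: the needle is present, but the model says "none".
print(score_answer("none", true_needle="7421"))                 # False
print(score_answer("The number is 7421.", true_needle="7421"))  # True
print(score_answer("none", true_needle=None))                   # True
```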
March 5, 2025 at 5:06 PM
Performance gaps grow with context length! ⏳

At 8K tokens, the high- vs. low-resource language gap is 11%.
At 128K tokens, the gap triples to 34%! 📉

LLMs struggle to generalize long-context skills across diverse languages.
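The gap here is just the difference in mean accuracy between the two language groups at each context length. A quick sketch with made-up per-group accuracies, chosen only so the arithmetic matches the reported gaps:

```python
# Illustrative mean accuracies per group (not the paper's actual numbers).
accuracy = {
    8_000:   {"high_resource": 0.85, "low_resource": 0.74},
    128_000: {"high_resource": 0.70, "low_resource": 0.36},
}

for length, groups in accuracy.items():
    gap = groups["high_resource"] - groups["low_resource"]
    print(f"{length // 1000}K tokens: gap = {gap:.0%}")
# 8K tokens: gap = 11%
# 128K tokens: gap = 34%
```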
March 5, 2025 at 5:06 PM
English ranks only 6th! 🤯

🇵🇱 Polish takes the top spot, while 🇨🇳 Chinese ranks 4th from the bottom, despite making up a large share of pretraining data.

Slavic, Romance & Germanic languages dominate, suggesting long-context strength isn’t just about training data size!
March 5, 2025 at 5:06 PM
Is the needle-in-a-haystack test still meaningful given the giant green heatmaps in modern LLM papers?

We created ONERULER 💍, a multilingual long-context benchmark that allows for nonexistent needles. Turns out NIAH isn't so easy after all!

Our analysis across 26 languages 🧵👇
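For readers new to NIAH: the test hides a "needle" sentence inside a long filler "haystack" and asks the model to retrieve it. A minimal sketch of an example builder, extended with ONERULER's nonexistent-needle case (all strings and names here are hypothetical):

```python
import random

def make_niah_example(
    filler_sentences: list[str], needle: str, insert_needle: bool
) -> tuple[str, str | None]:
    """Build a NIAH context; with insert_needle=False the right answer is "none"."""
    sentences = list(filler_sentences)
    answer = None
    if insert_needle:
        # Hide the needle at a random position in the haystack.
        sentences.insert(random.randrange(len(sentences) + 1), needle)
        answer = needle
    return " ".join(sentences), answer

filler = ["The grass was green and the sky was blue."] * 1000
context, answer = make_niah_example(
    filler, "The special magic number is 7421.", insert_needle=False
)
print(answer)  # None -> the model should answer "none"
```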
March 5, 2025 at 5:06 PM