o3-mini-high and DeepSeek-R1 overthink on a word frequency task! Also, incorrect answers often had longer reasoning chains than correct ones.
More reasoning ≠ better accuracy!
📉 🇺🇸 context + 🇰🇷 instructions: 91% → 71%
📈 🇰🇷 context + 🇺🇸 instructions: 67% → 75%
Instruction language matters more than expected for multilingual LLMs!
🚨 o3-mini-high underperforms on English at long contexts.
📊 Qwen2.5 > LLaMA 3.3 across all context lengths.
🚩 Non-Latin & non-Cyrillic scripts remain a challenge.
We added the option to answer "none" if the needle wasn’t in the context. 🚨 o3-mini-high especially struggled: its accuracy dropped 32% at 128K! It frequently answered "none" even when the needle was there.
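For readers unfamiliar with the setup, here is a minimal sketch of what a needle-in-a-haystack prompt with a "none" escape option and its scoring might look like. The needle wording, instruction text, and substring-match scoring below are illustrative assumptions, not the exact ONERULER templates.

```python
import random

# Hypothetical needle and instruction; ONERULER's actual templates differ.
NEEDLE = "The special magic number for the city of Amsterdam is 7421."
INSTRUCTION = (
    "A special magic number may be hidden in the text above. "
    "What is the special magic number for the city of Amsterdam? "
    'If no such number appears in the text, answer "none".'
)

def build_example(haystack_sentences, include_needle=True, seed=0):
    """Optionally insert the needle at a random position in the haystack."""
    rng = random.Random(seed)
    sentences = list(haystack_sentences)
    if include_needle:
        sentences.insert(rng.randrange(len(sentences) + 1), NEEDLE)
    gold = "7421" if include_needle else "none"
    prompt = " ".join(sentences) + "\n\n" + INSTRUCTION
    return prompt, gold

def score(model_answer, gold):
    """Count an answer as correct if the gold string appears in it."""
    return gold.lower() in model_answer.lower()
```

With `include_needle=False`, the gold answer is "none", so a model that hallucinates a needle (or, like o3-mini-high, answers "none" when the needle is actually present) gets penalized.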
At 8K tokens, the high- vs. low-resource language gap = 11%
At 128K tokens, the gap triples to 34%! 📉
LLMs struggle to generalize long-context skills across diverse languages.
🇵🇱 Polish takes the top spot, while 🇨🇳 Chinese ranks 4th from the bottom, despite forming a large proportion of pretraining data.
Slavic, Romance & Germanic languages dominate, suggesting long-context strength isn’t just about training data size!
We create ONERULER 💍, a multilingual long-context benchmark that allows for nonexistent needles. Turns out NIAH isn't so easy after all!
Our analysis across 26 languages 🧵👇