Jaedong Hwang
@jaedonghwang.bsky.social
PhD Student @MITEECS
https://jd730.github.io/
10/10 This work was a wonderful collaboration with Kumar Tanmay, Seok-Jin Lee, Ayush Agrawal, Hamid Palangi, Kumar Ayush, Ila Fiete, and Paul Pu Liang.
📘 Paper: arxiv.org/pdf/2507.05418
🌐 Project: jd730.github.io/projects/Geo...
#LLM #MultilingualAI #Reasoning #NLP #AI #LanguageModels
July 15, 2025 at 3:46 PM
9/10
This matters:
✔️ For global inclusivity
✔️ For users who expect interpretable reasoning in their native language
✔️ For fair multilingual evaluation
🧠 LLMs shouldn’t just give the right answer—they should think in your language.
July 15, 2025 at 3:44 PM
8/10
📊 On MGSM, BRIDGE improves both math and language accuracy in medium- and low-resource languages.
Even better:
• It maintains performance in English
• It succeeds where naive post-training fails, and where SFT or GRPO alone fall short (especially in math).
July 15, 2025 at 3:44 PM
7/10
We also propose BRIDGE, a method that balances:
• Supervised fine-tuning for task-solving
• GRPO with a language-consistency reward on the reasoning trace.
This decouples multilingual ability from reasoning ability (a rough sketch of such a reward follows below).
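A minimal sketch of what a language-consistency reward term could look like, assuming a generic language-ID helper (langdetect) and a made-up blending weight; this is an illustration, not the paper's exact reward:

```python
# Hypothetical sketch, not BRIDGE's exact reward: score 1 when the detected
# language of the reasoning trace matches the question language, 0 otherwise,
# then blend with task correctness for the GRPO objective.
from langdetect import detect  # assumed language-ID dependency

def language_consistency_reward(question_lang: str, reasoning_text: str) -> float:
    """1.0 if the chain-of-thought is written in the question's language, else 0.0."""
    try:
        return 1.0 if detect(reasoning_text) == question_lang else 0.0
    except Exception:  # detection can fail on very short or mixed text
        return 0.0

def total_reward(correct: bool, question_lang: str, reasoning_text: str,
                 lam: float = 0.5) -> float:
    """Blend correctness with language consistency; lam is an illustrative weight."""
    r_task = 1.0 if correct else 0.0
    r_lang = language_consistency_reward(question_lang, reasoning_text)
    return (1.0 - lam) * r_task + lam * r_lang
```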
July 15, 2025 at 3:43 PM
6/10
GeoFact-X lets us evaluate not just what models predict, but how they think.
We measure:
• Answer correctness
• Reasoning quality
• Language consistency
Models do better on region-language aligned pairs vs. mismatched ones.
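As one illustration of the aligned-vs-mismatched comparison, here is a small sketch that buckets evaluation results by whether the question's region matches the response language; the field names and region-to-language map are placeholders, not the benchmark's actual labels:

```python
# Illustrative only: split per-example results into region-language aligned vs.
# mismatched buckets and report mean accuracy for each. The schema and region
# names below are assumptions for this sketch.
from collections import defaultdict

REGION_TO_LANG = {"US": "en", "Japan": "ja", "Kenya": "sw", "India": "hi", "Thailand": "th"}

def aligned_vs_mismatched_accuracy(results):
    """results: iterable of dicts with 'region', 'language', 'correct' keys (assumed)."""
    buckets = defaultdict(list)
    for r in results:
        key = "aligned" if REGION_TO_LANG.get(r["region"]) == r["language"] else "mismatched"
        buckets[key].append(1.0 if r["correct"] else 0.0)
    return {k: sum(v) / len(v) for k, v in buckets.items()}
```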
July 15, 2025 at 3:41 PM
5/10
We introduce GeoFact-X, the first benchmark to evaluate language-consistent reasoning.
🌍 It includes multilingual CoT QA across 5 regions × 5 languages (EN, JA, SW, HI, TH) = 25 region-language pairs.
Questions are grounded in regional facts, each with step-by-step reasoning.
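Roughly, each item can be pictured as a record like the one below (an illustrative schema, not the released format):

```python
# Illustrative record only; field names are assumptions, not the released schema.
example_item = {
    "region": "Japan",                   # one of the 5 regions (placeholder label)
    "language": "ja",                    # one of EN, JA, SW, HI, TH
    "question": "...",                   # factual question grounded in the region
    "reasoning_steps": ["...", "..."],   # step-by-step reasoning in the same language
    "answer": "...",                     # final answer in the same language
}
```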
July 15, 2025 at 3:40 PM
4/10
We evaluate leading LLMs (e.g., Qwen2.5, LLaMA-3, Gemma-3, DeepSeek-R1) on MGSM with native-language CoT.
🔍 Result:
Many models get the correct answer but default to English for reasoning, even when prompted otherwise.
That’s a serious misalignment.
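A toy version of this check, assuming a generic language-ID helper; the prompt wording and helper names are made up for illustration:

```python
# Hypothetical sketch: ask for native-language reasoning, then flag responses
# whose chain-of-thought comes back in English anyway.
from langdetect import detect  # assumed language-ID dependency

PROMPT = ("Solve the problem below. Write every reasoning step "
          "and the final answer in {language}.\n\nProblem: {question}")

def reasoning_defaults_to_english(reasoning_text: str, target_lang: str) -> bool:
    """True if the model was asked to reason in target_lang but reasoned in English."""
    return target_lang != "en" and detect(reasoning_text) == "en"
```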
July 15, 2025 at 3:40 PM
3/10
Existing multilingual benchmarks (e.g., MGSM, MMLU-ProX) only evaluate whether the final answer is correct in the target language.
They don’t measure whether the reasoning process (CoT) is in the same language.
That gap matters for transparency, fairness, and inclusivity.
July 15, 2025 at 3:39 PM
2/10
Today’s LLMs are multilingual-ish.
They often generate answers in the input language, but their reasoning steps (chain-of-thought) default to English, especially after post-training on English data.
July 15, 2025 at 3:39 PM
If I remember correctly, that was also the first CV conference with over 1000 papers, and people already felt overwhelmed. Now, CVPR 2025 has 2800+ papers, and #NeurIPS2024 had 4497. It’s becoming nearly impossible to discover hidden gems while wandering poster sessions. 2/2
June 12, 2025 at 12:26 AM