https://jd730.github.io/
Unfortunately, none of the authors could attend the conference, but feel free to contact me if you have any questions!
icml.cc/virtual/2025...
📊 On MGSM, BRIDGE improves both math and language accuracy in medium- and low-resource languages.
Even better:
• It maintains performance in English.
• It succeeds where naive post-training (SFT or GRPO alone) fails, especially in math.
We also propose BRIDGE, a method that balances:
• Supervised fine-tuning for task-solving
• GRPO with a language-consistency reward for reasoning.
This decouples multilingual ability from reasoning ability.
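Here's a rough sketch of how the GRPO side's reward could fold in language consistency. Illustrative only: `langdetect` and the weight `lam` are my stand-ins, not BRIDGE's actual code.

```python
# Illustrative sketch only -- not BRIDGE's actual implementation.
# Assumes the `langdetect` package; `lam` is a made-up weighting.
from langdetect import detect, DetectorFactory
from langdetect.lang_detect_exception import LangDetectException

DetectorFactory.seed = 0  # make langdetect deterministic across runs

def language_consistency(reasoning: str, target_lang: str) -> float:
    """Fraction of non-empty reasoning lines detected as the input language."""
    lines = [l for l in reasoning.splitlines() if l.strip()]
    if not lines:
        return 0.0
    hits = 0
    for line in lines:
        try:
            hits += detect(line) == target_lang
        except LangDetectException:
            pass  # very short / ambiguous lines can't be classified
    return hits / len(lines)

def grpo_reward(answer: str, gold: str, reasoning: str,
                target_lang: str, lam: float = 0.5) -> float:
    """Task-correctness reward plus a weighted language-consistency bonus."""
    correct = float(answer.strip() == gold.strip())
    return correct + lam * language_consistency(reasoning, target_lang)
```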
GeoFact-X lets us evaluate not just what models predict, but how they think.
We measure:
• Answer correctness
• Reasoning quality
• Language consistency
Models do better on region-language-aligned pairs than on mismatched ones.
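A minimal sketch of scoring one example along these three axes. Exact match and langdetect are stand-ins for the paper's metrics; reasoning quality is stubbed out, since it typically needs an LLM judge or rubric.

```python
# Illustrative per-example scoring -- not the paper's actual harness.
from langdetect import detect

def evaluate_example(pred_answer: str, gold_answer: str,
                     reasoning: str, input_lang: str) -> dict:
    return {
        # answer correctness: naive exact match stands in for real matching
        "answer_correct": pred_answer.strip().lower() == gold_answer.strip().lower(),
        # language consistency: does the reasoning come back in the input language?
        "language_consistent": detect(reasoning) == input_lang,
        # reasoning quality: placeholder; usually scored by a judge model / rubric
        "reasoning_quality": None,
    }
```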
We introduce GeoFact-X, the first benchmark to evaluate language-consistent reasoning.
🌍 It includes multilingual CoT QA across 5 regions × 5 languages (EN, JA, SW, HI, TH) = 25 region-language pairs.
Questions are grounded in regional facts, each with step-by-step reasoning.
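The grid is just a 5 × 5 product; the region names below are placeholders, since they aren't listed here.

```python
# Placeholder regions -- this thread doesn't name GeoFact-X's actual five regions.
from itertools import product

LANGS = ["en", "ja", "sw", "hi", "th"]
REGIONS = [f"region_{i}" for i in range(1, 6)]

pairs = list(product(REGIONS, LANGS))
assert len(pairs) == 25  # 5 regions x 5 languages = 25 region-language pairs
```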
We evaluate leading LLMs (e.g., Qwen2.5, LLaMA-3, Gemma-3, DeepSeek-R1) on MGSM with native-language CoT.
🔍 Result:
Many models get the correct answer but default to English for reasoning, even when prompted otherwise.
That’s a serious misalignment.
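As a quick illustration (not the paper's evaluation code), this failure mode can be surfaced by running a language identifier over the reasoning traces:

```python
# Rough check, assuming the `langdetect` package and full-length CoT traces.
from langdetect import detect

def english_default_rate(reasoning_traces: list[str], prompt_lang: str) -> float:
    """Share of reasoning traces detected as English despite a
    non-English prompt language."""
    assert prompt_lang != "en"
    english = sum(detect(t) == "en" for t in reasoning_traces)
    return english / len(reasoning_traces)

# e.g. english_default_rate(swahili_cots, "sw") == 0.8 would mean 80% of
# Swahili-prompted traces reasoned in English. (`swahili_cots` is hypothetical.)
```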
LLMs can answer in many languages.
But do they think in them?
Even when prompted in Swahili or Thai, models often switch to English for reasoning.
This breaks interpretability and trust.
So we ask: Can LLMs reason in the input language?
Thank you all for coming; we are delighted that you enjoyed our mistakes.
We also greatly appreciate the authors of MMSearch for allowing us to use their panel.
🗓 Saturday, April 26th, 10:00 am - 12:30 pm
📍 Hall 3 (Poster #55)
jd730.github.io/projects/FAR...