Jirui Qi
@jiruiqi.bsky.social
Ph.D. Candidate @GroNLP, University of Groningen #NLProc
https://betswish.github.io
[10/] The results show that post-training on merely 100 instances sharply increases the matching rate, to nearly 100% for TH and TE and to 80% for JA, but at the cost of accuracy. Post-training is thus effective at improving language matching, yet the trade-off persists.
May 30, 2025 at 1:09 PM
[9/] To see whether further training can help, we post-train Distilled-R1-7B on mini training sets of 100 or 250 instances per poorly matching language (Japanese, Thai, Telugu), yielding six post-trained LRMs. The training data are filtered and translated from LIMO.
May 30, 2025 at 1:09 PM
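For concreteness, here is a minimal sketch of what such post-training could look like with Hugging Face transformers; the data file, its field names, and the hyperparameters are illustrative assumptions, not the paper's exact recipe.

```python
# Hedged sketch: supervised post-training on a ~100-instance
# language-specific set. The JSONL file, its field names, and the
# hyperparameters are hypothetical; only the base model name is real.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")

# e.g., 100 LIMO instances filtered and translated into Japanese
ds = load_dataset("json", data_files="limo_ja_100.jsonl")["train"]

def tokenize(ex):
    # One instance = question + target-language thinking trace + answer
    text = (ex["question"] + "\n<think>\n" + ex["trace"]
            + "\n</think>\n" + ex["answer"])
    return tokenizer(text, truncation=True, max_length=4096)

ds = ds.map(tokenize, remove_columns=ds.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="r1-7b-ja", num_train_epochs=3,
                           per_device_train_batch_size=1,
                           gradient_accumulation_steps=8,
                           learning_rate=1e-5, bf16=True),
    train_dataset=ds,
    # Causal-LM collator: labels are the input ids, shifted inside the model
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```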
[8/] Complementing the heatmaps, we further analyze the actual thinking languages of the LRM and observe a clear mismatch. Moreover, all mismatched traces (i.e., red marks) fall into English or Chinese, suggesting the impact of the thinking data seen in training on the model's reasoning.
May 30, 2025 at 1:09 PM
[6/] Heatmaps of query vs. thinking language show the 32B LRM fails to generate traces in the prompted language; e.g., asked to think in FR, it defaults to EN. Prompt hacking raises language matching from 46% to 98%, but introduces a noticeable drop in accuracy.
May 30, 2025 at 1:09 PM
[5/] Overall, LRMs struggle to follow instructions to think in a user-specified language under standard prompts. Prompt hacking, which pushes LRMs to generate traces in the query language, boosts language matching but decreases accuracy; this accuracy drop shrinks as model size increases.
May 30, 2025 at 1:09 PM
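As a rough illustration, language matching can be scored by language-identifying the extracted thinking trace. This sketch assumes <think>…</think>-tagged traces and uses langdetect; it is not necessarily the paper's exact pipeline.

```python
# Hedged sketch: a possible "language matching rate" metric, assuming
# traces are wrapped in <think>...</think>. langdetect codes ("fr",
# "ja", "th", "te", ...) serve as the target labels.
import re
from langdetect import detect, LangDetectException

def thinking_language(output: str) -> str | None:
    m = re.search(r"<think>(.*?)</think>", output, flags=re.DOTALL)
    if not m:
        return None
    try:
        return detect(m.group(1))
    except LangDetectException:  # empty or undecidable trace
        return None

def matching_rate(outputs: list[str], target_lang: str) -> float:
    hits = sum(thinking_language(o) == target_lang for o in outputs)
    return hits / len(outputs)

# e.g., matching_rate(model_outputs, "fr")
# -> fraction of traces actually written in French
```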
[4/] Besides standard prompting, where the thinking language is explicitly specified in the instruction, we introduce a prompt-hacking technique to induce the LRM to generate its thinking traces in the user-expected language.
May 30, 2025 at 1:09 PM
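One way to picture the two conditions: pre-fill the assistant's thinking trace with target-language text so the model continues in that language. The prefill strings and template usage below are our illustration, not necessarily the paper's exact hack.

```python
# Hedged sketch of the two prompting conditions. The prefill strings and
# the use of continue_final_message are illustrative of "prompt hacking";
# the paper's exact wording and template may differ.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B")

PREFILL = {  # hypothetical target-language openers
    "fr": "D'accord, réfléchissons à ce problème étape par étape.",
    "ja": "では、この問題を段階的に考えてみましょう。",
}

def standard_prompt(question: str, lang_name: str) -> str:
    # Condition 1: only instruct the desired thinking language.
    msgs = [{"role": "user",
             "content": f"{question}\nPlease think in {lang_name}, then answer."}]
    return tokenizer.apply_chat_template(
        msgs, tokenize=False, add_generation_prompt=True)

def hacked_prompt(question: str, lang_name: str, lang_code: str) -> str:
    # Condition 2 ("prompt hacking"): open the assistant turn ourselves and
    # seed the trace in the target language, so generation continues in it.
    msgs = [{"role": "user",
             "content": f"{question}\nPlease think in {lang_name}, then answer."},
            {"role": "assistant", "content": "<think>\n" + PREFILL[lang_code]}]
    return tokenizer.apply_chat_template(
        msgs, tokenize=False, continue_final_message=True)
```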
[3/] We comprehensively evaluate six SOTA LRMs from two families: Distilled-R1 and Skywork-OR1. Given the lack of multilingual reasoning datasets, we introduce XReasoning, a novel benchmark covering the easy MGSM and translated versions of the challenging AIME2024, AIME2025, and GPQA_Diamond.
May 30, 2025 at 1:09 PM
[2/] Matching the thinking language matters as much as accuracy, because it makes the traces more readable and easier for users to verify. Even correct answers can feel untrustworthy if users can't understand how the model got there, especially as task complexity increases.
May 30, 2025 at 1:09 PM
[1/]💡New Paper
Large reasoning models (LRMs) are strong in English — but how well do they reason in your language?

Our latest work uncovers their limitations and a clear trade-off:
Controlling Thinking Trace Language Comes at the Cost of Accuracy

📄Link: arxiv.org/abs/2505.22888
May 30, 2025 at 1:09 PM
[7/] When distractors are included, our analysis with both accuracy and feature attribution shows that distracting passages hurt answer quality regardless of their language. However, distractors in the query language exert a slightly stronger influence.
April 11, 2025 at 4:04 PM
[5/] Detailed heatmaps further show that answer accuracy is relatively consistent within each row, more so than within each column. In other words, the query language is much more predictive of accuracy than the passage language.
April 11, 2025 at 4:04 PM
[4/] Our experiments with 4 LLMs across 3 QA datasets, covering 48 languages, reveal a surprising ability of LLMs to extract relevant information from passages in languages different from the query, but a weaker ability to formulate the answer in the correct language (shaded bars).
April 11, 2025 at 4:04 PM
[3/] Through accuracy and feature attribution analyses, we assess LLMs' ability to make consistent use of a relevant passage regardless of its language, to respond in the expected language, and to focus on relevant passages even when distractors in other languages are provided.
April 11, 2025 at 4:04 PM
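For readers curious what such an attribution analysis can look like in practice, here is a tiny sketch with the inseq library; the model, attribution method, and passages are illustrative choices, not necessarily the paper's configuration.

```python
# Hedged sketch: attributing a generated answer back to multilingual
# context passages with inseq. Model, attribution method, and passages
# are illustrative, not the paper's exact setup.
import inseq

model = inseq.load_model("google/flan-t5-base", "input_x_gradient")

prompt = (
    "Passage [fr]: La tour Eiffel se trouve à Paris.\n"
    "Passage [de]: Berlin ist die Hauptstadt von Deutschland.\n"
    "Question: Where is the Eiffel Tower?\nAnswer:"
)
out = model.attribute(input_texts=prompt)
# Token-level scores can then be aggregated per passage to compare how
# much each (relevant vs. distractor) passage drove the answer.
out.show()
```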
[2/] Multilingual RAG (mRAG) has been shown to be beneficial, particularly for low-resource languages. However, the extent to which LLMs can leverage multilingual contexts to generate accurate answers, independently of retrieval quality, remains understudied.
April 11, 2025 at 4:04 PM
✨ New Paper ✨
[1/] Retrieving passages from many languages can boost retrieval augmented generation (RAG) performance, but how good are LLMs at dealing with multilingual contexts in the prompt?

📄 Check it out: arxiv.org/abs/2504.00597
(w/ @arianna-bis.bsky.social @Raquel_Fernández)

#NLProc
April 11, 2025 at 4:04 PM
[7/8] Based on the above analysis, we propose two methods to optimize the prompt for the RAG QA task. Experimental results show that both methods boost model performance over the baseline prompt with documents in random order, supporting our hypothesis and the claims above.
January 24, 2025 at 9:56 AM
[5/8] We also observe an overlap between LM performance and question likelihood by reproducing the U-shaped accuracy curve from the Lost-in-the-Middle paper and plotting the corresponding p(question) in the same figure. Gold-answer likelihoods, too, change in sync with both.
January 24, 2025 at 9:56 AM
[4/8] Zooming in to the instance level, we find that each question is more likely to be answered correctly when its likelihood is maximized by reordering the documents, with the other prompt segments of the RAG pipeline kept unchanged.
January 24, 2025 at 9:56 AM
[3/8] Starting with corpus-level validation, we find that questions with higher likelihood in the dataset are answered better than those with medium or low likelihood.
January 24, 2025 at 9:56 AM
[2/8] LLM performance varies with changes in the prompt. However, the lack of in-depth analysis has limited the use of this phenomenon for prompt engineering in practice. In this study, we hypothesize and show that question likelihood is a good gauge of performance in RAG applications.
January 24, 2025 at 9:56 AM
[1/8] Seeking a faster approach to assess prompt quality in RAG QA? Our latest work may be a good fit for you. We find that prompt quality can be measured using question likelihoods, a computation that’s parallelizable on the input side of LLMs!
📄http://arxiv.org/abs/2411.07773
#NLProc
January 24, 2025 at 9:56 AM
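A minimal sketch of the core measurement, plus the document reordering from posts [4/8] and [7/8]; the model and prompt format are illustrative, and the paper's prompts and selection methods may differ.

```python
# Hedged sketch: log p(question | documents) from one teacher-forced
# forward pass (no decoding needed, hence cheap and input-parallel),
# plus brute-force reordering to maximize it. Model and prompt format
# are illustrative, not the paper's exact setup.
from itertools import permutations
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen2.5-0.5B-Instruct"  # small stand-in model
tok = AutoTokenizer.from_pretrained(name)
lm = AutoModelForCausalLM.from_pretrained(name, torch_dtype="auto").eval()

@torch.no_grad()
def question_loglik(docs: list[str], question: str) -> float:
    prefix = "".join(f"Document: {d}\n" for d in docs) + "Question: "
    prefix_ids = tok(prefix, return_tensors="pt").input_ids
    q_ids = tok(question, add_special_tokens=False,
                return_tensors="pt").input_ids
    ids = torch.cat([prefix_ids, q_ids], dim=1)
    logits = lm(ids).logits
    # Logits at position t predict token t+1, so the question tokens are
    # scored by the positions prefix_len-1 ... end-2.
    logps = logits.log_softmax(-1)[0, prefix_ids.size(1) - 1 : -1]
    return logps.gather(-1, q_ids[0].unsqueeze(-1)).sum().item()

def best_order(docs: list[str], question: str) -> tuple[str, ...]:
    # Feasible for a handful of documents; smarter selection is possible.
    return max(permutations(docs),
               key=lambda p: question_loglik(list(p), question))
```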