Xing Han Lu
@xhluca.bsky.social
👨‍🍳 Web Agents @mila-quebec.bsky.social
🎒 @mcgill-nlp.bsky.social
Reposted by Xing Han Lu
Our new paper in #PNAS (bit.ly/4fcWfma) presents a surprising finding—when words change meaning, older speakers rapidly adopt the new usage; inter-generational differences are often minor.
w/ Michelle Yang, @sivareddyg.bsky.social , @msonderegger.bsky.social and @dallascard.bsky.social👇(1/12)
July 29, 2025 at 12:06 PM
Reposted by Xing Han Lu
A blizzard is raging through Montreal when your friend says “Looks like Florida out there!” Humans easily interpret irony, while LLMs struggle with it. We propose a 𝘳𝘩𝘦𝘵𝘰𝘳𝘪𝘤𝘢𝘭-𝘴𝘵𝘳𝘢𝘵𝘦𝘨𝘺-𝘢𝘸𝘢𝘳𝘦 probabilistic framework as a solution.
Paper: arxiv.org/abs/2506.09301 to appear @ #ACL2025 (Main)
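For intuition, here is a minimal sketch of the core idea, assuming a toy Bayesian listener with made-up probabilities (this is not the paper's actual model): the listener treats the rhetorical strategy as a latent variable and marginalizes over it, so a strong contextual prior (a blizzard) flips the interpretation to the ironic reading.

```python
# Toy sketch of a rhetorical-strategy-aware listener; all numbers are
# illustrative and nothing here comes from the paper's implementation.

def speaker_likelihood(utterance: str, meaning: str, strategy: str) -> float:
    """P(utterance | meaning, strategy): a literal speaker asserts the
    meaning directly; an ironic speaker asserts its opposite."""
    opposite = {"warm": "cold", "cold": "warm"}
    asserted = {"Looks like Florida out there!": "warm"}[utterance]
    target = meaning if strategy == "literal" else opposite[meaning]
    return 1.0 if asserted == target else 0.0

def interpret(utterance, meaning_prior, strategy_prior):
    """P(meaning | utterance) ∝ sum_s P(utterance | meaning, s) P(s) P(meaning)."""
    scores = {
        m: p_m * sum(speaker_likelihood(utterance, m, s) * p_s
                     for s, p_s in strategy_prior.items())
        for m, p_m in meaning_prior.items()
    }
    z = sum(scores.values())
    return {m: v / z for m, v in scores.items()}

# During a blizzard, "cold" is overwhelmingly likely a priori, so the
# ironic reading wins even though irony is rarer than literal speech.
posterior = interpret(
    "Looks like Florida out there!",
    meaning_prior={"warm": 0.05, "cold": 0.95},
    strategy_prior={"literal": 0.8, "ironic": 0.2},
)
print(posterior)  # {'warm': ~0.17, 'cold': ~0.83}
```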
June 26, 2025 at 3:52 PM
"Build the web for agents, not agents for the web"
This position paper argues that rather than forcing web agents to adapt to UIs designed for humans, we should develop a new interface optimized for web agents, which we call Agentic Web Interface (AWI).
arxiv.org/abs/2506.10953
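As a purely hypothetical sketch of the idea (none of this comes from the paper), an AWI could expose page state and valid actions as structured data, so the agent emits one validated action instead of parsing human-oriented HTML and clicking pixels:

```python
# Hypothetical shape of an agent-facing page; field names are invented
# for illustration only.
page_for_agents = {
    "state": {"query": "wireless mouse", "results": 124, "page": 1},
    "actions": [
        {"name": "open_result", "args": {"rank": "int"}},
        {"name": "next_page", "args": {}},
        {"name": "filter_results", "args": {"max_price": "float"}},
    ],
}

# The agent responds with a single structured action the site can validate.
action = {"name": "filter_results", "args": {"max_price": 25.0}}
```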
June 14, 2025 at 4:17 AM
"Build the web for agents, not agents for the web"
This position paper argues that rather than forcing web agents to adapt to UIs designed for humans, we should develop a new interface optimized for web agents, which we call Agentic Web Interface (AWI).
arxiv.org/abs/2506.10953
This position paper argues that rather than forcing web agents to adapt to UIs designed for humans, we should develop a new interface optimized for web agents, which we call Agentic Web Interface (AWI).
arxiv.org/abs/2506.10953
Reposted by Xing Han Lu
Excited to share the results of my recent internship!
We ask 🤔
What subtle shortcuts are VideoLLMs taking on spatio-temporal questions?
And how can we instead curate shortcut-robust examples at a large scale?
We release: MVPBench
Details 👇🔬
June 13, 2025 at 2:47 PM
Reposted by Xing Han Lu
Do LLMs hallucinate randomly? Not quite.
Our #ACL2025 (Main) paper shows that hallucinations under irrelevant contexts follow a systematic failure mode — revealing how LLMs generalize using abstract classes + context cues, albeit unreliably.
📎 Paper: arxiv.org/abs/2505.22630 1/n
June 6, 2025 at 6:10 PM
Reposted by Xing Han Lu
Congratulations to Mila members @adadtur.bsky.social , Gaurav Kamath and @sivareddyg.bsky.social for their SAC award at NAACL! Check out Ada's talk in Session I: Oral/Poster 6. Paper: arxiv.org/abs/2502.05670
May 1, 2025 at 2:30 PM
Reposted by Xing Han Lu
Exciting release! AgentRewardBench offers a much-needed closer look at evaluating agent capabilities: automatic vs. human eval. Important findings here, especially on popular LLM judges. Amazing work by @xhluca.bsky.social & team!
AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories
We are releasing the first benchmark to evaluate how well automatic evaluators, such as LLM judges, can evaluate web agent trajectories.
April 15, 2025 at 7:11 PM
AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories
We are releasing the first benchmark to evaluate how well automatic evaluators, such as LLM judges, can evaluate web agent trajectories.
April 15, 2025 at 7:10 PM
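The core comparison such a benchmark enables looks roughly like this (hypothetical trajectory IDs and labels, not the benchmark's actual schema): score an LLM judge's success verdicts against human annotations of the same trajectories.

```python
# Hypothetical data: did each web-agent trajectory succeed, per humans
# and per an LLM judge?
human = {"traj-1": True, "traj-2": False, "traj-3": False, "traj-4": True}
judge = {"traj-1": True, "traj-2": True,  "traj-3": False, "traj-4": True}

tp = sum(judge[t] and human[t] for t in human)        # agreed successes
fp = sum(judge[t] and not human[t] for t in human)    # judge too lenient
fn = sum(not judge[t] and human[t] for t in human)    # judge too strict

precision = tp / (tp + fp)  # how often a "success" verdict is right
recall = tp / (tp + fn)     # how many true successes the judge catches
print(f"precision={precision:.2f} recall={recall:.2f}")
```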
Reposted by Xing Han Lu
And Thoughtology is now on arXiv! Read more about R1 reasoning 🐋💭 across visual, cultural and psycholinguistic tasks at the link below:
🔗 arxiv.org/abs/2504.07128
April 11, 2025 at 4:31 PM
DeepSeek-R1 Thoughtology: Let's <think> about LLM reasoning
142-page report diving into the reasoning chains of R1. It spans 9 unique axes: safety, world modeling, faithfulness, long context, etc.
Now on arXiv: arxiv.org/abs/2504.07128
April 12, 2025 at 4:11 PM
Reposted by Xing Han Lu
Introducing the DeepSeek-R1 Thoughtology -- the most comprehensive study of R1 reasoning chains/thoughts ✨. Probably everything you need to know about R1 thoughts. If we missed something, please let us know.
Models like DeepSeek-R1 🐋 mark a fundamental shift in how LLMs approach complex problems. In our preprint on R1 Thoughtology, we study R1’s reasoning chains across a variety of tasks; investigating its capabilities, limitations, and behaviour.
🔗: mcgill-nlp.github.io/thoughtology/
April 1, 2025 at 8:12 PM
Reposted by Xing Han Lu
Models like DeepSeek-R1 🐋 mark a fundamental shift in how LLMs approach complex problems. In our preprint on R1 Thoughtology, we study R1’s reasoning chains across a variety of tasks; investigating its capabilities, limitations, and behaviour.
🔗: mcgill-nlp.github.io/thoughtology/
April 1, 2025 at 8:07 PM
Reposted by Xing Han Lu
Check out our new workshop on Actionable Interpretability @ ICML 2025. We are also looking forward to submissions that take a position on the future of interpretability research more broadly. 👇
🎉 Our Actionable Interpretability workshop has been accepted to #ICML2025! 🎉
> Follow @actinterp.bsky.social
> Website actionable-interpretability.github.io
@talhaklay.bsky.social @anja.re @mariusmosbach.bsky.social @sarah-nlp.bsky.social @iftenney.bsky.social
Paper submission deadline: May 9th!
March 31, 2025 at 6:15 PM
Reposted by Xing Han Lu
📢Excited to announce our upcoming workshop - Vision Language Models For All: Building Geo-Diverse and Culturally Aware Vision-Language Models (VLMs-4-All) @CVPR 2025!
🌐 sites.google.com/view/vlms4all
March 14, 2025 at 3:55 PM
Reposted by Xing Han Lu
Instruction-following retrievers can efficiently and accurately search for harmful and sensitive information on the internet! 🌐💣
Retrievers need to be aligned too! 🚨🚨🚨
Work done with the wonderful Nick and @sivareddyg.bsky.social
🔗 mcgill-nlp.github.io/malicious-ir/
Thread: 🧵👇
Exploiting Instruction-Following Retrievers for Malicious Information Retrieval
Parishad BehnamGhader, Nicholas Meade, Siva Reddy
mcgill-nlp.github.io
March 12, 2025 at 4:15 PM
Reposted by Xing Han Lu
Web agents powered by LLMs can solve complex tasks, but our analysis shows that they can also be easily misused to automate harmful tasks.
See the thread below for more details on our new web agent safety benchmark: SafeArena and Agent Risk Assessment framework (ARIA).
Agents like OpenAI Operator can solve complex computer tasks, but what happens when users direct them to cause harm, e.g. to spread misinformation?
To find out, we introduce SafeArena (safearena.github.io), a benchmark to assess the capabilities of web agents to complete harmful web tasks. A thread 👇
March 10, 2025 at 8:11 PM
Reposted by Xing Han Lu
The potential for malicious misuse of LLM agents is a serious threat.
That's why we created SafeArena, a safety benchmark for web agents. See the thread and our paper for details: arxiv.org/abs/2503.04957 👇
March 10, 2025 at 6:20 PM
Reposted by Xing Han Lu
Llamas browsing the web look cute, but they are capable of causing a lot of harm!
Check out our new Web Agents ∩ Safety benchmark: SafeArena!
Paper: arxiv.org/abs/2503.04957
March 10, 2025 at 5:51 PM
Agents like OpenAI Operator can solve complex computer tasks, but what happens when users direct them to cause harm, e.g. to spread misinformation?
To find out, we introduce SafeArena (safearena.github.io), a benchmark to assess the capabilities of web agents to complete harmful web tasks. A thread 👇
March 10, 2025 at 5:45 PM
Reposted by Xing Han Lu
📢New Paper Alert!🚀
Human alignment balances social expectations, economic incentives, and legal frameworks. What if LLM alignment worked the same way?🤔
Our latest work explores how social, economic, and contractual alignment can address incomplete contracts in LLM alignment🧵
March 4, 2025 at 4:08 PM
Reposted by Xing Han Lu
Check out the new MMTEB benchmark🙌 if you are looking for an extensive, reproducible and open-source evaluation of text embedders!
I am delighted to announce that we have released 🎊 MMTEB 🎊, a large-scale collaboration working on efficient multilingual evaluation of embedding models.
This work implements >500 evaluation tasks across >1000 languages and covers a wide range of use cases and domains🩺👩‍💻⚖️
February 20, 2025 at 3:44 PM
I'm fortunate to have collaborated with a team of brilliant researchers on this colossal project 🎊
Among the tasks I contributed, I'm most excited about the contextual web element retrieval task derived from WebLINX, which I think is a crucial component for building web agents!
I am delighted to announce that we have released 🎊 MMTEB 🎊, a large-scale collaboration working on efficient multilingual evaluation of embedding models.
This work implements >500 evaluation tasks across >1000 languages and covers a wide range of use cases and domains🩺👩‍💻⚖️
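Trying it out follows the mteb package's documented pattern; a minimal sketch, assuming a recent mteb release and the sentence-transformers library (the exact API may differ between versions):

```python
# Evaluate an off-the-shelf embedder on one (M)MTEB task; check the mteb
# docs, as task names and the API surface evolve between releases.
import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
tasks = mteb.get_tasks(tasks=["Banking77Classification"])
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results")
```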
February 22, 2025 at 11:21 PM
Reposted by Xing Han Lu
Presenting ✨ 𝐂𝐇𝐀𝐒𝐄: 𝐆𝐞𝐧𝐞𝐫𝐚𝐭𝐢𝐧𝐠 𝐜𝐡𝐚𝐥𝐥𝐞𝐧𝐠𝐢𝐧𝐠 𝐬𝐲𝐧𝐭𝐡𝐞𝐭𝐢𝐜 𝐝𝐚𝐭𝐚 𝐟𝐨𝐫 𝐞𝐯𝐚𝐥𝐮𝐚𝐭𝐢𝐨𝐧 ✨
Work w/ fantastic advisors Dima Bahdanau and @sivareddyg.bsky.social
Thread 🧵:
February 21, 2025 at 4:29 PM
Reposted by Xing Han Lu
I am delighted to announce that we have released 🎊 MMTEB 🎊, a large-scale collaboration working on efficient multilingual evaluation of embedding models.
This work implements >500 evaluation tasks across >1000 languages and covers a wide range of use cases and domains🩺👩💻⚖️
February 20, 2025 at 9:56 AM
Reposted by Xing Han Lu
Interested in learning more about LLM agents and in contributing to this topic?🚀
📢We're thrilled to announce REALM: The first Workshop for Research on Agent Language Models 🤖 #ACL2025NLP in Vienna 🎻
We have an exciting lineup of speakers
🗓️ Submit your work by *March 1st*
@aclmeeting.bsky.social
January 23, 2025 at 2:29 PM