arkilpatel.github.io
We study the reasoning chains of DeepSeek-R1 across a variety of tasks and find several surprising and interesting phenomena!
Incredible effort by the entire team!
🌐: mcgill-nlp.github.io/thoughtology/
w/ Michelle Yang, @sivareddyg.bsky.social, @msonderegger.bsky.social and @dallascard.bsky.social 👇 (1/12)
We are releasing the first benchmark to evaluate how well automatic evaluators, such as LLM judges, can evaluate web agent trajectories.
Retrievers need to be aligned too! 🚨🚨🚨
Work done with the wonderful Nick and @sivareddyg.bsky.social
🔗 mcgill-nlp.github.io/malicious-ir/
Thread: 🧵👇
Check out our new Web Agents ∩ Safety benchmark: SafeArena!
Paper: arxiv.org/abs/2503.04957
Work w/ fantastic advisors Dima Bahdanau and @sivareddyg.bsky.social
Thread 🧵: