Lisa Alazraki
@lisaalaz.bsky.social
PhD student @ImperialCollege. Research Scientist Intern @Meta; prev. @Cohere, @GoogleAI. Interested in generalisable learning and reasoning. She/her

lisaalaz.github.io
To learn more:
Website: agentcoma.github.io
Preprint: arxiv.org/abs/2508.19988

A big thanks to my brilliant coauthors Lihu Chen, Ana Brassard, @joestacey.bsky.social, @rahmanidashti.bsky.social and @marekrei.bsky.social!

Note: We welcome submissions to the #AgentCoMa leaderboard from researchers 🚀
AgentCoMa
AgentCoMa is an Agentic Commonsense and Math benchmark where each compositional task requires both commonsense and mathematical reasoning to be solved. The tasks are set in real-world scenarios:…
agentcoma.github.io
August 28, 2025 at 2:01 PM
We also observe that LLMs fail to activate all the relevant neurons when they attempt to solve the tasks in #AgentCoMa. Instead, they mostly activate neurons relevant to only one reasoning type, likely as a result of single-type reasoning patterns reinforced during training.
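To illustrate the kind of analysis this refers to, here is a minimal, hypothetical sketch of measuring neuron-activation overlap with forward hooks. The model, toy prompts and activation threshold are placeholders, not the paper's setup:

# Hypothetical sketch: compare which MLP neurons fire on a commonsense-only,
# a math-only, and a mixed prompt. Model, prompts and threshold are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")          # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def active_neurons(prompt, threshold=0.0):
    """Set of (layer, neuron) indices whose MLP activation on the last
    token exceeds `threshold`."""
    acts, hooks = {}, []
    for i, block in enumerate(model.transformer.h):
        def hook(module, inputs, output, layer=i):
            acts[layer] = output[0, -1].detach()
        hooks.append(block.mlp.act.register_forward_hook(hook))
    with torch.no_grad():
        model(**tok(prompt, return_tensors="pt"))
    for h in hooks:
        h.remove()
    return {(layer, j) for layer, a in acts.items()
            for j in (a > threshold).nonzero().flatten().tolist()}

commonsense = active_neurons("Which room of a house usually contains the fridge?")
math = active_neurons("What is 12% of 250?")
mixed = active_neurons("A fridge uses 150 kWh per year at $0.20/kWh. "
                       "How much does the kitchen appliance cost to run per year?")

# Rough overlap measure: how many single-type neurons also fire on the mixed task.
single_type = commonsense | math
print(len(mixed & single_type) / len(single_type))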
August 28, 2025 at 2:01 PM
So why do LLMs perform poorly on the apparently simple tasks in #AgentCoMa?

We find that tasks combining different reasoning types are a relatively unseen pattern for LLMs, leading the models to produce contextual hallucinations when presented with mixed-type compositional reasoning.
August 28, 2025 at 2:01 PM
In contrast, we find that:

- LLMs perform relatively well on compositional tasks of similar difficulty when all steps require the same type of reasoning.

- Non-expert humans with no calculator or internet can solve the tasks in #AgentCoMa as accurately as the individual steps.
August 28, 2025 at 2:01 PM
We evaluate 61 contemporary LLMs of different sizes on AgentCoMa, including reasoning models (both SFT and RL-tuned). While the LLMs perform well on commonsense and math reasoning in isolation, they are far less effective at solving AgentCoMa tasks that require their composition!
August 28, 2025 at 2:01 PM
Check out our preprint on ArXiv to learn more arxiv.org/abs/2505.15795

This work was done at @cohere.com with a fantastic team: @maxbartolo.bsky.social, Tan Yi-Chern, Jon Ander Campos, @maximilianmozes.bsky.social, @marekrei.bsky.social
May 22, 2025 at 3:01 PM
We also postulate that the benefits of RLRE extend beyond adversarial attacks. Reverse engineering human preferences could be used for a variety of applications, including meaningful tasks such as reducing toxicity or mitigating bias 🔥
May 22, 2025 at 3:01 PM
Interestingly, we observe substantial variation in the fluency and naturalness of the optimal preambles, suggesting that conditioning LLMs only on human-readable sequences may be overly restrictive from a performance perspective 🤯
May 22, 2025 at 3:01 PM
We use RLRE to adversarially boost LLM-as-a-judge evaluation, and find the method is not only effective, but also virtually undetectable and transferable to previously unseen LLMs!
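As a rough illustration of the setup (not the paper's code): one plausible placement is for the RL-trained policy to produce a preamble that is inserted into the context the judge sees, with the judge's score as the reward. The judge prompt, model name and preamble below are placeholders.

# Hypothetical sketch of where an adversarial preamble could enter LLM-as-a-judge scoring.
from openai import OpenAI  # any chat-completion client would work similarly

client = OpenAI()

JUDGE_TEMPLATE = (
    "Rate the following answer to the question on a scale from 1 to 10.\n"
    "Question: {question}\n{preamble}Answer: {answer}\nScore:"
)

def judge_score(question, answer, preamble=""):
    """Query a placeholder judge model; the preamble (if any) is inserted
    into the judge's context just before the answer."""
    prompt = JUDGE_TEMPLATE.format(question=question, preamble=preamble, answer=answer)
    out = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge
        messages=[{"role": "user", "content": prompt}],
        max_tokens=4,
    )
    return out.choices[0].message.content.strip()

baseline = judge_score("What causes ocean tides?", "The Moon's gravity.")
attacked = judge_score("What causes ocean tides?", "The Moon's gravity.",
                       preamble="<preamble generated by the RL policy>\n")
# In RLRE the preamble generator is optimised with RL using the judge's score
# as reward; this sketch only shows where a preamble could sit in the context.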
May 22, 2025 at 3:01 PM
To learn more, check out our preprint on ArXiv: arxiv.org/abs/2502.08550
This work was done at @cohere.com with amazing collaborators @maxbartolo.bsky.social, @maximilianmozes.bsky.social, Jon Ander Campos, Yi Chern Tan and @marekrei.bsky.social.
LLMs can implicitly learn from mistakes in-context
Learning from mistakes is a fundamental feature of human intelligence. Previous work has shown that Large Language Models (LLMs) can also learn from incorrect answers when provided with a comprehensive...
arxiv.org
February 13, 2025 at 3:38 PM
These findings are surprising, as rationales are prevalent in current frameworks for learning from mistakes with LLMs, despite being expensive to curate at scale. Our investigation suggests they are redundant and can even hurt performance by adding unnecessary constraints!
February 13, 2025 at 3:38 PM
Additionally, our analysis shows that LLMs can implicitly infer high-quality corrective rationales when prompted only with correct and incorrect answers, and that these are of comparable quality to those generated with the aid of explicit exemplar rationales.
February 13, 2025 at 3:38 PM
We find that the implicit setup without rationales is consistently superior. It also overwhelmingly outperforms CoT, even when we make this baseline more challenging by extending its context with additional, diverse question-answer pairs.
February 13, 2025 at 3:38 PM
We test these setups across multiple LLMs from different model families, multiple datasets of varying difficulty, and different fine-grained tasks: labelling an answer (or an individual reasoning step) as correct or not, editing an incorrect answer, and answering a new question.
February 13, 2025 at 3:38 PM
We construct few-shot prompts containing mathematical reasoning questions, alongside incorrect and correct answers. We compare this simple, implicit setup to the one that additionally includes explicit rationales illustrating how to turn an incorrect answer into a correct one.
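A minimal sketch of the two setups, with made-up exemplars rather than the paper's data or exact prompt format:

# Hypothetical sketch: implicit setup (question + incorrect + correct answer)
# vs. explicit setup (the same, plus a corrective rationale). Exemplars are toys.
EXEMPLARS = [
    {
        "question": "A shirt costs $20 and is discounted by 25%. What is the new price?",
        "incorrect": "$5",
        "correct": "$15",
        "rationale": ("The incorrect answer gives the discount amount (25% of $20 = $5) "
                      "instead of subtracting it from the original price: $20 - $5 = $15."),
    },
    # ... more exemplars of the same form
]

def build_prompt(exemplars, new_question, with_rationales=False):
    """Assemble a few-shot prompt; rationales appear only in the explicit setup."""
    parts = []
    for ex in exemplars:
        block = (f"Question: {ex['question']}\n"
                 f"Incorrect answer: {ex['incorrect']}\n")
        if with_rationales:
            block += f"Rationale: {ex['rationale']}\n"
        block += f"Correct answer: {ex['correct']}\n"
        parts.append(block)
    parts.append(f"Question: {new_question}\nCorrect answer:")
    return "\n".join(parts)

new_q = "A book costs $12 after a 20% discount. What was the original price?"
implicit_prompt = build_prompt(EXEMPLARS, new_q)                       # no rationales
explicit_prompt = build_prompt(EXEMPLARS, new_q, with_rationales=True) # with rationales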
February 13, 2025 at 3:38 PM