Lisa Alazraki
@lisaalaz.bsky.social
PhD student @ImperialCollege. Research Scientist Intern @Meta; prev. @Cohere, @GoogleAI. Interested in generalisable learning and reasoning. She/her

lisaalaz.github.io
To learn more:
Website: agentcoma.github.io
Preprint: arxiv.org/abs/2508.19988

A big thanks to my brilliant coauthors Lihu Chen, Ana Brassard, @joestacey.bsky.social, @rahmanidashti.bsky.social and @marekrei.bsky.social!

Note: We welcome submissions to the #AgentCoMa leaderboard from researchers 🚀
AgentCoMa
AgentCoMa is an Agentic Commonsense and Math benchmark where each compositional task requires both commonsense and mathematical reasoning to be solved. The tasks are set in real-world scenarios:…
agentcoma.github.io
August 28, 2025 at 2:01 PM
We also observe that LLMs fail to activate all the relevant neurons when they attempt to solve the tasks in #AgentCoMa. Instead, they mostly activate neurons relevant to only one reasoning type, likely as a result of single-type reasoning patterns reinforced during training.
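To illustrate the kind of analysis this refers to, here is a minimal, hypothetical sketch of measuring neuron-activation overlap with forward hooks. The model, toy prompts and activation threshold are placeholders, not the paper's setup:

# Hypothetical sketch: compare which MLP neurons fire on a commonsense-only,
# a math-only, and a mixed prompt. Model, prompts and threshold are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")          # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def active_neurons(prompt, threshold=0.0):
    """Set of (layer, neuron) indices whose MLP activation on the last
    token exceeds `threshold`."""
    acts, hooks = {}, []
    for i, block in enumerate(model.transformer.h):
        def hook(module, inputs, output, layer=i):
            acts[layer] = output[0, -1].detach()
        hooks.append(block.mlp.act.register_forward_hook(hook))
    with torch.no_grad():
        model(**tok(prompt, return_tensors="pt"))
    for h in hooks:
        h.remove()
    return {(layer, j) for layer, a in acts.items()
            for j in (a > threshold).nonzero().flatten().tolist()}

commonsense = active_neurons("Which room of a house usually contains the fridge?")
math = active_neurons("What is 12% of 250?")
mixed = active_neurons("A fridge uses 150 kWh per year at $0.20/kWh. "
                       "How much does the kitchen appliance cost to run per year?")

# Rough overlap measure: how many single-type neurons also fire on the mixed task.
single_type = commonsense | math
print(len(mixed & single_type) / len(single_type))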
August 28, 2025 at 2:01 PM
So why do LLMs perform poorly on the apparently simple tasks in #AgentCoMa?

We find that tasks combining different reasoning types are a relatively unseen pattern for LLMs, leading the models to produce contextual hallucinations when presented with mixed-type compositional reasoning.
August 28, 2025 at 2:01 PM
In contrast, we find that:

- LLMs perform relatively well on compositional tasks of similar difficulty when all steps require the same type of reasoning.

- Non-expert humans with no calculator or internet can solve the tasks in #AgentCoMa as accurately as the individual steps.
August 28, 2025 at 2:01 PM
We evaluate 61 contemporary LLMs of different sizes on AgentCoMa, including reasoning models (both SFT and RL-tuned). While the LLMs perform well on commonsense and math reasoning in isolation, they are far less effective at solving AgentCoMa tasks that require their composition!
August 28, 2025 at 2:01 PM
Check out our preprint on ArXiv to learn more arxiv.org/abs/2505.15795

This work was done at @cohere.com with a fantastic team: @maxbartolo.bsky.social, Tan Yi-Chern, Jon Ander Campos, @maximilianmozes.bsky.social, @marekrei.bsky.social
May 22, 2025 at 3:01 PM
We also postulate that the benefits of RLRE extend beyond adversarial attacks. Reverse engineering human preferences could be used for a variety of applications, including meaningful tasks such as reducing toxicity or mitigating bias 🔥
May 22, 2025 at 3:01 PM
Interestingly, we observe substantial variation in the fluency and naturalness of the optimal preambles, suggesting that conditioning LLMs only on human-readable sequences may be overly restrictive from a performance perspective 🤯
May 22, 2025 at 3:01 PM
We use RLRE to adversarially boost LLM-as-a-judge evaluation, and find the method is not only effective, but also virtually undetectable and transferable to previously unseen LLMs!
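As a rough illustration of the setup (not the paper's code): one plausible placement is for the RL-trained policy to produce a preamble that is inserted into the context the judge sees, with the judge's score as the reward. The judge prompt, model name and preamble below are placeholders.

# Hypothetical sketch of where an adversarial preamble could enter LLM-as-a-judge scoring.
from openai import OpenAI  # any chat-completion client would work similarly

client = OpenAI()

JUDGE_TEMPLATE = (
    "Rate the following answer to the question on a scale from 1 to 10.\n"
    "Question: {question}\n{preamble}Answer: {answer}\nScore:"
)

def judge_score(question, answer, preamble=""):
    """Query a placeholder judge model; the preamble (if any) is inserted
    into the judge's context just before the answer."""
    prompt = JUDGE_TEMPLATE.format(question=question, preamble=preamble, answer=answer)
    out = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge
        messages=[{"role": "user", "content": prompt}],
        max_tokens=4,
    )
    return out.choices[0].message.content.strip()

baseline = judge_score("What causes ocean tides?", "The Moon's gravity.")
attacked = judge_score("What causes ocean tides?", "The Moon's gravity.",
                       preamble="<preamble generated by the RL policy>\n")
# In RLRE the preamble generator is optimised with RL using the judge's score
# as reward; this sketch only shows where a preamble could sit in the context.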
May 22, 2025 at 3:01 PM
To learn more, check out our preprint on ArXiv: arxiv.org/abs/2502.08550
This work was done at @cohere.com with amazing collaborators @maxbartolo.bsky.social, @maximilianmozes.bsky.social, Jon Ander Campos, Yi Chern Tan and @marekrei.bsky.social.
LLMs can implicitly learn from mistakes in-context
Learning from mistakes is a fundamental feature of human intelligence. Previous work has shown that Large Language Models (LLMs) can also learn from incorrect answers when provided with a comprehensive...
arxiv.org
February 13, 2025 at 3:38 PM
These findings are surprising, as rationales are prevalent in current frameworks for learning from mistakes with LLMs, despite being expensive to curate at scale. Our investigation suggests they are redundant and can even hurt performance by adding unnecessary constraints!
February 13, 2025 at 3:38 PM
Additionally, our analysis shows that LLMs can implicitly infer high-quality corrective rationales when prompted only with correct and incorrect answers, and that these are of comparable quality to those generated with the aid of explicit exemplar rationales.
February 13, 2025 at 3:38 PM
We find that the implicit setup without rationales is consistently superior. It also overwhelmingly outperforms CoT, even when we make this baseline more challenging by extending its context with additional, diverse question-answer pairs.
February 13, 2025 at 3:38 PM
We test these setups across multiple LLMs from different model families, multiple datasets of varying difficulty, and different fine-grained tasks: labelling an answer (or an individual reasoning step) as correct or not, editing an incorrect answer, and answering a new question.
February 13, 2025 at 3:38 PM
We construct few-shot prompts containing mathematical reasoning questions, alongside incorrect and correct answers. We compare this simple, implicit setup to the one that additionally includes explicit rationales illustrating how to turn an incorrect answer into a correct one.
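A minimal sketch of the two setups, with made-up exemplars rather than the paper's data or exact prompt format:

# Hypothetical sketch: implicit setup (question + incorrect + correct answer)
# vs. explicit setup (the same, plus a corrective rationale). Exemplars are toys.
EXEMPLARS = [
    {
        "question": "A shirt costs $20 and is discounted by 25%. What is the new price?",
        "incorrect": "$5",
        "correct": "$15",
        "rationale": ("The incorrect answer gives the discount amount (25% of $20 = $5) "
                      "instead of subtracting it from the original price: $20 - $5 = $15."),
    },
    # ... more exemplars of the same form
]

def build_prompt(exemplars, new_question, with_rationales=False):
    """Assemble a few-shot prompt; rationales appear only in the explicit setup."""
    parts = []
    for ex in exemplars:
        block = (f"Question: {ex['question']}\n"
                 f"Incorrect answer: {ex['incorrect']}\n")
        if with_rationales:
            block += f"Rationale: {ex['rationale']}\n"
        block += f"Correct answer: {ex['correct']}\n"
        parts.append(block)
    parts.append(f"Question: {new_question}\nCorrect answer:")
    return "\n".join(parts)

new_q = "A book costs $12 after a 20% discount. What was the original price?"
implicit_prompt = build_prompt(EXEMPLARS, new_q)                       # no rationales
explicit_prompt = build_prompt(EXEMPLARS, new_q, with_rationales=True) # with rationales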
February 13, 2025 at 3:38 PM