Andreas Opedal
@andreasopedal.bsky.social
PhD student at ETH Zurich & MPI-IS in NLP & ML
Language, Reasoning, and Cognition
https://opedal.github.io
All models are sensitive to a simple change in sentence ordering: taking one sentence and moving it to the beginning. We also find that the problem is easiest for LLMs when the moved sentence comes from near the beginning or end, rather than from the middle!
March 14, 2025 at 4:14 PM
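The perturbation described in the post above can be sketched as a simple list operation. This is a hypothetical illustration of the idea, not the paper's actual code:

```python
def move_to_front(sentences, i):
    """Move the i-th sentence of a word problem to the beginning,
    keeping the relative order of all other sentences."""
    moved = sentences[i]
    rest = sentences[:i] + sentences[i + 1:]
    return [moved] + rest

# Toy word problem (my own example): moving the second sentence
# to the front yields a logically equivalent but reordered problem.
problem = [
    "Alice has 3 apples.",
    "Bob has 2 more apples than Alice.",
    "How many apples does Bob have?",
]
print(move_to_front(problem, 1))
```

The thread's finding is that even this order-preserving-in-content change degrades model accuracy, with the effect depending on the position `i` the sentence is moved from.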
OpenAI’s o1 and DeepSeek-R1 are certainly impressive. However, when we permuted the ordering of the sentences, their performance dropped to 5% and 11%, respectively (with the token limit set to 25,000, as recommended by OpenAI).
March 14, 2025 at 4:14 PM
Here are the results for what we call “nonlinear” problems. Solving them requires holding intermediate results in memory until later steps can combine them for further deduction. The most complex problems are pretty hard for all models, but they are still able to solve some of them!
March 14, 2025 at 4:14 PM
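As a toy illustration of what “nonlinear” means here (my own example, not from the paper): in a linear problem each deduction step consumes only the immediately preceding result, whereas in a nonlinear one a step needs two earlier intermediate results at once, so both must be kept in memory:

```python
# Linear chain: each step uses only the previous result.
alice = 3            # "Alice has 3 apples."
bob = alice + 2      # derived from the previous fact only

# Nonlinear step: combines TWO previously derived facts,
# so both 'alice' and 'bob' must still be in memory here.
carol = alice + bob  # "Carol has as many apples as Alice and Bob together."
print(carol)         # 8
```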
With our proof system we can generate new MWPs that adhere to the structure of proof trees, as well as ground-truth CoT traces! From the proof trees we then characterize the complexity of reasoning in several ways, e.g., depth, width, shape, and ordering of nodes (i.e., sentences).
March 14, 2025 at 4:14 PM
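The tree-based complexity measures mentioned above (depth and width) can be sketched on a minimal proof-tree structure. This is hypothetical illustrative code under my own assumptions, not the MathGAP implementation:

```python
from dataclasses import dataclass, field

@dataclass
class ProofNode:
    sentence: str
    children: list = field(default_factory=list)

def depth(node):
    """Length (in edges) of the longest root-to-leaf path."""
    if not node.children:
        return 0
    return 1 + max(depth(c) for c in node.children)

def width(node):
    """Maximum number of nodes on any single level of the tree."""
    level, best = [node], 0
    while level:
        best = max(best, len(level))
        level = [c for n in level for c in n.children]
    return best

# A small tree: the root conclusion rests on two premises,
# one of which is itself derived from two leaf facts.
tree = ProofNode("conclusion", [
    ProofNode("premise A", [ProofNode("leaf 1"), ProofNode("leaf 2")]),
    ProofNode("premise B"),
])
print(depth(tree), width(tree))  # 2 2
```

Deeper trees mean longer deduction chains; wider trees mean more facts in play at once, which connects to the “nonlinear” difficulty discussed earlier in the thread.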
New #ICLR2025 paper 📣📣

We argue that to properly evaluate a model’s reasoning ability, it must be tested on problems that are harder than the ones it has already seen. Enter MathGAP, an evaluation framework for math word problems with arbitrarily complex proofs🧵

arxiv.org/abs/2410.13502
March 14, 2025 at 4:14 PM