Language, Reasoning, and Cognition
https://opedal.github.io
We argue that to properly evaluate a model’s reasoning ability, it must be tested on problems that are harder than the ones it has already seen. Enter MathGAP, an evaluation framework for math word problems with arbitrarily complex proofs🧵
arxiv.org/abs/2410.13502
We argue that to properly evaluate a model’s reasoning ability, it must be tested on problems that are harder than the ones it has already seen. Enter MathGAP, an evaluation framework for math word problems with arbitrarily complex proofs🧵
arxiv.org/abs/2410.13502