We’re great at evaluating text-based reasoning (MATH, AIME…) but what about long-horizon visual reasoning?
Enter 𝗞𝗻𝗼𝘁𝗚𝘆𝗺: a minimalistic testbed for evaluating agents on spatial reasoning along a ladder of increasing difficulty.