Ray
@rajky.bsky.social
PhD Researcher at Aalto - rkhz.github.io
Work by: Raymond Khazoum (@rajky.bsky.social), Daniela Fernandes, Aleksandr Krylov, Qin Li, and Stéphane Deny (@stphtphsn.bsky.social)


Paper: “A Deep Learning Model of Mental Rotation Informed by Interactive VR Experiments” (arxiv.org/abs/2512.13517)

Thanks for reading this thread!
December 18, 2025 at 9:46 AM
To conclude:
While questions remain (see Discussion), in this work we present a model that provides a mechanistic account of mental rotation and highlights how deep, equivariant, and symbolic representations can support spatial reasoning in artificial systems.
December 18, 2025 at 9:46 AM
We validate the model by comparing it to human behavior: the model achieves 96% accuracy on the mental rotation task and replicates the minimal number of actions taken by humans in interactive VR experiments.

(Systematic ablations demonstrate the necessity of each module.)
December 18, 2025 at 9:46 AM
Now that we’ve established the symbolic engine, how does our model use it to solve mental rotation?

This brings us to the third module: an MLP that sequentially predicts, from pairs of symbolic codes, similarity decisions and rotation actions to apply to the 3D latent space.
December 18, 2025 at 9:46 AM
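A minimal sketch of what such a decision-and-action module could look like (the shapes, action set, and layer sizes below are illustrative assumptions, not the paper's exact architecture):

```python
# Minimal sketch (assumed shapes/action set, not the paper's exact module): an MLP
# takes a pair of symbolic codes and outputs (i) a same/different decision and
# (ii) a rotation action to apply to the 3D latent space; at inference this is
# applied sequentially until a confident decision is reached.
import torch
import torch.nn as nn

N_SYMBOLS, N_ACTIONS = 24, 6   # illustrative: e.g. +/-90 deg about each axis

class DecisionActionMLP(nn.Module):
    def __init__(self, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * N_SYMBOLS, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.decision = nn.Linear(hidden, 2)         # same vs. different
        self.action = nn.Linear(hidden, N_ACTIONS)   # which rotation to apply next

    def forward(self, code_ref, code_probe):         # one-hot codes, (B, N_SYMBOLS) each
        h = self.net(torch.cat([code_ref, code_probe], dim=-1))
        return self.decision(h), self.action(h)

# At each step the predicted action would rotate the probe's latent (module 1),
# a new symbolic code would be read out (module 2), and the loop repeats.
```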
Based on our VR results, we propose the Quadrant Hypothesis: objects are mentally placed into a visual "quadrant," abstracting the object's pose to quadrant membership.

Each object has a unique code per quadrant; mental rotation reduces to switching quadrants until alignment.
December 18, 2025 at 9:46 AM
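A toy illustration of this idea (purely illustrative; the code table and alignment loop below are hypothetical, not the paper's implementation):

```python
# Toy illustration of the Quadrant Hypothesis: pose is abstracted to quadrant
# membership, each (object, quadrant) pair maps to one symbolic code, and
# aligning two views means switching quadrants until the codes match.
CODES = {("obj_A", q): f"A{q}" for q in range(4)}   # hypothetical code table

def align(obj, quadrant_ref, quadrant_probe, max_steps=4):
    """Rotate the probe quadrant-by-quadrant until its code matches the reference."""
    steps = 0
    while CODES[(obj, quadrant_probe)] != CODES[(obj, quadrant_ref)] and steps < max_steps:
        quadrant_probe = (quadrant_probe + 1) % 4    # one rotation action = one quadrant switch
        steps += 1
    return steps                                     # number of mental-rotation actions

print(align("obj_A", quadrant_ref=0, quadrant_probe=3))  # -> 1 action
```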
But wait a minute! How did we come up with this symbolic description?

We ran interactive VR experiments where participants could rotate objects using a thumbstick.

We found that participants typically take a single action to roughly align the objects before judging similarity.
December 18, 2025 at 9:46 AM
For the second module, we propose the Vision-Symbolic Model, an attention-based architecture that converts the 3D representation from the first module into a symbolic description of the object.
December 18, 2025 at 9:46 AM
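A minimal sketch of one way an attention-based readout from a 3D latent to a discrete symbol could be wired up (the class name, sizes, and single-query design are assumptions for illustration, not the paper's architecture):

```python
# Minimal sketch (assumed architecture): a learned query cross-attends over the
# flattened 3D latent volume and emits logits over a discrete symbol vocabulary.
import torch
import torch.nn as nn

class VisionSymbolicSketch(nn.Module):
    def __init__(self, channels=64, n_symbols=24, n_heads=4):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, channels))
        self.attn = nn.MultiheadAttention(channels, n_heads, batch_first=True)
        self.to_symbol = nn.Linear(channels, n_symbols)

    def forward(self, z):                        # z: (B, C, D, H, W) latent volume
        B = z.shape[0]
        tokens = z.flatten(2).transpose(1, 2)    # (B, D*H*W, C) voxel tokens
        q = self.query.expand(B, -1, -1)         # one readout query per item
        pooled, _ = self.attn(q, tokens, tokens)
        return self.to_symbol(pooled.squeeze(1)) # (B, n_symbols) symbol logits

# model = VisionSymbolicSketch()
# logits = model(torch.randn(2, 64, 8, 8, 8))    # discrete code = logits.argmax(-1)
```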
For the first module, we adopt the Equivariant Neural Renderer proposed by Dupont et al. (2020, arxiv.org/abs/2006.07630).

The autoencoder extracts a 3D-structured latent representation from a 2D view of an object, and novel views can be synthesized by rotating the latent space.
December 18, 2025 at 9:46 AM
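A minimal sketch of the key idea, resampling a 3D-structured latent under a rotation so that the decoded view rotates accordingly (the helper below and the encoder/decoder interface are assumptions for illustration; see Dupont et al. for the actual model):

```python
# Minimal sketch (not the authors' code): rotating a 3D-structured latent volume,
# in the spirit of the Equivariant Neural Renderer (Dupont et al., 2020).
import torch
import torch.nn.functional as F

def rotate_latent(z: torch.Tensor, R: torch.Tensor) -> torch.Tensor:
    """Resample a latent volume z (B, C, D, H, W) under a 3x3 rotation matrix R."""
    B = z.shape[0]
    theta = torch.zeros(B, 3, 4, dtype=z.dtype, device=z.device)
    theta[:, :, :3] = R.T  # inverse rotation maps output coords to input coords
    grid = F.affine_grid(theta, size=list(z.shape), align_corners=False)
    return F.grid_sample(z, grid, align_corners=False)

# Assumed usage with the renderer's learned networks:
#   z = encoder(image)                 # (B, C, D, H, W) 3D-structured latent
#   z_rot = rotate_latent(z, R)        # rotate the scene representation
#   novel_view = decoder(z_rot)        # render the rotated object
```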
Guided by prior studies of mental rotation and our interactive VR version of the task, we designed a model made of three stacked modules.

Each module handles a specific step; together, they sequentially solve the task and account for the underlying process of mental rotation.
December 18, 2025 at 9:46 AM
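A high-level sketch of how three such modules might be chained at inference time (the interfaces and control flow below are assumptions for illustration, not the paper's code):

```python
# Assumed control flow: encode views into 3D latents (module 1), read out symbolic
# codes (module 2), then either decide or rotate and try again (module 3).
def mental_rotation(view_ref, view_probe, encoder, readout, decide_and_act,
                    max_steps=4):
    z_ref, z_probe = encoder(view_ref), encoder(view_probe)     # module 1
    for _ in range(max_steps):
        codes = (readout(z_ref), readout(z_probe))              # module 2
        decision, rotation = decide_and_act(*codes)             # module 3
        if decision is not None:
            return decision                                     # "same" / "different"
        z_probe = rotation(z_probe)                             # apply predicted rotation
    return "different"                                          # fallback after max_steps
```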
A seminal investigation by Shepard & Metzler (1971, jstor.org/stable/1731476) asked subjects to judge the similarity of two 3D shapes from different views.

Reaction times grew with angular differences, even for depth rotations, suggesting humans can mentally infer and manipulate 3D representations.
December 18, 2025 at 9:46 AM
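(For context, not from this thread: the classic finding is that reaction time grows roughly linearly with the angular disparity θ between the two views, RT ≈ a + b·θ, consistent with rotating an internal 3D representation at a steady rate.)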