Paper: “A Deep Learning Model of Mental Rotation Informed by Interactive VR Experiments” (arxiv.org/abs/2512.13517)
Thanks for reading this thread!
While questions remain (see Discussion), in this work we present a model that provides a mechanistic account of mental rotation and highlights how deep, equivariant, and symbolic representations can support spatial reasoning in artificial systems.
(Systematic ablations demonstrate the necessity of each module.)
This brings us to the third module: an MLP that sequentially predicts, from pairs of symbolic codes, similarity decisions and rotation actions to apply to the 3D latent space.
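A minimal sketch of what such a decision module could look like, as one shared hidden layer feeding a similarity head and a rotation-action head. All dimensions, the two-head design, and the action set are assumptions for illustration, not the paper's exact architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

CODE_DIM = 16   # assumed size of one symbolic code
HIDDEN = 32     # assumed hidden width
N_ACTIONS = 6   # assumed discrete rotation actions (e.g. +/-90 deg per axis)

# One hidden layer shared by two heads: a similarity decision and a rotation action.
W1 = rng.normal(scale=0.1, size=(2 * CODE_DIM, HIDDEN))
W_sim = rng.normal(scale=0.1, size=(HIDDEN, 1))
W_act = rng.normal(scale=0.1, size=(HIDDEN, N_ACTIONS))

def step(code_a, code_b):
    """One decision step: compare two symbolic codes and return
    (same-object probability, index of rotation action to apply)."""
    h = np.tanh(np.concatenate([code_a, code_b]) @ W1)
    p_same = 1.0 / (1.0 + np.exp(-(h @ W_sim)[0]))  # sigmoid similarity head
    action = int(np.argmax(h @ W_act))              # rotation-action head
    return p_same, action

p, a = step(rng.normal(size=CODE_DIM), rng.normal(size=CODE_DIM))
print(p, a)
```

In the full model this step would repeat, applying the chosen rotation to the latent and re-encoding, until the similarity head fires.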
Each object has a unique code per quadrant; mental rotation reduces to switching quadrants until alignment.
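The quadrant-switching idea can be pictured with a toy alignment loop. The four-quadrant partition and string codes here are illustrative placeholders, not the model's actual discretization:

```python
# Toy sketch: each object carries one code per viewpoint "quadrant".
# Mental rotation = stepping through quadrants until the two codes match.
codes = {  # hypothetical per-quadrant codes for one object
    0: "A0", 1: "A1", 2: "A2", 3: "A3",
}

def align(start_quadrant, target_code, max_steps=4):
    """Rotate in 90-degree steps (quadrant switches) until codes align."""
    q = start_quadrant
    for n_steps in range(max_steps):
        if codes[q] == target_code:
            return q, n_steps  # aligned after n_steps rotations
        q = (q + 1) % 4        # one rotation action = one quadrant switch
    return None, max_steps     # never aligned: likely a different object

print(align(1, "A3"))  # -> (3, 2): two quadrant switches needed
```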
We ran interactive VR experiments where participants could rotate objects using a thumbstick.
We found that participants typically take a single action to roughly align the objects before judging similarity.
The autoencoder extracts a 3D-structured latent representation from a 2D view of an object, and novel views can be synthesized by rotating the latent space.
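The "rotate the latent" idea can be illustrated with a plain rotation matrix applied to a point-cloud-style 3D latent. This is a sketch: the paper's latent need not literally be a point cloud, and the decoder that would render the rotated latent back into a 2D view is omitted:

```python
import numpy as np

def rotation_z(theta):
    """3x3 rotation matrix about the z axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0],
                     [s,  c, 0.0],
                     [0.0, 0.0, 1.0]])

# Assumed latent: N points in 3D, as an (N, 3) array produced by the encoder.
latent = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0]])

# Rotating the latent 90 degrees about z, then decoding, would yield a novel view.
rotated = latent @ rotation_z(np.pi / 2).T
print(np.round(rotated, 6))
```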
Each module handles a specific step, and together they sequentially solve the task and account for the underlying process of mental rotation.
Reaction times grew with angular difference, even for rotations in depth, suggesting that humans can mentally infer and manipulate 3D representations.
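This is the classic Shepard–Metzler-style signature: a roughly linear increase of reaction time with angular disparity, typically summarized as a slope in milliseconds per degree. A toy version of that analysis, with synthetic numbers purely for illustration (not the paper's data):

```python
import numpy as np

# Hypothetical mean reaction times (seconds) at several angular disparities (degrees).
angles = np.array([0.0, 45.0, 90.0, 135.0, 180.0])
rts = np.array([1.0, 1.5, 2.0, 2.5, 3.0])  # synthetic, perfectly linear for clarity

# Least-squares line: the slope is the extra RT cost per degree of rotation.
slope, intercept = np.polyfit(angles, rts, 1)
print(f"{slope * 1000:.1f} ms of RT per degree of rotation")
```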