Hafez Ghaemi
@hafezghm.bsky.social
Ph.D. Student @mila-quebec.bsky.social and @umontreal.ca, AI Researcher
Huge thanks to my supervisors and co-authors @neuralensemble.bsky.social and @shahabbakht.bsky.social!

Check out the full paper here: 📄 arxiv.org/abs/2505.03176

💻 Code coming soon!
📬 DM me if you’d like to chat or discuss the paper!

(10/10)
seq-JEPA: Autoregressive Predictive Learning of Invariant-Equivariant World Models
May 14, 2025 at 12:53 PM
Interestingly, seq-JEPA exhibits path integration – an important research problem in neuroscience. By observing a sequence of views and their corresponding actions, it can integrate the path connecting the initial view to the final view.

(9/10)
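A rough sketch of how such a path-integration readout could be probed, under my own assumptions (a frozen aggregate representation, an MSE regression onto the first-to-last transformation, illustrative sizes); this is not the paper's exact protocol:

```python
import torch.nn as nn
import torch.nn.functional as F

dim, action_dim = 512, 4   # illustrative sizes; the "action" here is a relative transformation
path_head = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, action_dim))

def path_integration_loss(agg_repr, first_to_last):
    # agg_repr: (N, dim) frozen aggregate representations of view/action sequences
    # first_to_last: (N, action_dim) ground-truth transformation from the initial to the final view
    return F.mse_loss(path_head(agg_repr), first_to_last)
```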
Thanks to action conditioning, the visual backbone encodes rotation information that can be decoded from its representations, while the transformer encoder aggregates the different rotated views, reduces intra-class variation (caused by rotations), and produces a semantic object representation.

(8/10)
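To make that concrete, the probing setup is presumably along these lines: one linear head reads the view's rotation from the frozen backbone (per-view) features, another reads the object category from the frozen aggregate (transformer) features. All names and dimensions below are illustrative, not the released evaluation code:

```python
import torch.nn as nn
import torch.nn.functional as F

dim, rot_dim, n_classes = 512, 4, 10        # hypothetical sizes (e.g. a quaternion rotation target)
rotation_probe = nn.Linear(dim, rot_dim)    # decodes rotation from per-view backbone features
category_probe = nn.Linear(dim, n_classes)  # decodes object class from aggregate features

def probe_losses(per_view_feats, agg_feats, rotations, labels):
    # per_view_feats: (N, dim) frozen backbone outputs; agg_feats: (N, dim) frozen aggregates
    rot_loss = F.mse_loss(rotation_probe(per_view_feats), rotations)
    cls_loss = F.cross_entropy(category_probe(agg_feats), labels)
    return rot_loss, cls_loss
```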
On the 3D Invariant-Equivariant Benchmark (3DIEBench), where each object view has a different rotation, seq-JEPA achieves top performance on both invariance-related object categorization and equivariance-related rotation prediction, w/o sacrificing one for the other.

(7/10)
seq-JEPA learns invariant-equivariant representations for tasks that involve sequential observations and transformations; e.g., it can learn semantic image representations by seeing a sequence of small image patches across simulated eye movements, w/o hand-crafted augmentation or masking.

(6/10)
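As a toy illustration of this setup (my own simplification: random fixation locations, an arbitrary patch size, and a 2-D displacement as the action; the paper's sampling policy may differ):

```python
import torch

def glimpse_sequence(image, num_glimpses=4, patch=32):
    # image: (C, H, W) tensor, assumed larger than the patch size
    C, H, W = image.shape
    xs = torch.randint(0, W - patch, (num_glimpses,))   # random fixation locations
    ys = torch.randint(0, H - patch, (num_glimpses,))
    patches = torch.stack([image[:, y:y + patch, x:x + patch] for x, y in zip(xs, ys)])
    # action = relative displacement ("saccade") from the previous fixation, normalized
    dx = torch.diff(xs, prepend=xs[:1]).float() / W
    dy = torch.diff(ys, prepend=ys[:1]).float() / H
    actions = torch.stack([dx, dy], dim=1)              # (num_glimpses, 2)
    return patches, actions
```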
Post-training, the model has learned two segregated representations:

An action-invariant aggregate representation
Action-equivariant individual-view representations

💡No explicit equivariance loss or dual predictor required!

(5/10)
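A minimal sketch of what "no equivariance loss, no dual predictor" means in practice, assuming a standard JEPA-style target branch (EMA encoder) and a simple MSE objective; the paper's exact loss and update may differ:

```python
import torch
import torch.nn.functional as F

def jepa_step(encode, aggregate, predict, target_encode, views, actions, target_view):
    # encode:        action-conditioned per-view encoder (backbone + action embedding)
    # aggregate:     transformer encoder pooling the sequence into one representation
    # predict:       predictor head mapping the aggregate to the unseen view's representation
    # target_encode: EMA copy of the encoder, embeds the yet-unseen target view
    z = encode(views, actions)
    s = aggregate(z)
    pred = predict(s)
    with torch.no_grad():
        tgt = target_encode(target_view)   # no gradients through the target branch
    return F.mse_loss(pred, tgt)           # a single predictive loss, nothing else

@torch.no_grad()
def ema_update(online, target, tau=0.996):
    # slowly track the online weights (standard JEPA/BYOL-style momentum update)
    for p_o, p_t in zip(online.parameters(), target.parameters()):
        p_t.lerp_(p_o, 1.0 - tau)
```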
Inspired by this, we designed seq-JEPA, which processes sequences of views and their relative transformations (actions).

➡️ A transformer encoder aggregates these action-conditioned view representations to predict a yet unseen view.

(4/10)
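A bare-bones PyTorch sketch of this design, just to make the data flow explicit; the way actions are injected (I simply add an action embedding), the mean-pooled aggregate, and all sizes are my assumptions rather than the official implementation, and where exactly the target view's action enters the predictor is glossed over. The prediction target would be an EMA-encoded unseen view, as in the step sketched earlier:

```python
import torch.nn as nn

class SeqJEPASketch(nn.Module):
    """Illustrative seq-JEPA-style module, not the official implementation."""
    def __init__(self, backbone, dim=512, action_dim=4, n_heads=8, n_layers=3):
        super().__init__()
        self.backbone = backbone                        # e.g. a ResNet mapping image -> (B, dim)
        self.action_proj = nn.Linear(action_dim, dim)   # embed relative transformations (actions)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True)
        self.aggregator = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.predictor = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, views, actions):
        # views:   (B, T, C, H, W) sequence of observed views
        # actions: (B, T, action_dim) relative transformation associated with each view
        B, T = views.shape[:2]
        z = self.backbone(views.flatten(0, 1)).view(B, T, -1)  # per-view representations
        z = z + self.action_proj(actions)                      # action-conditioned views
        agg = self.aggregator(z).mean(dim=1)                   # aggregate representation
        return z, agg, self.predictor(agg)                     # z: equivariant, agg: invariant
```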
🧠 Humans learn to recognize new objects by moving around them, manipulating them, and probing them via eye movements. Different views of a novel object are generated through actions (manipulations & eye movements) that are then integrated to form new concepts in the brain.

(3/10)
Current SSL methods face a trade-off: optimizing for transformation invariance in representation space (useful for high-level classification) often reduces equivariance (needed for tasks that depend on details like object rotation & movement). Our world model, seq-JEPA, resolves this trade-off.

(2/10)