Hafez Ghaemi
@hafezghm.bsky.social
Ph.D. Student @mila-quebec.bsky.social and @umontreal.ca, AI Researcher
Huge thanks to my supervisors and co-authors @neuralensemble.bsky.social and @shahabbakht.bsky.social!

Check out the full paper here: 📄 arxiv.org/abs/2505.03176

💻 Code coming soon!
📬 DM me if you’d like to chat or discuss the paper!

(10/10)
seq-JEPA: Autoregressive Predictive Learning of Invariant-Equivariant World Models
May 14, 2025 at 12:53 PM
Interestingly, seq-JEPA exhibits path integration – an important research problem in neuroscience. By observing a sequence of views and their corresponding actions, it can integrate the path connecting the initial view to the final view.

(9/10)
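A rough sketch of how such a path-integration readout could be probed, under my own assumptions (a frozen aggregate representation, an MSE regression onto the first-to-last transformation, illustrative sizes); this is not the paper's exact protocol:

```python
import torch.nn as nn
import torch.nn.functional as F

dim, action_dim = 512, 4   # illustrative sizes; the "action" here is a relative transformation
path_head = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, action_dim))

def path_integration_loss(agg_repr, first_to_last):
    # agg_repr: (N, dim) frozen aggregate representations of view/action sequences
    # first_to_last: (N, action_dim) ground-truth transformation from the initial to the final view
    return F.mse_loss(path_head(agg_repr), first_to_last)
```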
Thanks to action conditioning, the visual backbone encodes rotation information that can be decoded from its representations, while the transformer encoder aggregates the different rotated views, reduces intra-class variation (caused by rotations), and produces a semantic object representation.

(8/10)
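To make that concrete, the probing setup is presumably along these lines: one linear head reads the view's rotation from the frozen backbone (per-view) features, another reads the object category from the frozen aggregate (transformer) features. All names and dimensions below are illustrative, not the released evaluation code:

```python
import torch.nn as nn
import torch.nn.functional as F

dim, rot_dim, n_classes = 512, 4, 10        # hypothetical sizes (e.g. a quaternion rotation target)
rotation_probe = nn.Linear(dim, rot_dim)    # decodes rotation from per-view backbone features
category_probe = nn.Linear(dim, n_classes)  # decodes object class from aggregate features

def probe_losses(per_view_feats, agg_feats, rotations, labels):
    # per_view_feats: (N, dim) frozen backbone outputs; agg_feats: (N, dim) frozen aggregates
    rot_loss = F.mse_loss(rotation_probe(per_view_feats), rotations)
    cls_loss = F.cross_entropy(category_probe(agg_feats), labels)
    return rot_loss, cls_loss
```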
On the 3D Invariant-Equivariant Benchmark (3DIEBench), where each object view has a different rotation, seq-JEPA achieves top performance on both invariance-related object categorization and equivariance-related rotation prediction, w/o sacrificing one for the other.

(7/10)
seq-JEPA learns invariant-equivariant representations for tasks that involve sequential observations and transformations; e.g., it can learn semantic image representations by seeing a sequence of small image patches across simulated eye movements, w/o hand-crafted augmentation or masking.

(6/10)
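As a toy illustration of this setup (my own simplification: random fixation locations, an arbitrary patch size, and a 2-D displacement as the action; the paper's sampling policy may differ):

```python
import torch

def glimpse_sequence(image, num_glimpses=4, patch=32):
    # image: (C, H, W) tensor, assumed larger than the patch size
    C, H, W = image.shape
    xs = torch.randint(0, W - patch, (num_glimpses,))   # random fixation locations
    ys = torch.randint(0, H - patch, (num_glimpses,))
    patches = torch.stack([image[:, y:y + patch, x:x + patch] for x, y in zip(xs, ys)])
    # action = relative displacement ("saccade") from the previous fixation, normalized
    dx = torch.diff(xs, prepend=xs[:1]).float() / W
    dy = torch.diff(ys, prepend=ys[:1]).float() / H
    actions = torch.stack([dx, dy], dim=1)              # (num_glimpses, 2)
    return patches, actions
```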
Post-training, the model has learned two segregated representations:

An action-invariant aggregate representation
Action-equivariant individual-view representations

💡No explicit equivariance loss or dual predictor required!

(5/10)
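A minimal sketch of what "no equivariance loss, no dual predictor" means in practice, assuming a standard JEPA-style target branch (EMA encoder) and a simple MSE objective; the paper's exact loss and update may differ:

```python
import torch
import torch.nn.functional as F

def jepa_step(encode, aggregate, predict, target_encode, views, actions, target_view):
    # encode:        action-conditioned per-view encoder (backbone + action embedding)
    # aggregate:     transformer encoder pooling the sequence into one representation
    # predict:       predictor head mapping the aggregate to the unseen view's representation
    # target_encode: EMA copy of the encoder, embeds the yet-unseen target view
    z = encode(views, actions)
    s = aggregate(z)
    pred = predict(s)
    with torch.no_grad():
        tgt = target_encode(target_view)   # no gradients through the target branch
    return F.mse_loss(pred, tgt)           # a single predictive loss, nothing else

@torch.no_grad()
def ema_update(online, target, tau=0.996):
    # slowly track the online weights (standard JEPA/BYOL-style momentum update)
    for p_o, p_t in zip(online.parameters(), target.parameters()):
        p_t.lerp_(p_o, 1.0 - tau)
```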
Inspired by this, we designed seq-JEPA, which processes sequences of views and their relative transformations (actions).

➡️ A transformer encoder aggregates these action-conditioned view representations to predict a yet unseen view.

(4/10)
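A bare-bones PyTorch sketch of this design, just to make the data flow explicit; the way actions are injected (I simply add an action embedding), the mean-pooled aggregate, and all sizes are my assumptions rather than the official implementation, and where exactly the target view's action enters the predictor is glossed over. The prediction target would be an EMA-encoded unseen view, as in the step sketched earlier:

```python
import torch.nn as nn

class SeqJEPASketch(nn.Module):
    """Illustrative seq-JEPA-style module, not the official implementation."""
    def __init__(self, backbone, dim=512, action_dim=4, n_heads=8, n_layers=3):
        super().__init__()
        self.backbone = backbone                        # e.g. a ResNet mapping image -> (B, dim)
        self.action_proj = nn.Linear(action_dim, dim)   # embed relative transformations (actions)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True)
        self.aggregator = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.predictor = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, views, actions):
        # views:   (B, T, C, H, W) sequence of observed views
        # actions: (B, T, action_dim) relative transformation associated with each view
        B, T = views.shape[:2]
        z = self.backbone(views.flatten(0, 1)).view(B, T, -1)  # per-view representations
        z = z + self.action_proj(actions)                      # action-conditioned views
        agg = self.aggregator(z).mean(dim=1)                   # aggregate representation
        return z, agg, self.predictor(agg)                     # z: equivariant, agg: invariant
```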
🧠 Humans learn to recognize new objects by moving around them, manipulating them, and probing them via eye movements. Different views of a novel object are generated through actions (manipulations & eye movements) that are then integrated to form new concepts in the brain.

(3/10)
Current SSL methods face a trade-off: optimizing for transformation invariance in representation space (useful for high-level classification) often reduces equivariance (needed for tasks that depend on details like object rotation & movement). Our world model, seq-JEPA, resolves this trade-off.

(2/10)