Alice Bizeul
@alicebizeul.bsky.social
PhD student @ETH AI Center working on self-supervised representation learning | Previously @EPFL, @MIT, Research Intern @Amazon
Personal website: https://alicebizeul.github.io
[10/🧵] This work is the result of an amazing team effort w/ Julius von Kügelgen, Alain Ryser, Thomas Sutter, Bernhard Schölkopf, Julia Vogt

📜 arXiv: arxiv.org/abs/2502.06314
👩‍💻 Code: github.com/alicebizeul/...
From Pixels to Components: Eigenvector Masking for Visual Representation Learning (arxiv.org)
[9/🧵] As a result, PMAE’s masking ratio becomes a more interpretable and robust hyperparameter!

Unlike MAEs—where the optimal ratio varies across datasets—we show that masking PCs that account for 20% of the data variance consistently yields near-optimal performance.
[8/🧵] What about the masking ratio?

In MAEs, this ratio represents the proportion of masked-out pixels.

In PMAE, we make the masking ratio more data-driven by leveraging PCA. The masking ratio now reflects the proportion of data variance captured by the set of masked PCs.
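A minimal numpy sketch of what variance-based masking can look like (the helper name `choose_masked_pcs` and the uniform-random sampling over components are illustrative assumptions, not necessarily the paper's exact scheme):

```python
import numpy as np

def choose_masked_pcs(eigenvalues: np.ndarray, ratio: float, rng: np.random.Generator) -> np.ndarray:
    """Pick a random subset of PCs whose eigenvalues sum to roughly
    `ratio` of the total variance (hypothetical helper, for illustration)."""
    total = eigenvalues.sum()
    order = rng.permutation(len(eigenvalues))      # random order over components
    cum = np.cumsum(eigenvalues[order]) / total    # variance fraction of each prefix
    k = int(np.searchsorted(cum, ratio)) + 1       # smallest prefix hitting the budget
    return order[:k]                               # indices of the PCs to mask

rng = np.random.default_rng(0)
eigvals = np.sort(rng.exponential(size=256))[::-1]   # toy eigenvalue spectrum
masked = choose_masked_pcs(eigvals, ratio=0.2, rng=rng)
print(eigvals[masked].sum() / eigvals.sum())         # ~0.2 of the total variance
```

Stopping once a variance budget is hit, rather than counting pixels, keeps the ratio comparable across datasets with very different eigenvalue spectra.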
[7/🧵] We show that PMAE outperforms MAEs at downstream image classification on the CIFAR-10, TinyImageNet, and MedMNIST datasets.

Using a ViT-Tiny backbone, we observe an average 38% improvement in linear-probing performance over MAEs with the standard 75% masking ratio.
[6/🧵] However, instead of working with a subset of pixels, the ViT processes the original image with a subset of its principal components (PCs) masked out. The model is then trained to output images that, when projected onto the masked PCs, match the ground truth.
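A minimal sketch of this objective, assuming flattened images, a PCA basis `V` with one eigenvector per column, and a generic `model` callable standing in for the ViT encoder-decoder (all names illustrative, not the official code):

```python
import numpy as np

def pmae_step(x, V, mean, masked_idx, model):
    """One PMAE-style objective evaluation (illustrative, not the official code).

    x:          (B, D) batch of flattened images
    V:          (D, D) PCA eigenvectors, one per column
    mean:       (D,)   training-set mean
    masked_idx: indices of the masked principal components
    model:      callable mapping images (B, D) -> reconstructions (B, D)
    """
    z = (x - mean) @ V                    # PC coefficients of the batch
    z_vis = z.copy()
    z_vis[:, masked_idx] = 0.0            # drop the masked components
    x_vis = z_vis @ V.T + mean            # model input: image minus masked PCs

    x_hat = model(x_vis)                  # reconstruct in pixel space
    z_hat = (x_hat - mean) @ V            # project the output onto the PC basis

    # Loss only on the masked components, mirroring MAE's masked-patch loss.
    diff = z_hat[:, masked_idx] - z[:, masked_idx]
    return np.mean(diff ** 2)

# Toy usage with an identity "model" just to check shapes:
rng = np.random.default_rng(0)
B, D = 8, 64
x = rng.normal(size=(B, D))
V, _ = np.linalg.qr(rng.normal(size=(D, D)))  # random orthonormal stand-in basis
print(pmae_step(x, V, x.mean(axis=0), np.arange(16), model=lambda v: v))
```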
[5/🧵] Our approach, Principal Masked Autoencoders (PMAE), closely follows the design of the Masked Autoencoder (MAE): a Vision Transformer (ViT) encoder-decoder is trained to reconstruct missing information from the visible parts.
[4/🧵] We posit that this reduces the redundancy between visible and masked-out information and ensures the visible information is predictive of masked-out components.
[3/🧵] Need a refresher on PCA?

For natural images, projecting the data onto its principal components partitions the information into a set of global features.

By masking principal components instead of raw pixels, we effectively mask global rather than local features.
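For intuition, a small scikit-learn sketch with low-rank synthetic data standing in for flattened natural images, whose variance is likewise concentrated in a few leading components:

```python
import numpy as np
from sklearn.decomposition import PCA

# Natural images are highly correlated across pixels, so a handful of PCs
# captures most of the variance. We mimic that here with low-rank synthetic
# data standing in for 1000 flattened 32x32 grayscale images.
rng = np.random.default_rng(0)
X = (rng.normal(size=(1000, 50)) @ rng.normal(size=(50, 1024))
     + 0.1 * rng.normal(size=(1000, 1024)))

pca = PCA().fit(X)
cum = np.cumsum(pca.explained_variance_ratio_)
# Roughly 50 of the 1024 dimensions already explain 95% of the variance.
print(int(np.searchsorted(cum, 0.95)) + 1)
```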
[2/🧵] What if, instead of masking pixels, we mask information in a more meaningful space using off-the-shelf image transformations?

We keep it simple: we consider the space of principal components and reconstruct masked-out principal components instead of raw pixels.
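Concretely, masking in PC space splits an image into two complementary projections; a sketch with a hypothetical `split_pcs` helper:

```python
import numpy as np

def split_pcs(x, V, mean, masked_idx):
    """Split an image into complementary 'visible' and 'masked' views in PC
    space (hypothetical helper, for illustration)."""
    z = (x - mean) @ V                  # coefficients of x in the PCA basis
    z_masked = np.zeros_like(z)
    z_masked[masked_idx] = z[masked_idx]
    z_visible = z - z_masked
    return z_visible @ V.T + mean, z_masked @ V.T

rng = np.random.default_rng(0)
D = 64
x = rng.normal(size=D)
V, _ = np.linalg.qr(rng.normal(size=(D, D)))  # stand-in orthonormal PCA basis
visible, masked = split_pcs(x, V, np.zeros(D), masked_idx=np.arange(8))
print(np.allclose(visible + masked, x))       # True: the views are complementary
```

The visible view is what the encoder sees; the masked view is what its output is scored against.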
[1/🧵] Unlike text, images are not compact representations. Masking and reconstructing 75% of raw pixels, a common practice in masked image modeling (MIM), can thus lead to failure cases:
❌ Visible pixels may be redundant with the masked ones.
❌ Visible pixels may be unpredictive of the masked regions.