Alice Bizeul
@alicebizeul.bsky.social
PhD student @ETH AI Center working on self-supervised representation learning | Previously @EPFL, @MIT, Research Intern @Amazon
Personal website: https://alicebizeul.github.io
[10/🧵] This work is the result of an amazing team effort w/ Julius von Kügelgen, Alain Ryser, Thomas Sutter, Bernhard Schölkopf, Julia Vogt

📜 arXiv: arxiv.org/abs/2502.06314
👩‍💻 Code: github.com/alicebizeul/...
From Pixels to Components: Eigenvector Masking for Visual Representation Learning (arxiv.org)
[9/🧵] As a result, PMAE’s masking ratio becomes a more interpretable and robust hyperparameter!

Unlike MAEs—where the optimal ratio varies across datasets—we show that masking PCs that account for 20% of the data variance consistently yields near-optimal performance.
[8/🧵] What about the masking ratio?

In MAEs, this ratio represents the proportion of masked-out pixels.

In PMAE, we make the masking ratio more data-driven by leveraging PCA. The masking ratio now reflects the proportion of data variance captured by the set of masked PCs.
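A minimal numpy sketch of what variance-based masking can look like (the helper name `choose_masked_pcs` and the uniform-random sampling over components are illustrative assumptions, not necessarily the paper's exact scheme):

```python
import numpy as np

def choose_masked_pcs(eigenvalues: np.ndarray, ratio: float, rng: np.random.Generator) -> np.ndarray:
    """Pick a random subset of PCs whose eigenvalues sum to roughly
    `ratio` of the total variance (hypothetical helper, for illustration)."""
    total = eigenvalues.sum()
    order = rng.permutation(len(eigenvalues))      # random order over components
    cum = np.cumsum(eigenvalues[order]) / total    # variance fraction of each prefix
    k = int(np.searchsorted(cum, ratio)) + 1       # smallest prefix hitting the budget
    return order[:k]                               # indices of the PCs to mask

rng = np.random.default_rng(0)
eigvals = np.sort(rng.exponential(size=256))[::-1]   # toy eigenvalue spectrum
masked = choose_masked_pcs(eigvals, ratio=0.2, rng=rng)
print(eigvals[masked].sum() / eigvals.sum())         # ~0.2 of the total variance
```

Stopping once a variance budget is hit, rather than counting pixels, keeps the ratio comparable across datasets with very different eigenvalue spectra.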
[7/🧵] We show that PMAE outperforms MAEs at downstream image classification on the CIFAR-10, TinyImageNet, and MedMNIST datasets.

Using a ViT-Tiny backbone, we observe an average 38% improvement in linear-probing performance over MAEs with the standard 75% masking ratio.
[6/🧵] However, instead of working with a subset of pixels, the ViT processes the original image with a subset of its principal components (PCs) masked out. The model is then trained to output images that, when projected onto the masked PCs, match the ground truth.
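A minimal sketch of this objective, assuming flattened images, a PCA basis `V` with one eigenvector per column, and a generic `model` callable standing in for the ViT encoder-decoder (all names illustrative, not the official code):

```python
import numpy as np

def pmae_step(x, V, mean, masked_idx, model):
    """One PMAE-style objective evaluation (illustrative, not the official code).

    x:          (B, D) batch of flattened images
    V:          (D, D) PCA eigenvectors, one per column
    mean:       (D,)   training-set mean
    masked_idx: indices of the masked principal components
    model:      callable mapping images (B, D) -> reconstructions (B, D)
    """
    z = (x - mean) @ V                    # PC coefficients of the batch
    z_vis = z.copy()
    z_vis[:, masked_idx] = 0.0            # drop the masked components
    x_vis = z_vis @ V.T + mean            # model input: image minus masked PCs

    x_hat = model(x_vis)                  # reconstruct in pixel space
    z_hat = (x_hat - mean) @ V            # project the output onto the PC basis

    # Loss only on the masked components, mirroring MAE's masked-patch loss.
    diff = z_hat[:, masked_idx] - z[:, masked_idx]
    return np.mean(diff ** 2)

# Toy usage with an identity "model" just to check shapes:
rng = np.random.default_rng(0)
B, D = 8, 64
x = rng.normal(size=(B, D))
V, _ = np.linalg.qr(rng.normal(size=(D, D)))  # random orthonormal stand-in basis
print(pmae_step(x, V, x.mean(axis=0), np.arange(16), model=lambda v: v))
```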
[5/🧵] Our approach, Principal Masked Autoencoders (PMAE), closely follows the design of the Masked Autoencoder (MAE): a Vision Transformer (ViT) encoder-decoder is trained to reconstruct missing information from the visible parts.
[4/🧵] We posit that this reduces the redundancy between visible and masked-out information and ensures the visible information is predictive of masked-out components.
[3/🧵] Need a refresher on PCA?

For natural images, projecting the data onto its principal components partitions the information into a set of global features.

By masking principal components instead of raw pixels, we effectively mask global rather than local features.
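For intuition, a small scikit-learn sketch with low-rank synthetic data standing in for flattened natural images, whose variance is likewise concentrated in a few leading components:

```python
import numpy as np
from sklearn.decomposition import PCA

# Natural images are highly correlated across pixels, so a handful of PCs
# captures most of the variance. We mimic that here with low-rank synthetic
# data standing in for 1000 flattened 32x32 grayscale images.
rng = np.random.default_rng(0)
X = (rng.normal(size=(1000, 50)) @ rng.normal(size=(50, 1024))
     + 0.1 * rng.normal(size=(1000, 1024)))

pca = PCA().fit(X)
cum = np.cumsum(pca.explained_variance_ratio_)
# Roughly 50 of the 1024 dimensions already explain 95% of the variance.
print(int(np.searchsorted(cum, 0.95)) + 1)
```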
[2/🧵] What if, instead of masking pixels, we mask information in a more meaningful space using off-the-shelf image transformations?

We keep it simple: we consider the space of principal components and reconstruct masked-out principal components instead of raw pixels.
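Concretely, masking in PC space splits an image into two complementary projections; a sketch with a hypothetical `split_pcs` helper:

```python
import numpy as np

def split_pcs(x, V, mean, masked_idx):
    """Split an image into complementary 'visible' and 'masked' views in PC
    space (hypothetical helper, for illustration)."""
    z = (x - mean) @ V                  # coefficients of x in the PCA basis
    z_masked = np.zeros_like(z)
    z_masked[masked_idx] = z[masked_idx]
    z_visible = z - z_masked
    return z_visible @ V.T + mean, z_masked @ V.T

rng = np.random.default_rng(0)
D = 64
x = rng.normal(size=D)
V, _ = np.linalg.qr(rng.normal(size=(D, D)))  # stand-in orthonormal PCA basis
visible, masked = split_pcs(x, V, np.zeros(D), masked_idx=np.arange(8))
print(np.allclose(visible + masked, x))       # True: the views are complementary
```

The visible view is what the encoder sees; the masked view is what its output is scored against.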
[1/🧵] Unlike text, images are not compact representations. Masking and reconstructing 75% of raw pixels, a common practice in masked image modeling (MIM), can thus lead to failure cases:
❌ Visible pixels may be redundant with the masked ones.
❌ Visible pixels may be unpredictive of the masked regions.