Alaa El-Nouby
alaaelnouby.bsky.social
Research Scientist at @Apple. Previous: @Meta (FAIR), @Inria, @MSFTResearch, @VectorInst, and @UofG. Egyptian 🇪🇬
Could you clarify which task you tested the checkpoints on, and which checkpoint in particular you used? Thanks!
November 22, 2024 at 4:41 PM
Hey Johan, for AIMv2 please use the last-layer features, typically after the post-trunk layer normalization.
November 22, 2024 at 4:40 PM
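A minimal NumPy sketch of the advice above: take the trunk's last-layer output and apply a layer normalization before using the features. The shapes and the mean-pooling step are illustrative assumptions, not the exact AIMv2 API.

```python
import numpy as np

def post_trunk_layer_norm(x, eps=1e-6):
    """LayerNorm over the feature dimension (no learned scale/shift here)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

# Hypothetical last-layer trunk output: (batch, num_patches, dim).
rng = np.random.default_rng(0)
trunk_output = rng.normal(size=(1, 256, 1024))

# The recommended features: last layer, after post-trunk layer normalization.
features = post_trunk_layer_norm(trunk_output)

# One common way to get a single image-level embedding: average over patches.
pooled = features.mean(axis=1)
print(features.shape, pooled.shape)  # (1, 256, 1024) (1, 1024)
```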
It has been an absolute pleasure working with Enrico, Mustafa, and the whole AIMv2 team over the past few months. We look forward to seeing our models be useful to the community.

For many more results, insights, and analyses, please check our preprint. arxiv.org/abs/2411.14402
Multimodal Autoregressive Pre-training of Large Vision Encoders
We introduce a novel method for pre-training of large-scale vision encoders. Building on recent advancements in autoregressive pre-training of vision models, we extend this framework to a multimodal s...
arxiv.org
November 22, 2024 at 8:32 AM
The open-sourced AIMv2 checkpoints support a number of fixed resolutions (224px, 336px, and 448px), as well as a native-resolution checkpoint that accepts images of variable resolutions and aspect ratios.
November 22, 2024 at 8:32 AM
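To illustrate why a native-resolution checkpoint matters: a ViT-style encoder splits an image into fixed-size patches, so images of different sizes and aspect ratios simply yield different numbers of tokens. A toy sketch (the patch size of 14 is an assumption for illustration):

```python
import numpy as np

PATCH = 14  # ViT-style patch size; an assumption for this sketch

def patchify(image):
    """Split an HxWxC image into non-overlapping PATCHxPATCH tokens.

    A native-resolution encoder consumes a variable number of such tokens,
    so images need not be resized to a single fixed resolution.
    """
    h, w, c = image.shape
    assert h % PATCH == 0 and w % PATCH == 0, "pad/crop to a multiple of the patch size"
    gh, gw = h // PATCH, w // PATCH
    patches = image.reshape(gh, PATCH, gw, PATCH, c).transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, PATCH * PATCH * c)

square = np.zeros((224, 224, 3))
wide = np.zeros((224, 448, 3))  # different aspect ratio, no resizing needed
print(patchify(square).shape[0], patchify(wide).shape[0])  # 256 512
```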
AIMv2 provides strong off-the-shelf recognition performance, with AIMv2-3B achieving 89.5% on ImageNet with a frozen trunk. We also observe consistent improvements in performance as we scale the parameters of AIMv2 (see Section 3 in the preprint).
November 22, 2024 at 8:32 AM
AIMv2 is pre-trained in a manner similar to modern VLMs; therefore, it can be integrated into them seamlessly, with even our smallest backbone (i.e., AIMv2-L) outperforming popular backbones such as OpenAI CLIP and SigLIP on multimodal understanding benchmarks.
November 22, 2024 at 8:32 AM
AIMv2 is pre-trained to autoregressively generate image patches and text tokens. It is easy to implement and train, and it can be trivially scaled to billions of parameters. We are sharing checkpoints ranging from 300M to 3B parameters, available in PyTorch, JAX, and MLX on 🤗
November 22, 2024 at 8:32 AM
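A toy paraphrase of that objective in NumPy: image patches and text tokens form one causal stream, where patch positions are trained with a regression loss on the next patch and text positions with a cross-entropy loss on the next token. All shapes, sizes, and random values here are illustrative assumptions; see the preprint for the exact losses and architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, num_text, dim, vocab = 4, 3, 8, 10
seq_len = num_patches + num_text

def causal_mask(n):
    """Lower-triangular mask: position i may only attend to positions <= i."""
    return np.tril(np.ones((n, n), dtype=bool))

# Attention over the unified stream would use this autoregressive mask.
mask = causal_mask(seq_len)

# Toy decoder states for the stream: image patches first, then text.
states = rng.normal(size=(seq_len, dim))
patch_targets = rng.normal(size=(num_patches, dim))   # continuous patch values
text_targets = rng.integers(0, vocab, size=num_text)  # discrete token ids

# Early positions predict the NEXT image patch: a regression (MSE) loss ...
patch_preds = states[: num_patches - 1]
patch_loss = np.mean((patch_preds - patch_targets[1:]) ** 2)

# ... while text positions predict the next token: a cross-entropy loss.
logits = rng.normal(size=(num_text, vocab))  # stand-in text-head outputs
log_probs = logits - np.log(np.exp(logits).sum(-1, keepdims=True))
text_loss = -log_probs[np.arange(num_text), text_targets].mean()

print(float(patch_loss) > 0, float(text_loss) > 0)
```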