Alaa El-Nouby
alaaelnouby.bsky.social
Research Scientist at @Apple. Previous: @Meta (FAIR), @Inria, @MSFTResearch, @VectorInst and @UofG . Egyptian 🇪🇬
The open-sourced AIMv2 checkpoints support several fixed resolutions (224px, 336px, and 448px), as well as a native-resolution checkpoint that accepts images of variable resolutions and aspect ratios.
November 22, 2024 at 8:32 AM
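A quick sketch of what variable-resolution support means in ViT-style terms: the number of patch tokens grows with the image size. The 14px patch size below is an assumption inferred from the checkpoint names (e.g. "patch14"), and `patch_grid` is a hypothetical helper, not part of the released code.

```python
# Sketch: patch-token counts for fixed vs. native-resolution inputs,
# assuming non-overlapping 14px patches (an assumption, not confirmed
# AIMv2 preprocessing).

def patch_grid(height: int, width: int, patch: int = 14) -> tuple[int, int]:
    """Patches along each axis; the image is assumed pre-resized so both
    sides are multiples of the patch size."""
    assert height % patch == 0 and width % patch == 0
    return height // patch, width // patch

# Fixed-resolution checkpoints produce square grids.
for res in (224, 336, 448):
    rows, cols = patch_grid(res, res)
    print(f"{res}px -> {rows}x{cols} = {rows * cols} patches")

# The native-resolution checkpoint also handles non-square aspect ratios.
rows, cols = patch_grid(448, 224)
print(f"448x224 -> {rows}x{cols} = {rows * cols} patches")
```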
AIMv2 delivers strong off-the-shelf recognition performance, with AIMv2-3B achieving 89.5% on ImageNet with a frozen trunk. We also observe consistent performance improvements as AIMv2's parameters are scaled up (see Section 3 of the preprint).
AIMv2 is pre-trained in a manner similar to modern VLMs; therefore, it integrates seamlessly with them, with even our smallest backbone (AIMv2-L) outperforming popular backbones such as OpenAI CLIP and SigLIP on multimodal understanding benchmarks.
AIMv2 is pre-trained to autoregressively generate image patches and text tokens. It is easy to implement and train, and it scales trivially to billions of parameters. We are sharing checkpoints ranging from 300M to 3B parameters, available in PyTorch, JAX, and MLX on 🤗
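A toy sketch of the autoregressive image objective described above: raster-order the patches of an image, then predict each patch from the ones before it. The "predictor" here (prefix mean) is a placeholder standing in for the transformer, and the patch size is an assumption; AIMv2 also generates text tokens, which this sketch omits.

```python
import numpy as np

# Toy next-patch-prediction objective: patchify an image in raster order
# and regress each patch from its prefix. The prefix-mean predictor is a
# stand-in for a causal transformer (an illustrative assumption).
rng = np.random.default_rng(0)
p = 14                                       # assumed patch size
image = rng.standard_normal((28, 28, 3))     # toy 2x2-patch image

# Patchify: (rows, p, cols, p, 3) -> (num_patches, patch_dim)
rows, cols = image.shape[0] // p, image.shape[1] // p
patches = image.reshape(rows, p, cols, p, 3).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(rows * cols, p * p * 3)

# Autoregressive loss: predict patch t from patches < t (causal prefix).
loss = 0.0
for t in range(1, len(patches)):
    pred = patches[:t].mean(axis=0)            # placeholder causal predictor
    loss += np.mean((pred - patches[t]) ** 2)  # pixel-level MSE per patch
loss /= len(patches) - 1
print(f"{len(patches)} patches, mean next-patch MSE: {loss:.3f}")
```

The appeal of this setup is its simplicity: one causal sequence over patches (and text tokens), one prediction loss, no contrastive pairs or teacher networks, which is what makes it straightforward to scale.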
𝗗𝗼𝗲𝘀 𝗮𝘂𝘁𝗼𝗿𝗲𝗴𝗿𝗲𝘀𝘀𝗶𝘃𝗲 𝗽𝗿𝗲-𝘁𝗿𝗮𝗶𝗻𝗶𝗻𝗴 𝘄𝗼𝗿𝗸 𝗳𝗼𝗿 𝘃𝗶𝘀𝗶𝗼𝗻? 🤔
Delighted to share AIMv2, a family of strong, scalable, and open vision encoders that excel at multimodal understanding, recognition, and grounding 🧵

paper: arxiv.org/abs/2411.14402
code: github.com/apple/ml-aim
HF: huggingface.co/collections/...