Cem Koç
@cemkoch.bsky.social
Coffee Lover • Husky Dad • ML Researcher @  • Berkeley Grad
Today we released the code and a demo iOS application for FastVLM - our extremely efficient and fast vision-language model, which runs on-device using MLX! You can check out the code and the app here: github.com/apple/ml-fas...
May 7, 2025 at 10:20 PM
What is exciting is that the FastVLM model family (VLMs with a FastViTHD vision backbone) scales very well with more SFT data, which is vital, and achieves SOTA performance while being significantly faster 🚀
December 19, 2024 at 7:10 PM
We ran multiple experiments comparing different input resolutions (256, 512, 768, 1024) and LLM sizes (0.5B, 1.5B, 7B) to find the optimal setup. FastViTHD's Pareto-optimal curve shows significant gains over FastViT (which is already better than ViTs) 👇
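A rough sketch of how such a Pareto frontier is read off a (TTFT, accuracy) sweep; the configuration names and numbers below are made-up placeholders for illustration, not our measured results:

# Minimal sketch: extract the Pareto frontier from a (TTFT, accuracy) sweep.
# All values below are hypothetical placeholders, not the paper's numbers.
configs = [
    # (name, ttft_ms, avg_accuracy)
    ("FastViTHD-256/0.5B",   80, 55.0),
    ("FastViTHD-512/0.5B",  120, 58.5),
    ("FastViTHD-768/1.5B",  210, 63.0),
    ("FastViTHD-1024/7B",   520, 68.5),
    ("FastViT-512/0.5B",    170, 57.0),
    ("FastViT-1024/7B",     900, 67.5),
]

def pareto_frontier(points):
    """Keep configs that no other config beats on both latency and accuracy."""
    frontier = []
    for name, ttft, acc in points:
        dominated = any(
            o_ttft <= ttft and o_acc >= acc and (o_ttft, o_acc) != (ttft, acc)
            for _, o_ttft, o_acc in points
        )
        if not dominated:
            frontier.append((name, ttft, acc))
    return sorted(frontier, key=lambda p: p[1])

for name, ttft, acc in pareto_frontier(configs):
    print(f"{name}: TTFT={ttft} ms, acc={acc}")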
December 19, 2024 at 6:58 PM
Text-rich tasks require high image resolutions, which increase both the vision encoding latency and the number of image tokens, which in turn raises the LLM pre-filling time. Therefore, instead of using an isotropic architecture, we use a hybrid vision backbone that can scale to higher input resolutions.
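A back-of-envelope illustration of why resolution blows up the token count for a patch-based (isotropic) encoder; the patch size of 14 and the 64x downsampling for the hybrid backbone are illustrative assumptions, not FastViTHD's exact configuration:

# How image-token count grows with input resolution (hypothetical settings).
def vit_tokens(resolution, patch=14):
    # Isotropic ViT-style encoder: one token per patch, so count grows
    # quadratically with resolution.
    return (resolution // patch) ** 2

def hybrid_tokens(resolution, downsample=64):
    # Hybrid backbone with aggressive downsampling emits far fewer tokens.
    return (resolution // downsample) ** 2

for res in (256, 512, 768, 1024):
    print(f"{res}px: ViT-style ~{vit_tokens(res)} tokens, "
          f"hybrid (stride 64) ~{hybrid_tokens(res)} tokens")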
December 19, 2024 at 6:50 PM
We measure time-to-first-token (TTFT), the wait time until the VLM returns its first response token. It combines the vision encoder latency and the LLM pre-filling time (the time the LLM takes to fill the KV-cache and output its first token); at high resolutions, the vision encoder latency dominates.
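A minimal sketch of that decomposition, with purely hypothetical latency and throughput numbers:

# TTFT ~ vision encoding time + LLM pre-filling time over all prompt tokens.
def estimate_ttft(vision_latency_ms, num_image_tokens, num_text_tokens,
                  prefill_tokens_per_ms):
    prefill_ms = (num_image_tokens + num_text_tokens) / prefill_tokens_per_ms
    return vision_latency_ms + prefill_ms

# Hypothetical high-resolution case: both the encoder latency and the
# image-token count grow, so the vision side starts to dominate TTFT.
print(estimate_ttft(vision_latency_ms=600, num_image_tokens=5329,
                    num_text_tokens=64, prefill_tokens_per_ms=20))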
December 19, 2024 at 6:42 PM
FastVLM incorporates FastViTHD, a novel hybrid vision encoder backbone designed to output fewer image tokens and significantly reduce the encoding time for high-resolution images.
December 19, 2024 at 6:34 PM
Excited about vision-language models? 🚀 Check out our latest work on FastVLM, a new family of efficient vision-language models that balances the tradeoff between high-resolution image understanding and latency without compromising accuracy!

arxiv.org/abs/2412.13303
December 19, 2024 at 6:18 PM