Cem Koç
@cemkoch.bsky.social
Coffee Lover • Husky Dad • ML Researcher @  • Berkeley Grad
Today we released the code and a demo iOS application for FastVLM - our extremely efficient and fast vision-language model, which runs on-device using MLX! You can check out the code and the app here: github.com/apple/ml-fas...
May 7, 2025 at 10:20 PM
What is exciting is that the FastVLM model family (VLMs with a FastViTHD vision backbone) scales very well with more SFT data, which is vital, and achieves SOTA performance while being significantly faster 🚀
December 19, 2024 at 7:10 PM
We ran multiple experiments comparing different input resolutions (256, 512, 768, 1024) and LLM sizes (0.5B, 1.5B, 7B) to find the optimal setup. FastViTHD's Pareto-optimal curve shows significant gains over FastViT (which is already better than ViTs) 👇
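A rough sketch of how such a Pareto frontier is read off a (TTFT, accuracy) sweep; the configuration names and numbers below are made-up placeholders for illustration, not our measured results:

# Minimal sketch: extract the Pareto frontier from a (TTFT, accuracy) sweep.
# All values below are hypothetical placeholders, not the paper's numbers.
configs = [
    # (name, ttft_ms, avg_accuracy)
    ("FastViTHD-256/0.5B",   80, 55.0),
    ("FastViTHD-512/0.5B",  120, 58.5),
    ("FastViTHD-768/1.5B",  210, 63.0),
    ("FastViTHD-1024/7B",   520, 68.5),
    ("FastViT-512/0.5B",    170, 57.0),
    ("FastViT-1024/7B",     900, 67.5),
]

def pareto_frontier(points):
    """Keep configs that no other config beats on both latency and accuracy."""
    frontier = []
    for name, ttft, acc in points:
        dominated = any(
            o_ttft <= ttft and o_acc >= acc and (o_ttft, o_acc) != (ttft, acc)
            for _, o_ttft, o_acc in points
        )
        if not dominated:
            frontier.append((name, ttft, acc))
    return sorted(frontier, key=lambda p: p[1])

for name, ttft, acc in pareto_frontier(configs):
    print(f"{name}: TTFT={ttft} ms, acc={acc}")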
December 19, 2024 at 6:58 PM
Text-rich tasks require high image resolutions, which increase both the vision encoding latency and the number of image tokens, which in turn raises the LLM pre-filling time. Therefore, instead of using an isotropic architecture, we use a hybrid vision backbone that can scale to higher input resolutions.
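A back-of-envelope illustration of why resolution blows up the token count for a patch-based (isotropic) encoder; the patch size of 14 and the 64x downsampling for the hybrid backbone are illustrative assumptions, not FastViTHD's exact configuration:

# How image-token count grows with input resolution (hypothetical settings).
def vit_tokens(resolution, patch=14):
    # Isotropic ViT-style encoder: one token per patch, so count grows
    # quadratically with resolution.
    return (resolution // patch) ** 2

def hybrid_tokens(resolution, downsample=64):
    # Hybrid backbone with aggressive downsampling emits far fewer tokens.
    return (resolution // downsample) ** 2

for res in (256, 512, 768, 1024):
    print(f"{res}px: ViT-style ~{vit_tokens(res)} tokens, "
          f"hybrid (stride 64) ~{hybrid_tokens(res)} tokens")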
December 19, 2024 at 6:50 PM
We measure time-to-first-token (TTFT), the wait time until the VLM returns its first response token. It combines the vision encoder latency and the LLM pre-filling time (the time the LLM takes to fill the KV-cache and output its first token); at high resolutions, the vision encoder latency dominates.
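A minimal sketch of that decomposition, with purely hypothetical latency and throughput numbers:

# TTFT ~ vision encoding time + LLM pre-filling time over all prompt tokens.
def estimate_ttft(vision_latency_ms, num_image_tokens, num_text_tokens,
                  prefill_tokens_per_ms):
    prefill_ms = (num_image_tokens + num_text_tokens) / prefill_tokens_per_ms
    return vision_latency_ms + prefill_ms

# Hypothetical high-resolution case: both the encoder latency and the
# image-token count grow, so the vision side starts to dominate TTFT.
print(estimate_ttft(vision_latency_ms=600, num_image_tokens=5329,
                    num_text_tokens=64, prefill_tokens_per_ms=20))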
December 19, 2024 at 6:42 PM
FastVLM incorporates FastViTHD, a novel hybrid vision encoder backbone designed to output fewer image tokens and significantly reduce the encoding time for high-resolution images.
December 19, 2024 at 6:34 PM
Excited about vision-language models? 🚀 Check out our latest work on FastVLM, a new family of efficient vision-language models that balances the tradeoff between high-resolution image understanding and latency without compromising accuracy!

arxiv.org/abs/2412.13303
December 19, 2024 at 6:18 PM