Peter Gray
peteryugray.bsky.social
AI / ML comms person (formerly Meta, Linden Lab). Guitar in Butterfly Knives. Vespa enthusiast.
tl;dr: some parameters are much more important than others, and in some cases removing just one can turn an LLM's output into nonsense
August 21, 2025 at 6:13 PM
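A hypothetical toy (the weights, shapes, and values here are illustrative, not from the paper) showing how a single outsized "super weight" in a linear layer can dominate the output, so ablating it perturbs the result far more than ablating an ordinary weight:

```python
import numpy as np

# All values are made up for illustration: ordinary small weights
# plus one large-magnitude "super weight".
W = np.full((4, 4), 0.01)   # ordinary small weights
W[2, 2] = 5.0               # a single outsized weight
x = np.ones(4)

baseline = W @ x

# Ablate (zero out) the super weight vs. an ordinary weight.
ablate_super = W.copy(); ablate_super[2, 2] = 0.0
ablate_plain = W.copy(); ablate_plain[0, 0] = 0.0

d_super = np.linalg.norm(ablate_super @ x - baseline)  # 5.0
d_plain = np.linalg.norm(ablate_plain @ x - baseline)  # 0.01
print(d_super / d_plain)  # 500.0
```

In a real LLM the effect compounds across layers, which is why zeroing one such parameter can collapse generation quality entirely.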
The inference code, model checkpoints, and an iOS/macOS demo app based on MLX are available here: github.com/apple/ml-fas...
GitHub - apple/ml-fastvlm: This repository contains the official implementation of "FastVLM: Efficient Vision Encoding for Vision Language Models" - CVPR 2025
July 23, 2025 at 6:35 PM
How fast is it? Here's the demo app running FastVLM 0.5B model on iPhone 16 Pro. Time to first token is shown on the screen, highlighting near real-time performance.
July 23, 2025 at 6:35 PM
And for a comprehensive overview of Apple research at the conference - including the complete schedule of orals, posters, workshops, booth programming and more - see this post: machinelearning.apple.com/updates/appl...
July 11, 2025 at 5:12 PM
Accepted as a Spotlight at @iclr-conf.bsky.social, the work shares a new method for fine-grained control over #genAI output - without the computational overhead, complexity, and volume of data needed by #RLHF or fine-tuning, and with more reliable results than prompt engineering.
April 10, 2025 at 5:28 PM
Congratulations!
January 16, 2025 at 6:26 PM
Devs can now benefit from faster inference for their production LLMs on NVIDIA GPUs - benchmarking shows 2.7x acceleration in token generation. 5/5
December 18, 2024 at 10:15 PM
To make this advancement production-ready for NVIDIA GPUs, the team collaborated with NVIDIA to integrate ReDrafter into the NVIDIA TensorRT-LLM framework: developer.nvidia.com/blog/nvidia-... 4/5
NVIDIA TensorRT-LLM Now Supports Recurrent Drafting for Optimizing LLM Inference | NVIDIA Technical Blog
Recurrent drafting (referred to as ReDrafter) is a novel speculative decoding technique developed and open-sourced by Apple for large language model (LLM) inference now available with NVIDIA TensorRT-LLM...
December 18, 2024 at 10:15 PM
Earlier this year, Apple Machine Learning researchers published & open sourced ReDrafter, a novel approach to speculative decoding: machinelearning.apple.com/research/rec... 2/5
December 18, 2024 at 10:15 PM
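The core draft-and-verify loop behind speculative decoding can be sketched minimally as below. This is a generic greedy-decoding illustration, not Apple's ReDrafter implementation; the `target_next` and `draft_next` functions are stand-ins for a large target model and a cheap draft model:

```python
# Minimal greedy speculative-decoding sketch (illustrative only; not
# Apple's ReDrafter). A cheap draft model proposes k tokens; the target
# model verifies them and accepts the longest matching prefix, so each
# step emits at least one token while calling the target model less often.

def target_next(ctx):
    # stand-in "large model": deterministic toy next-token rule
    return (sum(ctx) * 31 + 7) % 100

def draft_next(ctx):
    # stand-in "draft model": agrees with the target most of the time
    t = target_next(ctx)
    return t if t % 5 else (t + 1) % 100   # diverges on multiples of 5

def speculative_step(ctx, k=4):
    # 1) draft k tokens autoregressively with the cheap model
    draft, c = [], list(ctx)
    for _ in range(k):
        tok = draft_next(c)
        draft.append(tok)
        c.append(tok)
    # 2) verify: accept the longest prefix the target model agrees with,
    #    then emit one corrected token from the target itself
    accepted, c = [], list(ctx)
    for tok in draft:
        if target_next(c) == tok:
            accepted.append(tok)
            c.append(tok)
        else:
            break
    accepted.append(target_next(c))   # always gain at least one token
    return accepted

print(speculative_step([1, 2, 3]))
```

By construction the accepted tokens exactly match what greedy decoding with the target model alone would produce; the speedup comes from verifying the drafted tokens in a single batched target pass rather than one pass per token.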