refractai.bsky.social
@refractai.bsky.social
Yes for prompt processing: based on github.com/ggml-org/lla..., it scales near-linearly with GPU core count (FLOPS).
Performance of llama.cpp on Apple Silicon M-series · ggml-org llama.cpp · Discussion #4167
Summary (LLaMA 7B; preview truncated):
✅ M1 [1], 68 GB/s, 7 GPU cores: F16 PP 108.21 t/s, F16 TG 7.92 t/s, Q8_0 PP 107.81 t/s, Q8_0 TG 14.19 t/s, ...
✅ M1 [1], 68 GB/s, 8 GPU cores: F16 PP 117.25 t/s, F16 TG 7.91 t/s, Q8_0 PP 117.96 t/s, Q8_0 TG 14.15 t/s, ...
✅ M1... (remaining rows cut off in preview)
github.com
March 10, 2025 at 12:10 AM
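[The near-linear claim can be sanity-checked directly from the two M1 rows visible in the link preview above. A minimal Python sketch; the throughput values are copied from the linked table, everything else is illustrative:]

```python
# Scaling check: F16 prompt-processing (PP) throughput for the
# 7-core vs. 8-core M1, from the linked discussion's summary table.
pp = {7: 108.21, 8: 117.25}  # GPU cores -> PP tokens/s

core_ratio = 8 / 7                 # ~1.14x more cores
throughput_ratio = pp[8] / pp[7]   # ~1.08x more PP throughput

print(f"core ratio:       {core_ratio:.3f}")
print(f"throughput ratio: {throughput_ratio:.3f}")
# PP throughput grows with core count (roughly, if not perfectly,
# linearly), consistent with prompt processing being compute bound.
```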
Prompt processing is compute-bound (raw FLOPS), unlike token generation, which is memory-bandwidth-bound. The M2 Max is 13 TFLOPS; an Nvidia 3090 is 35 TFLOPS. It's just that the Mac's GPU is small.
March 9, 2025 at 11:00 PM
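[To make the compute-bound vs. bandwidth-bound distinction concrete, here is a rough roofline sketch in Python. The TFLOPS figures come from the post above; the ~2 FLOPs-per-parameter-per-token cost is a common rule of thumb, and the bandwidth figures (400 GB/s M2 Max, 936 GB/s RTX 3090) are public specs, all used purely for illustration:]

```python
# Back-of-envelope roofline for a 7B model: prompt processing (PP)
# is limited by compute, token generation (TG) by memory bandwidth.
# Assumes ~2 FLOPs per parameter per token for a forward pass, and
# that TG must stream all weight bytes once per generated token.

params = 7e9                  # 7B parameters
flops_per_token = 2 * params  # rule-of-thumb forward-pass cost
bytes_per_param = 1           # Q8_0: ~1 byte per parameter

m2_max_flops, m2_max_bw = 13e12, 400e9      # 13 TFLOPS, 400 GB/s
rtx3090_flops, rtx3090_bw = 35e12, 936e9    # 35 TFLOPS, 936 GB/s

for name, flops, bw in [("M2 Max", m2_max_flops, m2_max_bw),
                        ("RTX 3090", rtx3090_flops, rtx3090_bw)]:
    pp_ceiling = flops / flops_per_token           # compute-bound
    tg_ceiling = bw / (params * bytes_per_param)   # bandwidth-bound
    print(f"{name}: PP ceiling ~{pp_ceiling:,.0f} t/s, "
          f"TG ceiling ~{tg_ceiling:,.0f} t/s")

# M2 Max:   PP ~929 t/s,   TG ~57 t/s
# RTX 3090: PP ~2,500 t/s, TG ~134 t/s
# The PP gap (~2.7x) tracks the FLOPS gap, matching the point above.
```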
JetFormer isn't on there yet, right?
December 3, 2024 at 1:47 AM