Isaac
@isaac-gerber.bsky.social
Data science, AI, ML.

Sci-fi and fantasy books, and gaming.
The key idea is to separate context encoding from query processing/token generation. The context encoding stage is then divided across multiple blocks for parallel computation.
November 28, 2024 at 2:01 PM
There's a tunable parameter that lets you balance the tradeoff between accuracy and speed. Best of all, it works through a different mechanism than other popular LLM optimization techniques like Flash Attention and KV cache compression, which means you can combine it with them for further speedups.
November 28, 2024 at 2:01 PM
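A minimal sketch of the two-phase idea described above, in plain NumPy. Everything here (encode_context, process_query, block_size, the single-head attention) is my own assumption for illustration, not the actual implementation; block_size stands in for the tunable knob, though in this single-layer toy it only changes how the encoding work is chunked for parallelism.

```python
import numpy as np

D = 64  # head dimension (assumed for the toy example)

def attention(q, k, v):
    """Plain single-head scaled dot-product attention."""
    scores = q @ k.T / np.sqrt(D)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def encode_context(ctx_q, ctx_k, ctx_v, block_size):
    """Phase 1: context encoding, split into independent blocks.

    Each block attends only to itself, so there is no cross-block
    dependency and blocks can be processed in parallel on separate
    workers. What gets kept is each block's slice of the KV cache.
    """
    kv_cache = []
    for start in range(0, ctx_q.shape[0], block_size):
        q = ctx_q[start:start + block_size]
        k = ctx_k[start:start + block_size]
        v = ctx_v[start:start + block_size]
        _ = attention(q, k, v)   # block-local attention; in a real model
                                 # this output would feed the next layer
        kv_cache.append((k, v))
    return kv_cache

def process_query(query_q, kv_cache):
    """Phase 2: query processing / token generation.

    Query tokens attend over the concatenated KV cache from all blocks.
    """
    k = np.concatenate([k for k, _ in kv_cache])
    v = np.concatenate([v for _, v in kv_cache])
    return attention(query_q, k, v)

# Toy usage: smaller blocks mean more parallel chunks during encoding,
# larger blocks keep attention closer to the full-context baseline. In a
# real multi-layer model that is where the accuracy/speed tradeoff shows up.
rng = np.random.default_rng(0)
ctx_q, ctx_k, ctx_v = (rng.standard_normal((1024, D)) for _ in range(3))
query_q = rng.standard_normal((4, D))
cache = encode_context(ctx_q, ctx_k, ctx_v, block_size=256)
out = process_query(query_q, cache)
print(out.shape)  # (4, 64)
```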
but are they too hot?
November 27, 2024 at 1:52 AM
for now
November 26, 2024 at 8:28 PM
true but is rothfuss ever going to finish it? i’d love you to be right!
November 25, 2024 at 1:54 AM
this seems sadly most likely to me too
November 25, 2024 at 1:53 AM