Raphael Schumann
@schumann.bsky.social
Natural Language Processing PhD Student @ Heidelberg University.

https://schumann.pub

#NLP #NLProc #ML #AI
It also works with Flash Attention 2, although I don't see additional speedups. I don't think FA is optimized for generation.
October 13, 2023 at 11:35 AM
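For reference, Flash Attention 2 can be switched on at model load time along these lines (a minimal sketch, assuming a recent transformers release with FA2 support; the model name is a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",            # placeholder model
    torch_dtype=torch.bfloat16,            # FA2 requires fp16/bf16 weights
    attn_implementation="flash_attention_2",
)
```

The rest of the recipe in the post below stays unchanged.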
Turns out that with the right attention_mask and position_ids you can prefill tokens AND pad batches in huggingface transformers. This speeds up inference, especially if each instance has the same system prompt prepended. Code below ↓
October 13, 2023 at 11:34 AM
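The code attached to the post is not reproduced here; the following is a minimal sketch of the idea, not the author's original code. It prefills a shared system prompt once, broadcasts its KV cache across a left-padded batch of user prompts, and passes an explicit attention_mask and position_ids so the cached prefix and the padding line up. Model name and prompts are placeholders, and the legacy tuple cache format is assumed; details may differ across transformers versions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any decoder-only causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

system_prompt = "You are a helpful assistant. "
user_prompts = ["Question: What is 2+2? Answer:", "Question: Name a color. Answer:"]
batch_size = len(user_prompts)

with torch.no_grad():
    # 1) Prefill the shared system prompt once and keep its KV cache.
    sys_ids = tokenizer(system_prompt, return_tensors="pt").input_ids
    sys_out = model(sys_ids, use_cache=True)
    prefix_len = sys_ids.shape[1]

    # Broadcast the cached keys/values from batch size 1 to the batch size.
    past = tuple(
        (k.expand(batch_size, -1, -1, -1).contiguous(),
         v.expand(batch_size, -1, -1, -1).contiguous())
        for k, v in sys_out.past_key_values
    )

    # 2) Left-pad the user prompts so all sequences end at the same column.
    batch = tokenizer(user_prompts, return_tensors="pt", padding=True)
    input_ids = batch.input_ids

    # 3) The attention_mask must cover prefix + padded batch; position_ids
    #    must continue from the prefix and skip over the padding tokens.
    attention_mask = torch.cat(
        [torch.ones(batch_size, prefix_len, dtype=torch.long), batch.attention_mask],
        dim=1,
    )
    position_ids = (attention_mask.cumsum(-1) - 1).clamp(min=0)[:, prefix_len:]

    out = model(
        input_ids=input_ids,
        attention_mask=attention_mask,
        position_ids=position_ids,
        past_key_values=past,
        use_cache=True,
    )
    next_tokens = out.logits[:, -1, :].argmax(dim=-1)
    print(tokenizer.batch_decode(next_tokens[:, None]))
```

The same mask/position bookkeeping carries over to a full decoding loop: every instance reuses the one prefilled system prompt, so the shared prefix is only ever computed once.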