Raphael Schumann
@schumann.bsky.social
Natural Language Processing PhD Student @ Heidelberg University.

https://schumann.pub

#NLP #NLProc #ML #AI
It also works with Flash Attention 2, although I don't see additional speedups. I don't think FA is optimized for generation.
October 13, 2023 at 11:35 AM
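For reference, Flash Attention 2 can be switched on at model load time along these lines (a minimal sketch, assuming a recent transformers release with FA2 support; the model name is a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",            # placeholder model
    torch_dtype=torch.bfloat16,            # FA2 requires fp16/bf16 weights
    attn_implementation="flash_attention_2",
)
```

The rest of the recipe in the post below stays unchanged.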
Turns out that with the right attention_mask and position_ids you can prefill tokens AND pad batches in huggingface transformers. This speeds up inference, especially if each instance has the same system prompt prepended. Code below ↓
October 13, 2023 at 11:34 AM
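The code attached to the post is not reproduced here; the following is a minimal sketch of the idea, not the author's original code. It prefills a shared system prompt once, broadcasts its KV cache across a left-padded batch of user prompts, and passes an explicit attention_mask and position_ids so the cached prefix and the padding line up. Model name and prompts are placeholders, and the legacy tuple cache format is assumed; details may differ across transformers versions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any decoder-only causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

system_prompt = "You are a helpful assistant. "
user_prompts = ["Question: What is 2+2? Answer:", "Question: Name a color. Answer:"]
batch_size = len(user_prompts)

with torch.no_grad():
    # 1) Prefill the shared system prompt once and keep its KV cache.
    sys_ids = tokenizer(system_prompt, return_tensors="pt").input_ids
    sys_out = model(sys_ids, use_cache=True)
    prefix_len = sys_ids.shape[1]

    # Broadcast the cached keys/values from batch size 1 to the batch size.
    past = tuple(
        (k.expand(batch_size, -1, -1, -1).contiguous(),
         v.expand(batch_size, -1, -1, -1).contiguous())
        for k, v in sys_out.past_key_values
    )

    # 2) Left-pad the user prompts so all sequences end at the same column.
    batch = tokenizer(user_prompts, return_tensors="pt", padding=True)
    input_ids = batch.input_ids

    # 3) The attention_mask must cover prefix + padded batch; position_ids
    #    must continue from the prefix and skip over the padding tokens.
    attention_mask = torch.cat(
        [torch.ones(batch_size, prefix_len, dtype=torch.long), batch.attention_mask],
        dim=1,
    )
    position_ids = (attention_mask.cumsum(-1) - 1).clamp(min=0)[:, prefix_len:]

    out = model(
        input_ids=input_ids,
        attention_mask=attention_mask,
        position_ids=position_ids,
        past_key_values=past,
        use_cache=True,
    )
    next_tokens = out.logits[:, -1, :].argmax(dim=-1)
    print(tokenizer.batch_decode(next_tokens[:, None]))
```

The same mask/position bookkeeping carries over to a full decoding loop: every instance reuses the one prefilled system prompt, so the shared prefix is only ever computed once.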