arxiv.org/abs/2503.08640
github.com/millix19/dbsa
Thank you to my collaborators! Chin-Jou Li, Yilin Zhang,
@abertsch.bsky.social @gneubig.bsky.social
- preceding context + attention sink are both critical for making block-sparse attention work without additional training (see the rough mask sketch below).
- grouping examples for encoding & retrieval also boosts performance vs. purely individual retrieval.
[5/n]
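To make the attention pattern concrete, here is a minimal, hypothetical sketch of a block-level mask that combines an attention sink with a preceding-context window; block sizes and the exact pattern used in DBSA may differ.

```python
import numpy as np

def streaming_block_sparse_mask(num_blocks: int, sink_blocks: int = 1, window_blocks: int = 2) -> np.ndarray:
    """Block-level mask: entry [q, k] is True if query block q may attend to key block k.

    Illustrative only; not the paper's implementation.
    """
    mask = np.zeros((num_blocks, num_blocks), dtype=bool)
    for q in range(num_blocks):
        mask[q, :sink_blocks] = True                       # attention sink: always-visible first block(s)
        mask[q, max(0, q - window_blocks):q + 1] = True    # preceding-context window + current block
    return mask

print(streaming_block_sparse_mask(6).astype(int))
```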
Yes, caching thousands of examples can take substantial space. However, the cache is also easy to re-compute if needed, unlike fine-tuned parameters, which also require substantial storage across a large number of tasks and are often stored indefinitely.
[4/n]
We evaluate DBSA with Llama models at context lengths up to 90k. DBSA achieves per-request latency comparable to fine-tuning while maintaining, on average, >95% of the best accuracy.
[3/n]
- DBSA pre-encodes the many-shot examples with streaming block-sparse attention, allowing constant encoding time for new demos.
- During inference, it dynamically selects the most relevant KV chunks for each test query, with any retrieval method (see the toy sketch below).
[2/n]
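As a rough illustration of the encode-once, retrieve-per-query flow, here is a self-contained toy sketch. The hash-based embedding and the string stand-ins for KV chunks are placeholders, not the actual DBSA implementation; any retrieval method could be swapped in.

```python
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Hypothetical embedding (hash-based bag of words), just for the demo."""
    v = np.zeros(dim)
    for tok in text.lower().split():
        v[hash(tok) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

# 1) Offline: encode each demo group once and store (embedding, kv_chunk).
#    In DBSA the kv_chunk would be the block-sparse-encoded KV cache; here it is a stub string.
demo_groups = ["translate cat -> chat", "translate dog -> chien", "sum 2+2 -> 4"]
cache = [(embed(g), f"<kv for: {g}>") for g in demo_groups]

# 2) Online: for each test query, select the most relevant cached chunks
#    and condition generation on them (generation itself is omitted here).
def select_chunks(query: str, k: int = 2):
    q = embed(query)
    scores = [float(q @ e) for e, _ in cache]
    top = np.argsort(scores)[::-1][:k]
    return [cache[i][1] for i in top]

print(select_chunks("translate bird"))
```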