In the link above, they talk about batching with flash attention. However, they seem to say that we should pack all batch elements into one sequence rather than using the usual batching-and-padding approach.
I'm really quite lost; it would be really useful to see an example of how to implement this.
Bit of a late response here, but I believe I can provide an answer. I was curious why flex-attention (PyTorch's implementation of flash attention et al.) worked much better without batching, and I found the answer to your question in my search.
Anyways, what you're looking for is to create a block mask (a matrix that defines which queries can attend to which keys) over the flattened sequence, in which each token can only attend to tokens that are: a) prior (causal), and b) from the same original sequence (i.e. the same batch element before flattening). A sketch of this follows below.
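For concreteness, here is a minimal sketch of how such a mask could be built with `create_block_mask` and passed to `flex_attention`. It assumes a recent PyTorch with `torch.nn.attention.flex_attention` available; the sequence lengths, `doc_ids` layout, and shapes are my own illustrative choices, not taken from the linked example.

```python
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

device = "cuda"

# Three sequences of lengths 4, 3, and 5 packed into one flat sequence of length 12.
# doc_ids[i] records which original sequence token i came from.
lengths = [4, 3, 5]
doc_ids = torch.cat(
    [torch.full((n,), i, dtype=torch.long) for i, n in enumerate(lengths)]
).to(device)
total_len = doc_ids.numel()  # 12

def causal_document_mask(b, h, q_idx, kv_idx):
    # a) causal: a query may only attend to itself and earlier positions
    causal = q_idx >= kv_idx
    # b) same sequence: queries never attend across packing boundaries
    same_doc = doc_ids[q_idx] == doc_ids[kv_idx]
    return causal & same_doc

# Build the block mask once; B=None / H=None broadcast over batch and heads.
block_mask = create_block_mask(
    causal_document_mask, B=None, H=None,
    Q_LEN=total_len, KV_LEN=total_len, device=device,
)

# Packed inputs: batch dimension of 1, all tokens flattened into one sequence.
n_heads, head_dim = 8, 64
q = torch.randn(1, n_heads, total_len, head_dim, device=device, dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

out = flex_attention(q, k, v, block_mask=block_mask)  # [1, n_heads, total_len, head_dim]
```

In practice you would wrap `flex_attention` in `torch.compile` to get the fused kernel, and recompute the block mask whenever the packing (and hence `doc_ids`) changes.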
FlexAttention calls this document masking. An implementation can be seen here: