How does the Transformer handle different batch sizes?

As far as I understand, input sequences are padded by the tokenizer to the length of the longest sequence in the batch (dynamic padding). However, different batches can end up with different sequence lengths, since the longest sequence varies from batch to batch.

So, even though all input samples within a batch have the same length, this is not true across batches. How, then, does the Transformer handle batches with different sequence lengths?

This is a diagram of Multi-Head Attention from the Transformer paper “Attention Is All You Need”, which is the first layer in the encoder block of the Transformer.

In this representation, for example, the Queries have size (sequence_length, 64), where sequence_length appears to be fixed and unable to vary from batch to batch; hence my confusion. Each batch has shape (batch_size, max_sequence_len, embed_dim), and max_sequence_len varies depending on the batch.
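To illustrate what I mean, here is a minimal sketch (assuming a Hugging Face `transformers` tokenizer, e.g. `bert-base-uncased`, chosen only for illustration) showing that dynamic padding pads each batch only to its own longest sequence, so the second dimension changes from batch to batch:

```python
# Minimal sketch: dynamic padding pads to the longest sequence *within* each batch,
# so max_sequence_len differs between batches.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

batch_1 = ["a short sentence", "another one"]
batch_2 = ["this batch contains a noticeably longer example sentence than the first one",
           "tiny"]

enc_1 = tokenizer(batch_1, padding=True, return_tensors="pt")
enc_2 = tokenizer(batch_2, padding=True, return_tensors="pt")

print(enc_1["input_ids"].shape)  # e.g. torch.Size([2, 5])  -> (batch_size, max_seq_len of batch 1)
print(enc_2["input_ids"].shape)  # e.g. torch.Size([2, 15]) -> (batch_size, max_seq_len of batch 2)
```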

Thanks in advance.

Hi,
If you look at the definition of the MultiheadAttention module, it only requires that key and value have the same shape, say L1 * N * E, while query can have a different shape, say L2 * N * E.
Here L1 and L2 are the (padded) largest sequence lengths of the corresponding inputs, N is the batch size, and E is the embedding size.

The output then has the same shape as the query, i.e. L2 * N * E.

Across different batches, the MultiheadAttention module only requires E to be a constant. Both L1 and L2 can change across batches.
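Here is a minimal sketch with `torch.nn.MultiheadAttention` (the shapes and number of heads are arbitrary, just for illustration): the same module, built once with a fixed embedding size, accepts batches with completely different sequence lengths.

```python
# Only the embedding size E is fixed when the module is constructed;
# the sequence lengths (L1, L2) and batch size N can change on every call.
import torch
import torch.nn as nn

E = 64                                                   # embedding size, fixed
attn = nn.MultiheadAttention(embed_dim=E, num_heads=8)   # expects (L, N, E) by default

# Batch A: query length 10, key/value length 12, batch size 4
q_a  = torch.randn(10, 4, E)
kv_a = torch.randn(12, 4, E)
out_a, _ = attn(q_a, kv_a, kv_a)
print(out_a.shape)  # torch.Size([10, 4, 64]) -- same shape as the query

# Batch B: different lengths, same module, no changes needed
q_b  = torch.randn(37, 4, E)
kv_b = torch.randn(20, 4, E)
out_b, _ = attn(q_b, kv_b, kv_b)
print(out_b.shape)  # torch.Size([37, 4, 64])
```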

I had the same question. I also wanted to ask @alejopaullier how that diagram was made; it looks great.

The diagrams are from this excellent video.

I also made a Kaggle notebook on the topic, in case you are interested.