How does the Transformer handle different batch sizes?

As far as I understand, the tokenizer pads input sequences to the length of the longest sequence in the batch (dynamic padding). However, different batches can have different padded lengths, since the length of the longest sequence varies from batch to batch.

So, even though all input samples within a batch have the same length, this is not true across batches. How, then, does the Transformer handle batches with different sequence lengths?
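To make the setup concrete, here is a minimal sketch of dynamic padding (using PyTorch's `pad_sequence` as a stand-in for the tokenizer, with made-up token ids), showing that each batch gets padded to its own longest sequence:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Two batches of token-id sequences (hypothetical data; in practice
# these lengths come from your tokenizer).
batch_a = [torch.tensor([1, 2, 3]), torch.tensor([4, 5])]
batch_b = [torch.tensor([1, 2, 3, 4, 5, 6]), torch.tensor([7])]

# Dynamic padding: each batch is padded to ITS OWN longest sequence.
padded_a = pad_sequence(batch_a, batch_first=True)
padded_b = pad_sequence(batch_b, batch_first=True)

print(padded_a.shape)  # torch.Size([2, 3])
print(padded_b.shape)  # torch.Size([2, 6])
```

So `max_sequence_len` is 3 for the first batch and 6 for the second, which is exactly the situation the question is about.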

This is a diagram of Multi-Head Attention from the Transformer paper “Attention Is All You Need”; it is the first layer in the encoder block of the Transformer.

In this representation, for example, the Queries have size (sequence_length, 64), where sequence_length looks fixed and unable to vary from batch to batch; hence my confusion. Each batch has shape (batch_size, max_sequence_len, embed_dim), and max_sequence_len varies from batch to batch.

Thanks in advance.

If you look at the definition of the MultiheadAttention module, it only requires that key and value have the same shape, say L1 * N * E, while the query can have a different shape, say L2 * N * E.
Here L1 and L2 are the longest sequence lengths in the corresponding batch, N is the batch size, and E is the embedding size.

The output then has the same shape as the query, i.e. L2 * N * E.

Across different batches, the MultiheadAttention module only requires E to be a constant. Both L1 and L2 can change across batches.
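A minimal PyTorch sketch of this point (the shapes here are made up for illustration): the same `nn.MultiheadAttention` module happily processes batches with different query and key/value lengths, as long as the embedding dimension E stays fixed.

```python
import torch
import torch.nn as nn

E = 64  # embedding dim: the only size that must stay constant
mha = nn.MultiheadAttention(embed_dim=E, num_heads=8)  # expects (L, N, E)

# Batch 1: query length 10, key/value length 12, batch size 4
q1 = torch.randn(10, 4, E)
kv1 = torch.randn(12, 4, E)
out1, _ = mha(q1, kv1, kv1)
print(out1.shape)  # torch.Size([10, 4, 64]), same as the query

# Batch 2: different lengths, same module, no changes needed
q2 = torch.randn(7, 4, E)
kv2 = torch.randn(30, 4, E)
out2, _ = mha(q2, kv2, kv2)
print(out2.shape)  # torch.Size([7, 4, 64])
```

The learned weights are the projection matrices of shape (E, E), so they never depend on the sequence length; that is why variable-length batches work without any resizing.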

I had the same question. I just wanted to ask @alejopaullier how that diagram was made. It looks great.

Diagrams are from this excellent video

I also made a Kaggle notebook on the topic in case you are interested :+1: