As far as I understand, input sequences are padded by the tokenizer to the length of the longest sequence in the batch (dynamic padding). However, different batches can end up with different sequence lengths, since the length of the longest sequence varies from batch to batch.
So, even though all input samples within a single batch have the same length, this is not true across batches. How, then, does the Transformer handle batches of different sequence lengths?
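For concreteness, here is a minimal sketch of what I mean, assuming the Hugging Face transformers tokenizer (with padding=True each batch is padded to its own longest sequence; the model name, example sentences, and printed shapes are just placeholders):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

batch_1 = ["a short sentence", "another one"]
batch_2 = ["this batch happens to contain one much longer sentence than the first batch"]

# padding=True pads each batch to the length of its own longest sequence
enc_1 = tokenizer(batch_1, padding=True, return_tensors="pt")
enc_2 = tokenizer(batch_2, padding=True, return_tensors="pt")

print(enc_1["input_ids"].shape)  # e.g. torch.Size([2, 5])  -- padded to batch_1's longest
print(enc_2["input_ids"].shape)  # e.g. torch.Size([1, 16]) -- a different padded length
```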
This is a diagram of Multi-Head Attention from the Transformer paper “Attention is All You Need”, which is the first sub-layer in the Transformer’s encoder block.
In this representation, for example, the Queries have size (sequence_length, 64), where sequence_length appears to be fixed and cannot vary from batch to batch. Hence my confusion: each batch has shape (batch_size, max_sequence_len, embed_dim), and max_sequence_len varies depending on the batch.
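To make the shapes concrete, here is a toy sketch of what I am describing (the embed_dim of 512 and per-head size of 64 are the values from the paper; the batch shapes are made up for illustration):

```python
import torch

embed_dim, d_k = 512, 64              # per-head query size of 64, as in the paper
W_q = torch.randn(embed_dim, d_k)     # the learned projection has a fixed shape

# two batches whose padded lengths differ (shapes made up for illustration)
batch_a = torch.randn(8, 10, embed_dim)   # (batch_size, max_sequence_len=10, embed_dim)
batch_b = torch.randn(8, 17, embed_dim)   # (batch_size, max_sequence_len=17, embed_dim)

Q_a = batch_a @ W_q   # (8, 10, 64) -- the first dimension of the Queries follows the batch
Q_b = batch_b @ W_q   # (8, 17, 64)
```

So the Queries' first dimension seems to depend on the batch, which is exactly what I don't understand given the fixed sequence_length in the diagram.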
Thanks in advance.