I can’t quite work out why padding to the longest sequence in a batch is any better than padding to the full model input, i.e. 512. How are the matrix multiplications different for an input that’s, say, [batch_size, 100] vs [batch_size, 512] — is there some sort of input packing going on? Empirically, the pad-to-longest method is faster.
Where I’m coming from is that the pre-trained weights are a fixed size, i.e. d x 512.
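To make the shape question concrete, here’s a toy numpy sketch of how I’m picturing it (dimensions and names are made up for illustration):

```python
import numpy as np

batch, d = 8, 768

# A fixed pre-trained weight of size d x 512, as I understand it
# (written here as [512, d] so the sequence dimension contracts against it).
w = np.random.randn(512, d)

# Padded to the full model input: shapes line up.
full = np.random.randn(batch, 512)
out = full @ w  # -> [batch, d]

# Padded only to the longest sequence in the batch:
short = np.random.randn(batch, 100)
# short @ w  # shape mismatch -- so how does this multiply at all?
```

If the weight really is fixed at 512 on one side, I don’t see how the shorter input even multiplies against it without being padded back up, let alone why it would be faster.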