What's the maths behind padding_to_longest vs padding_to_model_max_len?

I can’t quite work out why padding to the longest sequence in a batch is any better than padding to the full model input, i.e. 512. How are the matrix multiplications different for an input that’s, say, [batch_size, 100] vs [batch_size, 512]? Is there some sort of input packing going on? Empirically, the longest method is faster.

Where I’m coming from is that the pre-trained weights are a fixed size, i.e. d x 512.
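
For reference, here is a minimal sketch (not part of the original post) of the two padding strategies using the Hugging Face tokenizer options; "bert-base-uncased" and the example sentences are just assumed placeholders:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
texts = ["a short sentence", "a slightly longer sentence than the first one"]

# Pad only to the longest sequence in this batch
batch_longest = tokenizer(texts, padding="longest", return_tensors="pt")
print(batch_longest["input_ids"].shape)   # e.g. torch.Size([2, 10])

# Pad every sequence to the model's maximum input length (512 for BERT-style models)
batch_max = tokenizer(texts, padding="max_length", max_length=512, return_tensors="pt")
print(batch_max["input_ids"].shape)       # torch.Size([2, 512])
```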

Yes, but your hidden states have three dimensions: batch_size x seq_len x d. While d is fixed by the model, seq_len varies, and keeping it as short as possible makes the computation faster.
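
A small sketch of why that is (assumed shapes, not from the thread): the pre-trained weight matrices are d x d regardless of sequence length, and the hidden states are multiplied against them, so the work grows with seq_len:

```python
import torch

batch_size, d = 8, 768
W = torch.randn(d, d)                      # fixed-size pretrained weight matrix

for seq_len in (100, 512):
    hidden = torch.randn(batch_size, seq_len, d)
    out = hidden @ W                       # [batch_size, seq_len, d] @ [d, d]
    flops = 2 * batch_size * seq_len * d * d   # multiply-adds scale linearly with seq_len
    print(seq_len, tuple(out.shape), f"{flops / 1e9:.1f} GFLOPs")
```

The self-attention scores are additionally [seq_len, seq_len] per head, so that part of the model scales quadratically with padded length, which makes padding to 512 even more wasteful.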
