I can’t quite work out why padding to the longest sequence in a batch is any better than padding to the full model input, i.e. 512. How are the matrix multiplications different for an input that’s, say, [batch_size, 100] vs [batch_size, 512] — is there some sort of input packing going on? Empirically, the pad-to-longest method is faster.
Where I’m coming from is that the pre-trained weights are a fixed size, i.e. d x 512.
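To make the shape question concrete, here’s a toy numpy sketch of how I’m picturing it (dimensions and names are made up for illustration):

```python
import numpy as np

batch, d = 8, 768

# A fixed pre-trained weight of size d x 512, as I understand it
# (written here as [512, d] so the sequence dimension contracts against it).
w = np.random.randn(512, d)

# Padded to the full model input: shapes line up.
full = np.random.randn(batch, 512)
out = full @ w  # -> [batch, d]

# Padded only to the longest sequence in the batch:
short = np.random.randn(batch, 100)
# short @ w  # shape mismatch -- so how does this multiply at all?
```

If the weight really is fixed at 512 on one side, I don’t see how the shorter input even multiplies against it without being padded back up, let alone why it would be faster.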