Or will it process them correctly since it uses relative attention? To put it differently: do the memory and processing power it uses depend on the actual longest sequence within the current batch? If so, I could speed up training by putting sequences of similar length together in a batch. Or will every batch be padded/truncated to the length specified in the config?
Great question! The 512 in T5’s config is a bit misleading, since it is not a hard limit. T5 was mostly trained with 512 input tokens; however, thanks to its use of relative attention it can handle much longer input sequences. This means that as you keep increasing the input length you won’t get the "index out of range" error in the positional embedding matrix that you would get with other models, but you will eventually get a CUDA out-of-memory error.
T5 does use “normal” (full) attention, meaning that memory consumption scales quadratically (n²) with the input length. So for T5 it makes a lot of sense to use padding/truncation and to try to build batches of sequences of similar length.
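To make the “batches of similar length” idea concrete, here is a minimal, framework-free sketch of length-grouped batching: sort example indices by length, slice into batches, and pad each batch only to its own maximum. The helper names (`bucket_by_length`, `padded_tokens`) are hypothetical, not a transformers API; in the Trainer the rough equivalent is setting `group_by_length=True`.

```python
from typing import List

def bucket_by_length(lengths: List[int], batch_size: int) -> List[List[int]]:
    """Group example indices into batches of similar length so that
    padding each batch to its own max wastes as few tokens as possible.
    Hypothetical helper for illustration only."""
    # Sort indices by sequence length, then slice into consecutive batches.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]

def padded_tokens(lengths: List[int], batches: List[List[int]]) -> int:
    """Total tokens processed when each batch is padded to its own max length."""
    return sum(max(lengths[i] for i in b) * len(b) for b in batches)

# Toy corpus mixing very short and ~512-token sequences.
lengths = [5, 512, 8, 500, 6, 490, 7, 505]

grouped = bucket_by_length(lengths, batch_size=4)   # short with short, long with long
naive = [[0, 1, 2, 3], [4, 5, 6, 7]]                # arbitrary order: lengths mixed

# Grouping shrinks the padded token count, and since attention cost grows
# quadratically with padded length, the compute saving is even larger.
print(padded_tokens(lengths, grouped), padded_tokens(lengths, naive))
```

With these toy lengths the grouped batches process 2080 padded tokens versus 4068 for the mixed batches; the exact saving depends on how skewed the length distribution is.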
Thank you! I also wonder whether TPU training supports this “group by length” trick. The docs say TPUs do not support dynamic shapes, and I guess when each batch has a different sequence-length dimension, that counts as dynamic.