Or will it process them correctly since it uses relative attention? To put it differently: do the memory and processing power it uses depend on the actual longest sequence within the current batch? If so, I could speed up training by putting sequences of similar length together in a batch. Or will every batch be padded/truncated to the length specified in the config?
Great question! The 512 in T5’s config is a bit misleading, since it is not a hard limit. T5 was mostly trained with 512 input tokens; however, thanks to its use of relative attention it can handle much longer input sequences. This means that as you keep increasing the input length you won’t get the "index out of range" error in the positional embedding matrix that you would get with other models, but you will eventually get a CUDA out-of-memory error.
T5 does use “normal” (full) attention, meaning that memory consumption scales quadratically (n²) with the input length. So for T5 it makes a lot of sense to use padding/truncation and to try to build batches of sequences of similar length.
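To make the “batches of similar length” idea concrete, here is a minimal, framework-free sketch of length-grouped batching: sort example indices by length, slice into batches, and pad each batch only to its own maximum. The helper names (`bucket_by_length`, `padded_tokens`) are hypothetical, not a transformers API; in the Trainer the rough equivalent is setting `group_by_length=True`.

```python
from typing import List

def bucket_by_length(lengths: List[int], batch_size: int) -> List[List[int]]:
    """Group example indices into batches of similar length so that
    padding each batch to its own max wastes as few tokens as possible.
    Hypothetical helper for illustration only."""
    # Sort indices by sequence length, then slice into consecutive batches.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]

def padded_tokens(lengths: List[int], batches: List[List[int]]) -> int:
    """Total tokens processed when each batch is padded to its own max length."""
    return sum(max(lengths[i] for i in b) * len(b) for b in batches)

# Toy corpus mixing very short and ~512-token sequences.
lengths = [5, 512, 8, 500, 6, 490, 7, 505]

grouped = bucket_by_length(lengths, batch_size=4)   # short with short, long with long
naive = [[0, 1, 2, 3], [4, 5, 6, 7]]                # arbitrary order: lengths mixed

# Grouping shrinks the padded token count, and since attention cost grows
# quadratically with padded length, the compute saving is even larger.
print(padded_tokens(lengths, grouped), padded_tokens(lengths, naive))
```

With these toy lengths the grouped batches process 2080 padded tokens versus 4068 for the mixed batches; the exact saving depends on how skewed the length distribution is.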
Thank you! I also wonder whether TPU training supports this “group by length” trick. The docs say TPUs do not support dynamic shapes, and I guess when each batch has a different sequence-length dimension, that counts as dynamic.