Hey there, I'm planning to pretrain T5 using the Hugging Face T5 pretraining script.
I'd be using the OSCAR 2201 version of the corpus. What's been bothering me is that most of the text samples in the corpus are well beyond 512 tokens, and T5 only has a sequence length of 512 (post-tokenization, at that). What happens to sequences that are longer? Does the pretraining script trim the long sequence and carry the remaining substring over into the next sample iteration, or does each sequence simply get cut down until it fits the model?
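For context, this is roughly the "carry-over" behavior I'm asking about, as a minimal sketch (the documents are placeholders, and the chunking logic here is just my own illustration, not necessarily what the script does):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-base")
max_length = 512

# Placeholder documents standing in for OSCAR samples.
docs = [
    "First long document ...",
    "Second long document ...",
]

# Tokenize without truncation so nothing is lost up front.
token_ids = []
for doc in docs:
    token_ids.extend(tokenizer(doc, truncation=False)["input_ids"])

# Concatenate everything and split the stream into 512-token chunks,
# so text past 512 tokens lands in later samples instead of being dropped.
# The leftover tail shorter than max_length is discarded here for simplicity.
total = (len(token_ids) // max_length) * max_length
chunks = [token_ids[i : i + max_length] for i in range(0, total, max_length)]

print(f"{len(chunks)} chunks of {max_length} tokens each")
```

Is this (concatenate and re-chunk) what the script effectively does, or does it just truncate each document at 512 and throw the rest away?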