Hey there, I'm planning to pretrain T5 using the Hugging Face T5 pretraining script.
I'd be using the OSCAR 2201 version of the corpus. What's been bothering me is that most of the text samples in the corpus are well beyond 512 tokens, and T5 only has a sequence length of 512 (post-tokenization, at that). What happens to sequences that are longer? Does the pretraining script trim the long sequence and carry the remaining substring over into the next sample iteration, or does each sequence simply get cut down until it fits the model?
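For context, this is roughly the "carry-over" behavior I'm asking about, as a minimal sketch (the documents are placeholders, and the chunking logic here is just my own illustration, not necessarily what the script does):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-base")
max_length = 512

# Placeholder documents standing in for OSCAR samples.
docs = [
    "First long document ...",
    "Second long document ...",
]

# Tokenize without truncation so nothing is lost up front.
token_ids = []
for doc in docs:
    token_ids.extend(tokenizer(doc, truncation=False)["input_ids"])

# Concatenate everything and split the stream into 512-token chunks,
# so text past 512 tokens lands in later samples instead of being dropped.
# The leftover tail shorter than max_length is discarded here for simplicity.
total = (len(token_ids) // max_length) * max_length
chunks = [token_ids[i : i + max_length] for i in range(0, total, max_length)]

print(f"{len(chunks)} chunks of {max_length} tokens each")
```

Is this (concatenate and re-chunk) what the script effectively does, or does it just truncate each document at 512 and throw the rest away?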