Confusion about how T5 is pretrained on the C4 dataset

f3rry · January 30, 2023, 10:28am

Looking at the C4 data instances shown at c4 · Datasets at Hugging Face, each data item seems to consist of several sentences. My question is whether these sentences are tokenized and fed into the T5 model together as one data item, or whether each instance should first be split into several sequences at punctuation marks, with each sequence serving as its own data item.
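
To make the two options concrete, here is a rough sketch in plain Python. It only illustrates the question and is not how T5 actually does it: the record text is made up, the function names are hypothetical, and whitespace splitting stands in for the real tokenizer.

```python
import re

# Made-up record in the shape of a C4 item (a dict with a "text" field).
record = {
    "text": "The class is taking place on Thursday! Do you want to join us? "
            "You will have the opportunity to learn a lot."
}


def tokenize(text):
    # Stand-in for a real subword tokenizer; splits on whitespace only.
    return text.split()


def as_single_item(record):
    # Option A: the whole document is tokenized as one data item.
    return [tokenize(record["text"])]


def as_sentence_items(record):
    # Option B: split after sentence-ending punctuation,
    # then tokenize each sentence as its own data item.
    sentences = re.split(r"(?<=[.!?])\s+", record["text"].strip())
    return [tokenize(s) for s in sentences if s]


option_a = as_single_item(record)
option_b = as_sentence_items(record)

print("Option A items:", len(option_a))
print("Option B items:", len(option_b))
```

Both options contain the same tokens overall; they differ only in where the boundaries between data items fall, and that boundary is what I am unsure about.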
