Confusion about how T5 is pretrained on the C4 dataset

Looking at the C4 data instances shown on the dataset page (c4 · Datasets at Hugging Face), each data item appears to consist of several sentences. My question is whether these sentences are tokenized and fed into the T5 model together as a single data item, or whether each instance should first be split into separate sequences at punctuation marks, with each sequence serving as its own data item.
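To make the two options concrete, here is a minimal sketch of what I mean. It is only an illustration: the whitespace tokenizer stands in for a real subword tokenizer, the function names `whole_document_items` and `per_sentence_items` and the `max_len` value are made up for this example, and the sample text is invented rather than taken from C4. It does not claim to show what the T5 pipeline actually does.

```python
import re


def whitespace_tokenize(text):
    # Assumption: a plain whitespace split stands in for a real subword tokenizer.
    return text.split()


def whole_document_items(document, max_len=8):
    # Option 1: tokenize the whole instance, then cut the token stream into
    # fixed-length pieces, ignoring sentence boundaries.
    tokens = whitespace_tokenize(document)
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]


def per_sentence_items(document):
    # Option 2: split on sentence-ending punctuation first, so that each
    # sentence becomes its own data item.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", document.strip()) if s]
    return [whitespace_tokenize(s) for s in sentences]


# Invented sample text with three sentences (not an actual C4 record).
doc = (
    "Beginners BBQ class is taking place. "
    "Do you want to get better? "
    "You will learn new skills."
)

for name, items in [
    ("whole document", whole_document_items(doc)),
    ("per sentence", per_sentence_items(doc)),
]:
    print(name, [len(item) for item in items])
```

With option 1 an item can start or end in the middle of a sentence, while with option 2 every item is exactly one sentence and the lengths vary a lot, which is the difference I am asking about.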