Prepare data for pretraining T5 model

Hi, I read the T5 model overview and saw that T5 can be pretrained on a language modeling task using the following code:

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

input_ids = tokenizer('The <extra_id_0> walks in <extra_id_1> park', return_tensors='pt').input_ids
labels = tokenizer('<extra_id_0> cute dog <extra_id_1> the <extra_id_2>', return_tensors='pt').input_ids
# the forward function automatically creates the correct decoder_input_ids
loss = model(input_ids=input_ids, labels=labels).loss

However, I’m confused about how to prepare my own text data before doing unsupervised training on T5.
If I have a .txt file, how should I transform my text into the input_ids/labels pairs shown in the code above, with the sentinel tokens used as masks (e.g. ‘The <extra_id_0> walks in <extra_id_1> park’)?
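
To make the question concrete, here is a rough sketch of the kind of transformation I have in mind. It masks random word spans with whitespace-level splitting rather than reproducing T5’s exact token-level span corruption, and the corrupt_text name and the 15% mask rate are just my own simplifications. Is this roughly the right idea?

import random
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")

def corrupt_text(text, mask_prob=0.15):
    # Pick ~15% of word positions at random and merge adjacent picks into spans.
    words = text.split()
    n_mask = max(1, round(len(words) * mask_prob))
    positions = sorted(random.sample(range(len(words)), n_mask))
    spans = []
    for pos in positions:
        if spans and pos == spans[-1][-1] + 1:
            spans[-1].append(pos)
        else:
            spans.append([pos])
    # T5 has 100 sentinel tokens (<extra_id_0> ... <extra_id_99>),
    # so the number of spans per example must stay below 100.
    input_words, label_words = [], []
    prev_end = 0
    for sentinel, span in enumerate(spans):
        # Replace each masked span with one sentinel in the input,
        # and emit the sentinel followed by the dropped words in the labels.
        input_words += words[prev_end:span[0]] + [f"<extra_id_{sentinel}>"]
        label_words += [f"<extra_id_{sentinel}>"] + [words[i] for i in span]
        prev_end = span[-1] + 1
    input_words += words[prev_end:]
    label_words.append(f"<extra_id_{len(spans)}>")  # closing sentinel, as in the docs example
    return " ".join(input_words), " ".join(label_words)

corrupted, target = corrupt_text("The cute dog walks in the park")
input_ids = tokenizer(corrupted, return_tensors="pt").input_ids
labels = tokenizer(target, return_tensors="pt").input_ids

For example, this might turn ‘The cute dog walks in the park’ into the pair ‘The <extra_id_0> walks in <extra_id_1> park’ / ‘<extra_id_0> cute dog <extra_id_1> the <extra_id_2>’, matching the format in the snippet above.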

Also, I tried the example T5 MLM code provided here, but there are some issues with my CUDA driver when running code with JAX as the backend, so training only starts on the CPU instead of the GPU. So I am looking for a way to prepare the data and do the pretraining in PyTorch.
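
What I imagine is something along these lines, a minimal PyTorch sketch assuming one document per non-empty line of the file and the corrupt_text helper sketched above (my_corpus.txt and LineByLineDataset are placeholder names I made up, and the batch size, max length, and learning rate are arbitrary):

import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import Dataset, DataLoader
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")

class LineByLineDataset(Dataset):
    # One training example per non-empty line of the .txt file.
    def __init__(self, path):
        with open(path, encoding="utf-8") as f:
            self.lines = [line.strip() for line in f if line.strip()]

    def __len__(self):
        return len(self.lines)

    def __getitem__(self, idx):
        corrupted, target = corrupt_text(self.lines[idx])  # helper sketched above
        input_ids = tokenizer(corrupted, truncation=True, max_length=512).input_ids
        labels = tokenizer(target, truncation=True, max_length=512).input_ids
        return torch.tensor(input_ids), torch.tensor(labels)

def collate(batch):
    # Pad inputs with pad_token_id; pad labels with -100 so padding is ignored by the loss.
    inputs, labels = zip(*batch)
    input_ids = pad_sequence(inputs, batch_first=True, padding_value=tokenizer.pad_token_id)
    labels = pad_sequence(labels, batch_first=True, padding_value=-100)
    attention_mask = (input_ids != tokenizer.pad_token_id).long()
    return input_ids, attention_mask, labels

device = "cuda" if torch.cuda.is_available() else "cpu"
model = T5ForConditionalGeneration.from_pretrained("t5-small").to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loader = DataLoader(LineByLineDataset("my_corpus.txt"), batch_size=8,
                    shuffle=True, collate_fn=collate)

model.train()
for input_ids, attention_mask, labels in loader:
    # As in the snippet above, passing labels makes the forward pass
    # build decoder_input_ids and return the loss.
    loss = model(input_ids=input_ids.to(device),
                 attention_mask=attention_mask.to(device),
                 labels=labels.to(device)).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

Does this match how the data is supposed to be prepared, or is there a ready-made PyTorch equivalent of the JAX data collator I should use instead?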

Thanks for helping, really appreciate your time
Michael

Did you solve your problem? I have the same question.