Hi, I read the T5 model overview and saw that T5 can be pretrained on an unsupervised language modeling task using the following code:
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# input: text with the corrupted spans replaced by sentinel tokens
input_ids = tokenizer('The <extra_id_0> walks in <extra_id_1> park', return_tensors='pt').input_ids
# labels: the dropped-out spans, each prefixed by its sentinel token
labels = tokenizer('<extra_id_0> cute dog <extra_id_1> the <extra_id_2>', return_tensors='pt').input_ids

# the forward function automatically creates the correct decoder_input_ids
loss = model(input_ids=input_ids, labels=labels).loss
However, I'm confused about how to prepare my own text data for this unsupervised objective. If I have a plain .txt file, how should I transform the text into the input_ids/labels pairs shown above, with the sentinel tokens acting as masks (e.g. 'The <extra_id_0> walks in <extra_id_1> park')?
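To make the question concrete, here is my naive attempt at span corruption at the word level (the function name corrupt_spans and its defaults are my own invention; as far as I understand, the real T5 preprocessing works on token ids and samples span lengths from a distribution, so please treat this as a rough sketch):

import random

def corrupt_spans(text, corruption_rate=0.15, span_len=3):
    # naive word-level approximation of T5 span corruption:
    # drop ~corruption_rate of the words in fixed-length spans,
    # replace each dropped span with a sentinel in the input, and
    # list each sentinel plus the words it hid in the target
    words = text.split()
    num_to_mask = max(1, round(len(words) * corruption_rate))
    num_spans = max(1, round(num_to_mask / span_len))
    starts = sorted(random.sample(range(len(words)), num_spans))
    input_parts, target_parts = [], []
    cursor, sentinel = 0, 0
    for start in starts:
        if start < cursor:  # skip spans that would overlap
            continue
        end = min(start + span_len, len(words))
        input_parts.extend(words[cursor:start])
        input_parts.append(f"<extra_id_{sentinel}>")
        target_parts.append(f"<extra_id_{sentinel}>")
        target_parts.extend(words[start:end])
        sentinel += 1
        cursor = end
    input_parts.extend(words[cursor:])
    target_parts.append(f"<extra_id_{sentinel}>")  # closing sentinel
    return " ".join(input_parts), " ".join(target_parts)

which I would then tokenize like in the docs example:

masked, target = corrupt_spans('The cute dog walks in the park')
input_ids = tokenizer(masked, return_tensors='pt').input_ids
labels = tokenizer(target, return_tensors='pt').input_ids

Is this roughly the right idea?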
Also, I tried the example T5MLM code provided here, but there are some issues with my CUDA driver when running code with JAX as the backend, so training could only be started on the CPU instead of the GPU. So I am looking for a way to prepare the data and do the pretraining in PyTorch.
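For reference, this is roughly the PyTorch setup I would like to end up with (an untested sketch; SpanCorruptionDataset, corpus.txt, and the hyperparameters are placeholders of mine, reusing corrupt_spans from above):

import torch
from torch.utils.data import Dataset, DataLoader

class SpanCorruptionDataset(Dataset):
    # wraps a plain .txt file (one example per line) and applies
    # corrupt_spans from above on the fly
    def __init__(self, path, tokenizer, max_len=128):
        with open(path) as f:
            self.lines = [l.strip() for l in f if l.strip()]
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.lines)

    def __getitem__(self, idx):
        masked, target = corrupt_spans(self.lines[idx])
        enc = self.tokenizer(masked, truncation=True, padding="max_length",
                             max_length=self.max_len, return_tensors="pt")
        labels = self.tokenizer(target, truncation=True, padding="max_length",
                                max_length=self.max_len, return_tensors="pt").input_ids
        labels[labels == self.tokenizer.pad_token_id] = -100  # ignore padding in the loss
        return {"input_ids": enc.input_ids.squeeze(0),
                "attention_mask": enc.attention_mask.squeeze(0),
                "labels": labels.squeeze(0)}

loader = DataLoader(SpanCorruptionDataset("corpus.txt", tokenizer),
                    batch_size=8, shuffle=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
model.train()
# (I would move the model and batches to "cuda" here once the driver works)
for batch in loader:
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

Is this a reasonable way to go, or is there an existing PyTorch data preparation utility for this objective?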
Thanks for helping, I really appreciate your time.
Michael