Hi, I read the T5 model overview and saw that T5 can be pretrained on an unsupervised language modeling task using the following code:
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# input: text with the corrupted spans replaced by sentinel tokens
input_ids = tokenizer('The <extra_id_0> walks in <extra_id_1> park', return_tensors='pt').input_ids
# labels: the dropped-out spans, each prefixed by its sentinel token
labels = tokenizer('<extra_id_0> cute dog <extra_id_1> the <extra_id_2>', return_tensors='pt').input_ids

# the forward function automatically creates the correct decoder_input_ids
loss = model(input_ids=input_ids, labels=labels).loss
However, I'm confused about how to prepare my own text data for this unsupervised objective. If I have a plain .txt file, how should I transform the text into the input_ids/labels pairs shown above, with the sentinel tokens acting as masks (e.g. 'The <extra_id_0> walks in <extra_id_1> park')?
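To make the question concrete, here is my naive attempt at span corruption at the word level (the function name corrupt_spans and its defaults are my own invention; as far as I understand, the real T5 preprocessing works on token ids and samples span lengths from a distribution, so please treat this as a rough sketch):

import random

def corrupt_spans(text, corruption_rate=0.15, span_len=3):
    # naive word-level approximation of T5 span corruption:
    # drop ~corruption_rate of the words in fixed-length spans,
    # replace each dropped span with a sentinel in the input, and
    # list each sentinel plus the words it hid in the target
    words = text.split()
    num_to_mask = max(1, round(len(words) * corruption_rate))
    num_spans = max(1, round(num_to_mask / span_len))
    starts = sorted(random.sample(range(len(words)), num_spans))
    input_parts, target_parts = [], []
    cursor, sentinel = 0, 0
    for start in starts:
        if start < cursor:  # skip spans that would overlap
            continue
        end = min(start + span_len, len(words))
        input_parts.extend(words[cursor:start])
        input_parts.append(f"<extra_id_{sentinel}>")
        target_parts.append(f"<extra_id_{sentinel}>")
        target_parts.extend(words[start:end])
        sentinel += 1
        cursor = end
    input_parts.extend(words[cursor:])
    target_parts.append(f"<extra_id_{sentinel}>")  # closing sentinel
    return " ".join(input_parts), " ".join(target_parts)

which I would then tokenize like in the docs example:

masked, target = corrupt_spans('The cute dog walks in the park')
input_ids = tokenizer(masked, return_tensors='pt').input_ids
labels = tokenizer(target, return_tensors='pt').input_ids

Is this roughly the right idea?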
Also, I tried the example T5MLM code provided here, but there are some issues with my CUDA driver when running code with JAX as the backend, so training could only be started on the CPU instead of the GPU. So I am looking for a way to prepare the data and do the pretraining in PyTorch.
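For reference, this is roughly the PyTorch setup I would like to end up with (an untested sketch; SpanCorruptionDataset, corpus.txt, and the hyperparameters are placeholders of mine, reusing corrupt_spans from above):

import torch
from torch.utils.data import Dataset, DataLoader

class SpanCorruptionDataset(Dataset):
    # wraps a plain .txt file (one example per line) and applies
    # corrupt_spans from above on the fly
    def __init__(self, path, tokenizer, max_len=128):
        with open(path) as f:
            self.lines = [l.strip() for l in f if l.strip()]
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.lines)

    def __getitem__(self, idx):
        masked, target = corrupt_spans(self.lines[idx])
        enc = self.tokenizer(masked, truncation=True, padding="max_length",
                             max_length=self.max_len, return_tensors="pt")
        labels = self.tokenizer(target, truncation=True, padding="max_length",
                                max_length=self.max_len, return_tensors="pt").input_ids
        labels[labels == self.tokenizer.pad_token_id] = -100  # ignore padding in the loss
        return {"input_ids": enc.input_ids.squeeze(0),
                "attention_mask": enc.attention_mask.squeeze(0),
                "labels": labels.squeeze(0)}

loader = DataLoader(SpanCorruptionDataset("corpus.txt", tokenizer),
                    batch_size=8, shuffle=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
model.train()
# (I would move the model and batches to "cuda" here once the driver works)
for batch in loader:
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

Is this a reasonable way to go, or is there an existing PyTorch data preparation utility for this objective?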
Thanks for helping, I really appreciate your time.
Michael