Hi folks,
I am a newbie to T5 and transformers in general, so apologies in advance for any stupidity or incorrect assumptions on my part!
I am trying to put together an example of fine-tuning the T5 model on a custom dataset for a custom task. I have the “How to fine-tune a model on summarization” example notebook working, but that example uses a ready-made HF dataset fetched via “load_dataset()”, not a custom dataset loaded from disk. So I want to combine that example with the guidance given in “Fine-tuning with custom datasets”, but using T5 rather than the DistilBert shown in that fine-tuning example.
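To illustrate the difference I mean (a sketch only; the “xsum” dataset name and the toxic_train.csv file are placeholders, not my real data):

```python
from datasets import load_dataset
import pandas as pd

# What the summarization notebook does: fetch a ready-made HF dataset
raw_datasets = load_dataset("xsum")

# What I need instead: my own data from disk, e.g. a CSV with text/summary columns
df = pd.read_csv("toxic_train.csv")
train_texts = df["text"].tolist()
train_summaries = df["summary"].tolist()
```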
I think my main problem is knowing how to construct a dataset object that the pre-trained T5 model can consume. So here is my use of the tokenizer and my attempt at formatting the tokenized sequences into datasets:
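(This is a simplified sketch of what I have; “t5-small”, the TrainingArguments values and the train_texts / train_summaries lists from above are stand-ins for my actual setup.)

```python
import torch
from transformers import (T5ForConditionalGeneration, T5Tokenizer,
                          Trainer, TrainingArguments)

tokenizer = T5Tokenizer.from_pretrained("t5-small")

# Tokenize the source texts and the target summaries separately
input_encodings = tokenizer(train_texts, truncation=True, padding=True)
target_encodings = tokenizer(train_summaries, truncation=True, padding=True)

class ToxicDataset(torch.utils.data.Dataset):
    def __init__(self, inputs, targets):
        self.inputs = inputs
        self.targets = targets

    def __len__(self):
        return len(self.inputs["input_ids"])

    def __getitem__(self, idx):
        # Key names copied from the "Fine Tuning Transformer for
        # Summary Generation" post mentioned below
        return {
            "input_ids": torch.tensor(self.inputs["input_ids"][idx]),
            "input_mask": torch.tensor(self.inputs["attention_mask"][idx]),
            "output_ids": torch.tensor(self.targets["input_ids"][idx]),
            "output_mask": torch.tensor(self.targets["attention_mask"][idx]),
        }

train_dataset = ToxicDataset(input_encodings, target_encodings)

model = T5ForConditionalGeneration.from_pretrained("t5-small")
training_args = TrainingArguments(output_dir="./results", num_train_epochs=1)
trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()
```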
But I get the following error back when I call trainer.train():
I have seen the post “Defining a custom dataset for fine-tuning translation”, but the solution offered there seems to be to write your own custom Dataset loading class rather than a direct fix for the problem. I can try to learn to do this, but it would be great to get this working the same way as “Fine-tuning with custom datasets”, just for the T5 model I want to use.
I also found “Fine Tuning Transformer for Summary Generation”, which is where I got the idea to change the __getitem__ method of my ToxicDataset class to return “input_ids”, “input_mask”, “output_ids” and “output_mask”, but I am really just guessing; I can’t find any documentation of which keys are actually needed (sorry!).
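For what it’s worth, my best guess from skimming the T5ForConditionalGeneration docs is that __getitem__ should return the model’s standard forward-argument names instead, something like this (an untested guess on my part):

```python
    def __getitem__(self, idx):
        # Standard T5ForConditionalGeneration argument names (my guess)
        return {
            "input_ids": torch.tensor(self.inputs["input_ids"][idx]),
            "attention_mask": torch.tensor(self.inputs["attention_mask"][idx]),
            "labels": torch.tensor(self.targets["input_ids"][idx]),
            "decoder_attention_mask": torch.tensor(self.targets["attention_mask"][idx]),
        }
```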
Any help or pointers to what I need would be very much appreciated!