Finetuning T5 for a task

In the T5 paper, I noticed that the inputs to the model always have a prefix (ex. “summarize: …” or “translate English to German: …”). When I finetune a T5 model, can I use any phrase/word that I want as a prefix, or can T5 only understand a specific predefined list of prefixes?


T5 has only been trained on a specific set of prefixes. You can find a list here: (starting at page 47)

That said, you can just finetune without a prefix (or with a custom prefix) and it should still work out.


Thank you so much for your reply. If I weren’t using a prefix and wanted to pass two sentences as input to the model during training, how would I format the input string?

@hugomontenegro Also, is there a maximum input length for a T5 model?

What exactly is your use case? What’s the desired output for the two sentences? Perhaps just concatenating them with a separator would be sufficient.

As for input length, it’s unconstrained in principle: T5 can take in an arbitrary sequence length, but memory requirements still apply. The memory used by self-attention scales quadratically with the input sequence length, so you’ll quickly run out of memory.
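To make the quadratic scaling concrete, here is a rough back-of-the-envelope sketch. It only counts the attention score matrices of a single layer for a single sequence; the head count of 12 and 4 bytes per value are illustrative assumptions, and weights, other activations, and optimizer state are ignored entirely.

```python
def attention_score_bytes(seq_len: int, num_heads: int = 12, bytes_per_value: int = 4) -> int:
    """Bytes for one layer's attention score matrices for one sequence.

    Each head stores a seq_len x seq_len matrix of scores.
    num_heads and bytes_per_value are illustrative assumptions.
    """
    return num_heads * seq_len * seq_len * bytes_per_value


for n in (512, 1024, 2048):
    mib = attention_score_bytes(n) / 2**20
    print(f"{n:>5} tokens: {mib:.0f} MiB per layer")
```

Doubling the sequence length quadruples this number, which is why long inputs exhaust memory so quickly.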


@hugomontenegro For example, if I am trying to predict a paraphrase given a context paragraph and a sentence to be paraphrased, how would I accomplish this (I am trying to input a “context” and a “sentence” and output a “paraphrase”)? Does spacing between the two parts of the input matter?

The whole point of the T5 paper was to show that, purely by prepending a prefix, multiple distinct tasks could be handled by the same model architecture at close to SOTA levels.

That leads us to your question: can your problem be done with T5? The answer is yeah, probably.

As to how to format the input for this task I’d probably try the following:

If we have the following input:
Input: {'context': 'food topics', 'sentence': 'sushi is a great dessert'}

Then I’d convert it into the following:
Processed Input: f"summarize: context: {context}; sentence: {sentence}"
(So: f"summarize: context: food topics; sentence: sushi is a great dessert")

The target is of course your paraphrase.

This way you separate context and sentence for the model, a separation which it should eventually learn with enough training examples. Also, I’ve reused the “summarize” keyword from T5, since it is vaguely similar to this task and might help a bit (especially initially).
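As a minimal sketch of the formatting described above, the helper below builds the input string from a context and a sentence. The function name and the `prefix` parameter are hypothetical, introduced only for illustration; the string template is the one from this post.

```python
def format_paraphrase_input(context: str, sentence: str, prefix: str = "summarize") -> str:
    """Combine context and sentence into one T5 input string.

    Hypothetical helper: the template follows the post above,
    with the task prefix made configurable.
    """
    return f"{prefix}: context: {context}; sentence: {sentence}"


example = {"context": "food topics", "sentence": "sushi is a great dessert"}
source_text = format_paraphrase_input(**example)
print(source_text)
```

The resulting string is what gets tokenized as the model input; the paraphrase is tokenized separately as the target.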

Anyways this should work given enough training examples. Good Luck.

Thank you!

I have preprocessed my dataset into the form {'input_ids': **tokenized ids of input**, 'attention_mask': **attention mask of input**, 'decoder_input_ids': **tokenized ids of output**, 'decoder_attention_mask': **attention mask of output**, 'labels': **tokenized ids of output**}, and ended up with a list of dictionaries in the above format.

However, when I pass this list of dictionaries to the Trainer class as the train_dataset, and call trainer.train(), I get the following error:

ValueError: too many values to unpack (expected 2)

Can you please give me advice on how to fix this? (Sorry for bombarding you with so many questions)

Sorry, I don’t have the time to help with debugging, and you’re better served anyways by going through the Hugging Face docs and adapting/understanding the code from a few examples.

In particular, these two links should be helpful:

and also this:

Take a look at those and adapt the code to your needs (especially preprocessing part).
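The thread never states what actually caused the ValueError above. One possible cause, offered purely as an assumption, is that each stored example carries an extra leading batch dimension (for instance when the tokenizer is called with return_tensors="pt" on one example at a time), so the collated batch ends up with three dimensions where the model expects two (batch, sequence). The sketch below uses plain Python lists and hypothetical helper names to show how such an extra dimension could be removed during preprocessing.

```python
def squeeze_leading_batch_dim(ids):
    """Turn [[1, 2, 3]] into [1, 2, 3]; leave already-flat lists untouched."""
    if len(ids) == 1 and isinstance(ids[0], (list, tuple)):
        return list(ids[0])
    return list(ids)


def flatten_example(example):
    """Apply the squeeze to every field of one preprocessed example (hypothetical helper)."""
    return {key: squeeze_leading_batch_dim(value) for key, value in example.items()}


# Illustrative token ids, not real tokenizer output.
nested = {
    "input_ids": [[100, 19, 3, 1]],
    "attention_mask": [[1, 1, 1, 1]],
    "labels": [[71, 1]],
}
flat = flatten_example(nested)
print(flat)
```

If the shapes of your own examples already look flat, this is not the cause and the linked examples are the better place to compare against.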


@hugomontenegro Thanks so much for the links. I was able to get it working! :)


Are the results after training any good? Interesting use case frankly. I’ve never seen anyone use NLP to paraphrase!

Overall, the model performs relatively well. I am still trying to find other paraphrasing datasets, to make my model more robust against edge cases.


If anyone is curious, it is possible to invent/add a new prefix yourself for new tasks. I’ve done so in cases where I had a lot of data, so I’m not sure how well it will work with smaller datasets. It’s unclear how well the model transfers knowledge from the other tasks when you do this, but my guess is that it’s a lot better than starting from scratch: the ability to parse the input text and build basic representations of it is still helpful for your task. Interestingly, the model was still able to use the original prefixes and do translation etc. fairly well after training was completed on my large dataset containing only the new prefix.

The recommendation to reuse the summarization prefix is probably a good thing to try. It would be interesting to see the results of reusing it versus adding a new prefix instead.

@Rbaten How many samples would you estimate that the dataset would need to be able to learn a new prefix?

My dataset had about 100k examples, if I remember correctly. You can most likely get away with a lot less without a big trade-off in performance if you tune correctly, but more data is almost always better. Maybe on the order of a couple hundred or a couple thousand, depending on the complexity of the task?

Note that training details and how closely you structure your inputs to match what was seen during pretraining start to have a larger impact when you go down to small dataset sizes. If you have a really small dataset and your task is similar enough to summarization, that’s when you may see some lift from using the existing prompt. There was a paper by Hugging Face on prompts and data efficiency during fine-tuning a while back. IMO, try both ways and see what works best. I’d be interested in hearing any results you come up with.

@Rbaten My dataset has 80K samples, but there is one part for the input and one for the output (there is a paragraph passed as input, and a paragraph received as output). For this scenario, do I even need to use a prompt/prefix with T5?

I tried to train without a prefix at all at first, and T5 didn’t seem to handle that too well. Not that it didn’t work at all; it just didn’t work nearly as well for me as when I added the prefix. It seems to expect to parse out a prefix and base the rest of what it does fairly heavily on that.

Would suggest doing:

input ids = (Your custom prefix here): (input)
labels = (output)

You can play with different prefixes. If the prefix describes the task (paraphrase), you may get better results earlier in training, but your dataset is large enough that I think you can get away with using anything. As long as it’s consistent within the training run and you train long enough, it should work and produce similar results.
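The suggestion above can be sketched as follows. This is a minimal illustration, not a full training pipeline: the record keys "input" and "output", the helper name, and the sample text are all hypothetical, and tokenization is left out so the snippet runs without downloading a model. The "source" string is what would be tokenized into input ids, and the "target" string is what would be tokenized into labels.

```python
def build_pairs(records, prefix="paraphrase"):
    """Prepend the same prefix to every input so it stays consistent across the run.

    `records` is assumed to be a list of dicts with hypothetical keys
    "input" and "output".
    """
    pairs = []
    for record in records:
        pairs.append({
            "source": f"{prefix}: {record['input']}",  # to be tokenized into input ids
            "target": record["output"],                # to be tokenized into labels
        })
    return pairs


# Made-up sample data for illustration only.
records = [
    {"input": "The cat sat on the mat.", "output": "A cat was sitting on the mat."},
    {"input": "It rained all day.", "output": "Rain fell the whole day."},
]
pairs = build_pairs(records)
print(pairs[0]["source"])
```

Keeping the prefix in one place like this makes it easy to compare a descriptive prefix against an arbitrary one across otherwise identical runs.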

Thanks @Rbaten for all your help! I also trained a model for key-phrase extraction by passing it an input paragraph and training it to output the same paragraph, but with the key phrases surrounded by ‘|||’ (ex. |||George Washington||| was a president). The model appeared to actually learn (the training and validation loss went down), but when I try to make a prediction with the model, it just returns the same paragraph but truncated (ex. George Washington was). I don’t think this has anything to do with the max_length parameter, since the input was much shorter than the max_length. Do you have any idea why this is happening?

edit: I was able to solve this issue. Thanks for all your help anyways!
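For the ‘|||’ output format described above, a small post-processing step can recover the key phrases from the generated text. This is only a sketch: the helper names are hypothetical, and it assumes markers always come in matched pairs within the text.

```python
import re

MARKER = "|||"
# Non-greedy match of the text between one opening and one closing marker.
_PATTERN = re.compile(re.escape(MARKER) + r"(.+?)" + re.escape(MARKER))


def extract_key_phrases(marked_text: str) -> list:
    """Return the phrases wrapped in markers, in order of appearance."""
    return [phrase.strip() for phrase in _PATTERN.findall(marked_text)]


def strip_markers(marked_text: str) -> str:
    """Recover the plain paragraph by removing every marker."""
    return marked_text.replace(MARKER, "")


marked = "|||George Washington||| was a president"
phrases = extract_key_phrases(marked)
plain = strip_markers(marked)
print(phrases, plain)
```

Comparing the stripped output against the original input is also a quick way to spot truncated generations.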