Fine-tuning T5 for a task

In the T5 paper, I noticed that the inputs to the model always have a prefix (e.g., “summarize: …” or “translate English to German: …”). When I fine-tune a T5 model, can I use any phrase/word I want as a prefix, or can T5 only understand a specific predefined list of prefixes?


T5 has only been trained on a specific set of prefixes. You can find the list here:
https://arxiv.org/pdf/1910.10683.pdf (starting at page 47)

That said, you can just fine-tune without a prefix (or with a custom prefix) and it should still work out.


Thank you so much for your reply. If I don’t use a prefix and want to pass two sentences as input to the model during training, how would I format the input string?

@hugomontenegro Also, is there a maximum input length for a T5 model?

What exactly is your use case? What’s the desired output for the two sentences? Perhaps just concatenating them with a separator might be sufficient.

As for input length, it’s architecturally unconstrained: T5 can take an arbitrary sequence length. However, memory requirements still apply, and self-attention memory scales quadratically with input length, so you’ll run out of it quickly.
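
In practice you’d usually cap the length when tokenizing. A minimal sketch (the 512 cap is just a common choice, not a hard limit of the architecture):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")

# Truncate long inputs so self-attention memory stays manageable.
# 512 tokens is just a common choice; T5's relative position buckets
# don't impose a hard architectural limit.
enc = tokenizer(
    "summarize: " + "a very long document ... " * 200,
    max_length=512,
    truncation=True,
    return_tensors="pt",
)
print(enc.input_ids.shape)  # at most (1, 512)
```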

Cheers

@hugomontenegro For example, if I am trying to predict a paraphrase given a context paragraph and a sentence to be paraphrased, how would I accomplish this (I am trying to input a “context” and a “sentence” and output a “paraphrase”)? Does spacing between the two parts of the input matter?

The whole point of the T5 paper was to show that, purely by prepending a prefix, multiple distinct tasks could be handled by the same model architecture at close-to-SOTA levels.

That leads us to your question: can your problem be done with T5? The answer is yeah, probably.

As to how to format the input for this task, I'd probably try the following:

If we have the following input:
Input: {'context': 'food topics', 'sentence': 'sushi is a great dessert'}

Then I'd convert it into the following:
Processed Input: f"summarize: context: {context}; sentence: {sentence}"
(So: "summarize: context: food topics; sentence: sushi is a great dessert")

The target is of course your paraphrase.

This way you separate context and sentence for the model, a separation which it should eventually learn with enough training examples. Also, I’ve reused the “summarize” keyword from T5, since it is vaguely similar to this task and might help a bit (especially initially).
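
To make that concrete, a minimal sketch of the conversion (the field names and the reuse of "summarize:" are just my suggestion above, not anything T5 requires):

```python
def build_t5_input(example):
    # Flatten a {'context': ..., 'sentence': ...} example into one source
    # string; any consistent prefix/separator scheme should work given
    # enough training examples.
    return f"summarize: context: {example['context']}; sentence: {example['sentence']}"

example = {"context": "food topics", "sentence": "sushi is a great dessert"}
print(build_t5_input(example))
# summarize: context: food topics; sentence: sushi is a great dessert
```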

Anyway, this should work given enough training examples. Good luck.


Thank you!

@hugomontenegro
I’ve preprocessed my dataset into the form {'input_ids': **tokenized ids of input**, 'attention_mask': **attention mask of input**, 'decoder_input_ids': **tokenized ids of output**, 'decoder_attention_mask': **attention mask of output**, 'labels': **tokenized ids of output**}, and ended up with a list of dictionaries in that format.

However, when I pass this list of dictionaries to the Trainer class as the train_dataset, and call trainer.train(), I get the following error:

ValueError: too many values to unpack (expected 2)

Can you please give me advice on how to fix this? (Sorry for bombarding you with so many questions)

Sorry, I don’t have the time to help with debugging, and you’re better served anyway by going through the Hugging Face docs and adapting/understanding the code from a few examples.

In particular, these two links should be helpful:

and also this:

Take a look at those and adapt the code to your needs (especially the preprocessing part).
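
For reference, the preprocessing in those examples is roughly along these lines (a minimal sketch; the `source`/`target` column names are placeholders for your own fields):

```python
from datasets import Dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, DataCollatorForSeq2Seq

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# Placeholder example; use your own source/target pairs.
raw = Dataset.from_list([
    {"source": "summarize: context: food topics; sentence: sushi is a great dessert",
     "target": "sushi makes a wonderful dessert"},
])

def preprocess(batch):
    # Tokenize inputs and targets; the tokenized targets become the labels.
    model_inputs = tokenizer(batch["source"], max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["target"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = raw.map(preprocess, batched=True, remove_columns=raw.column_names)

# The collator pads each batch and derives decoder_input_ids from the labels,
# so you don't need to build decoder inputs or masks yourself.
collator = DataCollatorForSeq2Seq(tokenizer, model=model)
```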

Cheers

@hugomontenegro Thanks so much for the links. I was able to get it working! 🙂

Awesome!

Are the results after training any good? Interesting use case frankly. I’ve never seen anyone use NLP to paraphrase!

Overall, the model performs relatively well. I am still trying to find other paraphrasing datasets, to make my model more robust against edge cases.


If anyone is curious, it is possible to invent/add a new prefix yourself for new tasks. I’ve done so in cases where I had a lot of data, so I’m not sure how well it will work with smaller datasets. It’s unclear how well it transfers knowledge from the other tasks when you do this, but my guess is it’s a lot better than starting from scratch: parsing and building basic representations of the input text is still helpful for achieving your task. Interestingly, the model was still able to use the original prefixes and do translation etc. fairly well after training was completed on my large dataset containing only the new prefix.

The recommendation to reuse the summarization prefix is probably a good thing to try; it would be interesting to see the results of reusing it vs. not reusing it and adding a new prefix instead.


@Rbaten How many samples would you estimate that the dataset would need to be able to learn a new prefix?

My dataset had about 100k examples, if I remember correctly. You can most likely get away with a lot less without a big trade-off in performance if you tune correctly, but more data is almost always better. Maybe on the order of a few hundred or a few thousand examples, depending on the complexity of the task?

Note that details related to training, and to how closely your inputs are structured like what is seen during pretraining, start to have a larger impact when you go down to small dataset sizes. If you have a really small dataset and your task is similar enough to summarization, that’s when you may see some lift from reusing the existing prompt. There was a paper by Hugging Face on prompts and data efficiency during fine-tuning a while back. IMO, try both ways and see what works best; I’d be interested in hearing any results you come up with.

@Rbaten My dataset has 80K samples, but there is one part for the input and one for the output (there is a paragraph passed as input, and a paragraph received as output). For this scenario, do I even need to use a prompt/prefix with T5?

I tried to train without a prefix at all at first, and T5 didn’t seem to handle that too well. Not that it didn’t work at all, it just didn’t work nearly as well for me as when I added the prefix. The model seems to expect to parse out a prefix and base the rest of what it does fairly heavily on that.

Would suggest doing:

input ids = (Your custom prefix here): (input)
labels = (output)

You can play with different prefixes. If the prefix describes the task (e.g. “paraphrase”), you may get better results earlier in training, but your data is large enough that I think you can get away with using almost anything; as long as it’s consistent within the training run and you train long enough, it should work and produce similar results.
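
As a sketch of that scheme (“paraphrase:” is just an example prefix, and the field names are placeholders):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")
prefix = "paraphrase: "  # any consistent custom prefix should do

def encode(example):
    # input_ids = prefix + input paragraph, labels = output paragraph
    model_inputs = tokenizer(prefix + example["input_text"],
                             max_length=512, truncation=True)
    labels = tokenizer(text_target=example["output_text"],
                       max_length=512, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

print(encode({"input_text": "The weather was terrible yesterday.",
              "output_text": "Yesterday's weather was awful."}))
```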

Thanks @Rbaten for all your help! I also trained a model for key-phrase extraction by passing it an input paragraph and training it to output the same paragraph, but with the key phrases surrounded by '|||' (e.g., |||George Washington||| was a president). The model appeared to actually learn (the training and validation loss went down), but when I try to make a prediction with the model, it just returns the same paragraph, truncated (e.g., George Washington was). I don’t think this has anything to do with the max_length parameter, since the input was much shorter than max_length. Do you have any idea why this is happening?

edit: I was able to solve this issue. Thanks for all your help anyways!