T5 for conditional generation: getting started

  1. You can use whatever format works well for you; the only thing to note is that your dataset or collator should return input_ids, attention_mask, and labels.

  2. To add new tokens:

```python
tokenizer.add_tokens(list_of_new_tokens)

# resize the embeddings
model.resize_token_embeddings(len(tokenizer))
```
  3. Using a task prefix is optional.
  4. No, you won’t need to register the task; the original T5 repo requires that, but it’s not required here.
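To make point 1 concrete, here’s a minimal, dependency-free sketch of a collate function that returns the three fields the model expects. It assumes examples are already tokenized into plain lists of ids; in practice you’d return tensors and usually let the tokenizer handle padding:

```python
def collate_fn(batch, pad_id=0, label_pad_id=-100):
    """Pad a batch of pre-tokenized examples into the three fields T5 expects.

    Each example is a dict with "input_ids" and "labels" (lists of token ids);
    the field names are the ones the model's forward pass looks for.
    """
    max_src = max(len(ex["input_ids"]) for ex in batch)
    max_tgt = max(len(ex["labels"]) for ex in batch)
    input_ids, attention_mask, labels = [], [], []
    for ex in batch:
        src, tgt = ex["input_ids"], ex["labels"]
        pad = max_src - len(src)
        input_ids.append(src + [pad_id] * pad)
        attention_mask.append([1] * len(src) + [0] * pad)
        # -100 tells the cross-entropy loss to ignore padded label positions
        labels.append(tgt + [label_pad_id] * (max_tgt - len(tgt)))
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}
```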
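As for point 3, a task prefix is nothing more than a string prepended to the input text before tokenization; the prefix below is just an illustrative example:

```python
def add_prefix(text, prefix="summarize: "):
    # The prefix string is arbitrary; "summarize: " mirrors the prefixes
    # used in T5's multi-task pre-training, but any consistent marker works.
    return prefix + text

print(add_prefix("The tower is 324 metres tall."))
# -> summarize: The tower is 324 metres tall.
```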

You might find these notebooks useful

  1. Fine-tune T5 for Classification and Multiple Choice
  2. Fine-tune T5 for Summarization
  3. Train T5 on TPU

Note: These notebooks manually add the eos token (</s>), but that’s not needed with the current version; the tokenizer will handle it.

Here’s a great thread on tips and tricks for T5 fine-tuning