- You can choose whatever format works well for you. The only thing to note is that your dataset or collator should return input_ids, attention_mask and labels.
- To add new tokens:
tokenizer.add_tokens(list_of_new_tokens)
# resize the embeddings
model.resize_token_embeddings(len(tokenizer))
- Using a task prefix is optional.
- No, you won’t need to register the task. The original T5 repo requires that, but it’s not required here.
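To make the first point concrete, here is a minimal sketch of a collator that returns those three keys. It assumes the examples are already tokenized into lists of ids, that the pad token id is 0 (as it is for T5), and that padded label positions are set to -100 so the loss ignores them. The names `collate`, `pad`, `PAD_TOKEN_ID` and `LABEL_IGNORE_ID` are made up for illustration.

```python
from typing import Dict, List

PAD_TOKEN_ID = 0        # T5's pad token id; an assumption if you use another tokenizer
LABEL_IGNORE_ID = -100  # label positions with this value are ignored by the loss


def pad(seq: List[int], length: int, value: int) -> List[int]:
    """Right-pad seq with value up to the given length."""
    return seq + [value] * (length - len(seq))


def collate(features: List[Dict[str, List[int]]]) -> Dict[str, List[List[int]]]:
    """Pad a list of tokenized examples into one batch.

    Each feature is expected to hold "input_ids" and "labels" as lists of ids.
    """
    max_in = max(len(f["input_ids"]) for f in features)
    max_out = max(len(f["labels"]) for f in features)

    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n = len(f["input_ids"])
        batch["input_ids"].append(pad(f["input_ids"], max_in, PAD_TOKEN_ID))
        # 1 for real tokens, 0 for padding
        batch["attention_mask"].append(pad([1] * n, max_in, 0))
        batch["labels"].append(pad(f["labels"], max_out, LABEL_IGNORE_ID))
    return batch
```

In practice you would convert each list to a tensor (for example with `torch.tensor`) before passing the batch to the model.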
You might find these two notebooks useful
Note: These notebooks manually add the eos token (</s>), but that’s not needed with the current version; the tokenizer will handle it.
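If you are unsure whether your installed version already appends the eos token, a quick check like the sketch below may help. The helper name `appends_eos` and the sample text are made up for illustration; it only assumes the tokenizer is callable, returns `input_ids`, and exposes `eos_token_id`, as the transformers tokenizers do.

```python
def appends_eos(tokenizer, text: str = "translate English to German: Hello") -> bool:
    """Return True if the tokenizer ends the encoded text with its eos token id."""
    ids = tokenizer(text)["input_ids"]
    return len(ids) > 0 and ids[-1] == tokenizer.eos_token_id


# Example usage (assumes transformers is installed and t5-small can be downloaded):
# from transformers import T5Tokenizer
# tokenizer = T5Tokenizer.from_pretrained("t5-small")
# print(appends_eos(tokenizer))
```

If this returns True, adding </s> by hand would give you a duplicated eos token.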
Here’s a great thread on tips and tricks for T5 fine-tuning