I use the Hugging Face `Trainer` class to fine-tune an mT5 model. Since my data (AMR graphs) contains many tokens like `:ARG0`, `:op2` or `:name`, which are generally split into wordpieces, I added those 219 tokens to the tokenizer with
```python
from transformers import MT5ForConditionalGeneration, MT5Tokenizer

tokenizer = MT5Tokenizer.from_pretrained("google/mt5-base")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-base")

# the 219 AMR relation tokens (:ARG0, :op2, :name, ...)
toks = [ ... my list of tokens ... ]
tokenizer.add_tokens(toks, special_tokens=False)
# enlarge the embedding matrix so the new token ids have vectors
model.resize_token_embeddings(len(tokenizer))
```
and then start the training.
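For completeness, the training itself is an ordinary `Trainer` run along these lines (the output directory, hyperparameters and `train_dataset` are placeholders here, not my actual script; `train_dataset` stands for the tokenized sentence/AMR-graph pairs):

```python
from transformers import DataCollatorForSeq2Seq, Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="out/mt5-amr",          # hypothetical output directory
    num_train_epochs=10,
    per_device_train_batch_size=8,
    learning_rate=1e-4,
)

trainer = Trainer(
    model=model,                        # the resized mT5 model from above
    args=args,
    train_dataset=train_dataset,        # tokenized sentence/AMR-graph pairs
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    tokenizer=tokenizer,
)
trainer.train()
trainer.save_model("out/mt5-amr")       # checkpoint loaded later for testing
```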
Once the model is fine-tuned I run my test, and once again I add the same tokens in the same order to the tokenizer. But the result is catastrophic: it drops from 80% F1 to 50%, so evidently something is going wrong. I compared the tokenisation with and without the added tokens and it looks OK, but I don't have the slightest idea where else to check. Can you give me a hint about the error I'm making?
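In case it matters, my test setup looks roughly like this (the checkpoint path is a placeholder, and `toks` is the same list as above):

```python
from transformers import MT5ForConditionalGeneration, MT5Tokenizer

# fine-tuned checkpoint written by the Trainer (placeholder path)
model = MT5ForConditionalGeneration.from_pretrained("out/mt5-amr")

# fresh base tokenizer, then the same 219 tokens added in the same order
tokenizer = MT5Tokenizer.from_pretrained("google/mt5-base")
toks = [ ... the same list of tokens ... ]
tokenizer.add_tokens(toks, special_tokens=False)

# generate an AMR graph for a test sentence
batch = tokenizer("an example test sentence", return_tensors="pt")
generated = model.generate(**batch, max_length=512)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```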
I thought (possibly erroneously) that the added tokens would get random vectors which would be updated during fine-tuning. If this is not the case, is there a way to do it? And if not, what is the point of adding new tokens?
Could anybody elaborate on this?
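To make my assumption concrete, this is the behaviour I expected from `resize_token_embeddings` (using `:ARG0` just as an example):

```python
# the embedding matrix should now have one row per token, including the added ones
emb = model.get_input_embeddings().weight
print(emb.shape)                          # (len(tokenizer), d_model)

# the row for an added token: freshly (randomly) initialized ...
new_id = tokenizer.convert_tokens_to_ids(":ARG0")
print(new_id, emb[new_id][:5])

# ... and trainable, so it should get updated by the Trainer
print(emb.requires_grad)                  # True
```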
Environment info
- `transformers` version: 4.11.3
- Platform: Linux-5.13.0-30-generic-x86_64-with-glibc2.17
- Python version: 3.8.12
- PyTorch version (GPU): 1.9.1
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: no