I use the Hugging Face `Trainer` class to fine-tune an mT5 model. Since my data (AMR graphs) contains many tokens like `:ARG0`, `:op2` or `:name`, which are generally split into wordpieces, I added those 219 tokens to the tokenizer with
```python
from transformers import MT5ForConditionalGeneration, MT5Tokenizer

tokenizer = MT5Tokenizer.from_pretrained("google/mt5-base")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-base")

# the 219 AMR relation tokens (:ARG0, :op2, :name, ...)
toks = [ ... my list of tokens ... ]
tokenizer.add_tokens(toks, special_tokens=False)
# enlarge the embedding matrix so the new token ids have vectors
model.resize_token_embeddings(len(tokenizer))
```
and then start the training.
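For completeness, the training itself is an ordinary `Trainer` run along these lines (the output directory, hyperparameters and `train_dataset` are placeholders here, not my actual script; `train_dataset` stands for the tokenized sentence/AMR-graph pairs):

```python
from transformers import DataCollatorForSeq2Seq, Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="out/mt5-amr",          # hypothetical output directory
    num_train_epochs=10,
    per_device_train_batch_size=8,
    learning_rate=1e-4,
)

trainer = Trainer(
    model=model,                        # the resized mT5 model from above
    args=args,
    train_dataset=train_dataset,        # tokenized sentence/AMR-graph pairs
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    tokenizer=tokenizer,
)
trainer.train()
trainer.save_model("out/mt5-amr")       # checkpoint loaded later for testing
```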
Once the model is fine-tuned I run my test, and once again I add the same tokens in the same order to the tokenizer. But the result is catastrophic: it drops from 80% F1 to 50%, so evidently something is going wrong. I compared the tokenisation with and without the added tokens and it looks OK, but I don't have the slightest idea where else to check. Can you give me a hint about the error I'm making?
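In case it matters, my test setup looks roughly like this (the checkpoint path is a placeholder, and `toks` is the same list as above):

```python
from transformers import MT5ForConditionalGeneration, MT5Tokenizer

# fine-tuned checkpoint written by the Trainer (placeholder path)
model = MT5ForConditionalGeneration.from_pretrained("out/mt5-amr")

# fresh base tokenizer, then the same 219 tokens added in the same order
tokenizer = MT5Tokenizer.from_pretrained("google/mt5-base")
toks = [ ... the same list of tokens ... ]
tokenizer.add_tokens(toks, special_tokens=False)

# generate an AMR graph for a test sentence
batch = tokenizer("an example test sentence", return_tensors="pt")
generated = model.generate(**batch, max_length=512)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```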
I thought (possibly erroneously) that the added tokens would get random vectors which would be updated during fine-tuning. If this is not the case, is there a way to do it? And if not, what is the point of adding new tokens?
Could anybody elaborate on this?
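To make my assumption concrete, this is the behaviour I expected from `resize_token_embeddings` (using `:ARG0` just as an example):

```python
# the embedding matrix should now have one row per token, including the added ones
emb = model.get_input_embeddings().weight
print(emb.shape)                          # (len(tokenizer), d_model)

# the row for an added token: freshly (randomly) initialized ...
new_id = tokenizer.convert_tokens_to_ids(":ARG0")
print(new_id, emb[new_id][:5])

# ... and trainable, so it should get updated by the Trainer
print(emb.requires_grad)                  # True
```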
Environment info
- `transformers` version: 4.11.3
- Platform: Linux-5.13.0-30-generic-x86_64-with-glibc2.17
- Python version: 3.8.12
- PyTorch version (GPU): 1.9.1
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: no