How to properly add new tokens to a tokenizer's vocab?

Hi all,

I am training a T5 model on a seq2seq task.
I want to add new tokens to my tokenizer because my text dataset is fairly domain-specific.
But after adding them, my model's performance got worse, and I don't know why.

Here is my code:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "google/t5-v1_1-small"
tokenizer = AutoTokenizer.from_pretrained(model_name, legacy=False)

new_tokenizer = AutoTokenizer.from_pretrained("...")
new_vocab = new_tokenizer.vocab
original_vocab = tokenizer.vocab

# Collect the tokens that exist only in the new tokenizer's vocab
new_tokens = []
for key in new_vocab.keys():
    if key not in original_vocab:
        new_tokens.append(key)

num_added_toks = tokenizer.add_tokens(new_tokens)
print("We have added", num_added_toks, "tokens")

model_name = "google/t5-v1_1-small"
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
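
To make the filtering step easier to check on its own, here is a minimal sketch of the same vocab comparison on toy dictionaries. The helper name `collect_new_tokens` and the toy vocabularies are hypothetical and only for illustration; they are not part of transformers, and the real vocabs would come from `tokenizer.vocab` as above.

```python
def collect_new_tokens(new_vocab, original_vocab):
    """Return tokens found in new_vocab but not in original_vocab.

    Both arguments are token -> id mappings. The result is ordered by
    the token's id in new_vocab so that it is deterministic.
    """
    missing = [tok for tok in new_vocab if tok not in original_vocab]
    return sorted(missing, key=lambda tok: new_vocab[tok])


# Toy vocabularies (hypothetical), standing in for tokenizer.vocab
original_vocab = {"the": 0, "cat": 1, "sat": 2}
new_vocab = {"the": 0, "enzyme": 1, "cat": 2, "ligand": 3}

new_tokens = collect_new_tokens(new_vocab, original_vocab)
print(new_tokens)
```

As far as I understand, after `tokenizer.add_tokens(new_tokens)` the model's embedding matrix also has to be resized with `model.resize_token_embeddings(len(tokenizer))`, and the embeddings for the added tokens start out untrained, so that may be relevant to the drop in performance.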

I have added 13802 tokens to the T5 tokenizer. Is that too many?
I would really appreciate any help.