Hi all,
I am training a T5 model for a seq2seq task.
I want to add new tokens to the tokenizer because my text dataset is domain-specific.
But somehow, after doing this, my model’s performance worsened.
And I don’t know why.
Here is my code:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/t5-v1_1-small"
tokenizer = AutoTokenizer.from_pretrained(model_name, legacy=False)

# Tokenizer whose vocabulary I want to merge into the T5 tokenizer
new_tokenizer = AutoTokenizer.from_pretrained("...")
new_vocab = new_tokenizer.vocab
original_vocab = tokenizer.vocab

# Collect the tokens that are in the new vocabulary but missing from the original one
new_tokens = []
for key in new_vocab.keys():
    if key not in original_vocab:
        new_tokens.append(key)

num_added_toks = tokenizer.add_tokens(new_tokens)
print("We have added", num_added_toks, "tokens")

model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
# Resize the embedding matrix so the new tokens get embedding rows
model.resize_token_embeddings(len(tokenizer))
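For context, a quick check along these lines (the sample string is just a placeholder for my domain text) confirms that the tokenizer and the embedding matrix line up after the resize:

sample = "a sentence with some domain-specific terms"   # placeholder, not my real data
print(len(tokenizer))                                    # vocab size after add_tokens
print(model.get_input_embeddings().weight.shape[0])      # should match len(tokenizer)
print(tokenizer.tokenize(sample))                        # how the updated tokenizer splits the text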
I have added 13802 tokens to the T5 tokenizer. Is that too many?
I would really appreciate some help.
Thanks!