Hi, I'm trying to add tokens to a pretrained tokenizer.
First I initialized the tokenizer:
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
Next I created a training iterator and trained a new tokenizer:
training_corpus = get_training_corpus()
new_tokenizer = tokenizer.train_new_from_iterator(training_corpus, vocab_size=10000)
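For reference, get_training_corpus() is just a batched generator over my raw texts, roughly like this (texts here is a placeholder for my actual corpus):

texts = ["example text 1", "example text 2"]  # placeholder for my real training strings

def get_training_corpus():
    # yield the corpus in batches of 1000, since train_new_from_iterator expects an iterator of text batches
    for i in range(0, len(texts), 1000):
        yield texts[i : i + 1000]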
Taking the diff:
tokens_to_add = list(set(new_tokenizer.vocab.keys()) - set(tokenizer.vocab.keys()))
output = tokenizer.add_tokens(tokens_to_add)
The tokenizer is updated: I can see it's now the correct size (i.e. the original 32000 + the added tokens), and the new tokens show up under added_tokens_decoder and added_tokens_encoder. Everything seems great.
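For the record, this is how I checked (output above is the count returned by add_tokens):

print(len(tokenizer))                  # 32000 + output, so the size looks right
print(tokenizer.added_tokens_encoder)  # the new tokens all show up here
# (and likewise in tokenizer.added_tokens_decoder)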
But when I'm trying to tokenize my input data:
tokenizer.tokenize(x, return_tensors="pt")
the tokenizer just doesn't use the new tokens.
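For example, taking one of the tokens I just added (any of them behaves the same for me):

new_token = tokens_to_add[0]          # one of the freshly added tokens
print(tokenizer.tokenize(new_token))  # I expected [new_token], but it comes back split into the original subwords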
Any idea what I'm doing wrong?!
Thanks!