Adding tokens, but tokenizer doesn't use them

Hi, I'm trying to add tokens to a pretrained tokenizer.
First, I initialized the tokenizer:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

Next I created a training iterator and trained a new tokenizer:

training_corpus = get_training_corpus()
new_tokenizer = tokenizer.train_new_from_iterator(training_corpus, vocab_size=10000)
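(Here get_training_corpus just yields batches of raw text; a minimal sketch, assuming the texts sit in a plain Python list called texts:)

def get_training_corpus(batch_size=1000):
    # yield the raw texts in batches; `texts` is assumed to be a list of strings
    for i in range(0, len(texts), batch_size):
        yield texts[i : i + batch_size]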

Taking the diff:

tokens_to_add = list(set(new_tokenizer.vocab.keys()) - set(tokenizer.vocab.keys()))
output = tokenizer.add_tokens(tokens_to_add)

The tokenizer is updated: I can see it's now the correct size (i.e., the original size of 32000 + the added tokens), and I can also see the new tokens under added_tokens_decoder and added_tokens_encoder. Everything seems great.
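Roughly, the check looks like this (just a sanity-check sketch using the variables above):

print(len(tokenizer))                        # 32000 + len(tokens_to_add)
print(len(tokenizer.added_tokens_encoder))   # the new tokens are registered here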

But when I try to tokenize my input data:

tokenizer.tokenize(x, return_tensors="pt")

the tokenizer just doesn't use the new tokens.
Any idea what I'm doing wrong?

Thanks!


Just experienced the same issue with Llama-2-7B. Did you find a solution?

In my case, adding the tokens with normalized=False fixed the issue on my single example.
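Something like this (a minimal sketch, reusing the variable names from the original post):

from transformers import AddedToken

# Wrap each new token so the tokenizer's normalizer doesn't transform it
# before trying to match it in the input text
tokens_to_add = list(set(new_tokenizer.vocab.keys()) - set(tokenizer.vocab.keys()))
output = tokenizer.add_tokens([AddedToken(t, normalized=False) for t in tokens_to_add])

print(tokenizer.tokenize("some text containing one of the new tokens"))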