Reused tokenizer returns [UNK]

Hello,
I'm training a new tokenizer from an existing one (BERT-based).
The new tokenizer returns [UNK] for words that already exist in the old vocabulary and are tokenized correctly by the old tokenizer.

from transformers import AutoTokenizer

old_tokenizer = AutoTokenizer.from_pretrained("asafaya/bert-base-arabic")

tokens = old_tokenizer.tokenize('مع')
tokens

This returns ['مع'],
while with the new tokenizer:

new_tokenizer = old_tokenizer.train_new_from_iterator(training_corpus, 10)
tokens = new_tokenizer.tokenize('مع')
tokens

returns ['[UNK]']
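
One way to narrow this down is to check whether the word is present in the new tokenizer's vocabulary at all. The sketch below is only an illustration: `missing_from_vocab` is a hypothetical helper (not part of transformers) and `toy_vocab` is made-up data standing in for `new_tokenizer.get_vocab()`. Note that the second positional argument of `train_new_from_iterator` is `vocab_size`, so the value 10 above may leave very little room beyond the special tokens.

```python
def missing_from_vocab(vocab, words):
    """Return the words that have no entry in a token -> id vocabulary mapping."""
    return [w for w in words if w not in vocab]


# Toy stand-in for new_tokenizer.get_vocab(); hypothetical data for illustration only.
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "[MASK]": 4}
print(missing_from_vocab(toy_vocab, ["مع"]))

# With the tokenizers from the post (not run here):
# print(len(new_tokenizer))  # size of the retrained vocabulary
# print(missing_from_vocab(new_tokenizer.get_vocab(), ["مع"]))
```

If the word shows up as missing, the retrained vocabulary simply does not contain it, and a larger `vocab_size` would be the first thing to try.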

Can anyone help me, please?

I am also facing a similar issue. Were you able to sort it out?