Hello
I’m training a new tokenizer from an existing one (BERT-based).
The new tokenizer returns [UNK] for words that already exist in the old vocabulary and are tokenized correctly by the old tokenizer:
from transformers import AutoTokenizer
old_tokenizer = AutoTokenizer.from_pretrained("asafaya/bert-base-arabic")
tokens = old_tokenizer.tokenize('مع')
tokens
returns ['مع']
while with the new tokenizer:
new_tokenizer = old_tokenizer.train_new_from_iterator(training_corpus, 10)
tokens = new_tokenizer.tokenize('مع')
tokens
returns ['[UNK]']
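For context, as far as I understand WordPiece (the scheme BERT tokenizers use), a word is split greedily into the longest pieces found in the vocabulary, and the whole word becomes [UNK] if any part cannot be matched. Below is a toy sketch of that behaviour. It is my own illustration, not the library's code: the `wordpiece` helper and both vocabularies are made up for the example, and it does not reproduce my actual training run.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word (toy illustration)."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry a "##" prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # Some part of the word is not covered: the whole word is unknown.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical vocabularies, chosen only to show the two outcomes.
old_vocab = {"[UNK]", "مع", "م", "##ع"}
tiny_vocab = {"[UNK]", "[CLS]", "[SEP]", "م"}

old_result = wordpiece("مع", old_vocab)
new_result = wordpiece("مع", tiny_vocab)
print(old_result, new_result)
```

In this sketch the first vocabulary contains the full word, while the second only covers its first character, so the remainder cannot be matched and the word falls back to [UNK].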
Can anyone help me, please?