Unknown character although it is present in the vocabulary list

Hello

I am using the m2m100 model, but I noticed that some characters listed in the vocabulary file (data_dict128k.txt) are marked as unknown by the model when translating. Is there any way to make it handle these tokens?

e.g. this character: ▬

from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")

while True:
    text = input("Enter a text: ")
    tokenizer.src_lang = "en"
    encoded_en = tokenizer(text, return_tensors="pt")

    # Translate from English to French
    generated_tokens = model.generate(**encoded_en, forced_bos_token_id=tokenizer.get_lang_id("fr"))
    print("\n\n" + tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0])

For example, with the input

▬▬Hello▬ world▬

the script returns

<unk> <unk> Monde <unk>
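
For what it's worth, you can check how the tokenizer handles the character directly; if convert_tokens_to_ids returns the same value as unk_token_id, the piece is unknown to the SentencePiece model even though the character appears in the dictionary file (using the tokenizer defined above):

print(tokenizer.tokenize("▬"))                  # pieces produced by the SentencePiece model
print(tokenizer.convert_tokens_to_ids(["▬"]))   # id assigned to the character
print(tokenizer.unk_token_id)                   # id used for unknown pieces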

How can I make it include this character in the translation? Moreover, I would like the model to be able to handle emojis as well, so I would like to know how I can extend the model globally without having to go through each language prefix (there are more than 9,000 of them).

Thank you in advance for your response.


Yes, you can augment the vocabulary with tokens that are not in the original M2M100Tokenizer before training, to avoid the <unk>. See m2m-100-finetune/tok.py at main · TartuNLP/m2m-100-finetune · GitHub for how to add new tokens to the tokenizer.
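
For example, here is a minimal sketch of the generic Transformers route (not the exact script linked above): register the missing pieces as extra tokens, resize the model's embeddings, and then fine-tune so the model learns to actually produce them. The token list is just an illustration; extend it with whatever characters or emojis you need.

from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")

# Characters/emojis to add; extend this list as needed.
new_tokens = ["▬", "😀", "🚀"]
num_added = tokenizer.add_tokens(new_tokens)
print(f"Added {num_added} tokens")

# Grow the embedding matrix so the new ids get vectors. These vectors are
# randomly initialized, so the model must be fine-tuned on data containing
# the new tokens before it will use them in its translations.
model.resize_token_embeddings(len(tokenizer))

After that, save both the tokenizer and the model with save_pretrained so the extended vocabulary stays consistent during fine-tuning.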