Hello
I am using the m2m100 model, and I noticed that some characters that appear in the vocabulary list (data_dict128k.txt) are still treated as unknown by the model when translating. Is there any way to make the model handle these vocabulary entries?
For example, this character: ▬
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")

while True:
    text = input("Entrez un texte: ")
    # Translate from English to French
    tokenizer.src_lang = "en"
    encoded_hi = tokenizer(text, return_tensors="pt")
    generated_tokens = model.generate(**encoded_hi, forced_bos_token_id=tokenizer.get_lang_id("fr"))
    print("\n\n" + tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0])
Input: ▬▬Hello▬ world▬
Output: <unk> <unk> Monde <unk>
How can I make the model keep this character in the translation? I would also like the model to be able to use emojis, so I would like to know how I can train the model globally, without going through each language prefix one by one. (There are more than 9000 of them.)
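To narrow down where the <unk> comes from, it may help to check whether the character is already mapped to the unknown token at the tokenizer stage, before the model generates anything. Below is a minimal sketch of such a check. The helper names `find_unknown_tokens` and `demo` are hypothetical (not part of transformers); the helper only assumes the standard tokenizer attributes `unk_token_id` and `convert_ids_to_tokens`.

```python
def find_unknown_tokens(tokenizer, text):
    """Tokenize `text` and report which positions map to the unknown token.

    Returns (tokens, unk_positions): the token strings for the encoded input
    and the indices whose id equals tokenizer.unk_token_id.
    """
    input_ids = tokenizer(text)["input_ids"]
    tokens = tokenizer.convert_ids_to_tokens(input_ids)
    unk_id = tokenizer.unk_token_id
    unk_positions = [i for i, tid in enumerate(input_ids) if tid == unk_id]
    return tokens, unk_positions


def demo():
    # Hypothetical usage with the checkpoint from this post.
    # Imported here so the helper above has no hard dependency on transformers.
    from transformers import M2M100Tokenizer

    tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")
    tokenizer.src_lang = "en"
    tokens, unk_positions = find_unknown_tokens(tokenizer, "▬▬Hello▬ world▬")
    print(tokens)
    print("unknown at positions:", unk_positions)
```

Calling `demo()` prints the tokens and the positions encoded as unknown. If positions are reported, the character is lost before the model ever sees it, so the issue would be in the tokenizer vocabulary rather than in the translation step.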
Thank you in advance for your response.