Unknown character although it is present in the vocabulary list

Hello

I am using the m2m100 model, but I noticed that some characters listed in the vocabulary file (data_dict128k.txt) are marked as unknown by the model when translating. Is there any way to make it handle these tokens?

e.g. this character: ▬

from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")

while True:
    text = input("Enter a text: ")
    tokenizer.src_lang = "en"
    encoded_en = tokenizer(text, return_tensors="pt")

    # Translate from English to French
    generated_tokens = model.generate(**encoded_en, forced_bos_token_id=tokenizer.get_lang_id("fr"))
    print("\n\n" + tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0])

For example, with the input

▬▬Hello▬ world▬

the script returns

<unk> <unk> Monde <unk>
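
For what it's worth, you can check how the tokenizer handles the character directly; if convert_tokens_to_ids returns the same value as unk_token_id, the piece is unknown to the SentencePiece model even though the character appears in the dictionary file (using the tokenizer defined above):

print(tokenizer.tokenize("▬"))                  # pieces produced by the SentencePiece model
print(tokenizer.convert_tokens_to_ids(["▬"]))   # id assigned to the character
print(tokenizer.unk_token_id)                   # id used for unknown pieces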

How can I make it include this character in the translation? Moreover, I would like the model to be able to handle emojis as well, so I would like to know how I can extend the model globally without having to go through each language prefix (there are more than 9,000 of them).

Thank you in advance for your response.


Yes, you can augment the vocabulary with tokens that are not in the original M2M100Tokenizer before training, to avoid the <unk>. See m2m-100-finetune/tok.py at main · TartuNLP/m2m-100-finetune · GitHub for how to add new tokens to the tokenizer.
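
For example, here is a minimal sketch of the generic Transformers route (not the exact script linked above): register the missing pieces as extra tokens, resize the model's embeddings, and then fine-tune so the model learns to actually produce them. The token list is just an illustration; extend it with whatever characters or emojis you need.

from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")

# Characters/emojis to add; extend this list as needed.
new_tokens = ["▬", "😀", "🚀"]
num_added = tokenizer.add_tokens(new_tokens)
print(f"Added {num_added} tokens")

# Grow the embedding matrix so the new ids get vectors. These vectors are
# randomly initialized, so the model must be fine-tuned on data containing
# the new tokens before it will use them in its translations.
model.resize_token_embeddings(len(tokenizer))

After that, save both the tokenizer and the model with save_pretrained so the extended vocabulary stays consistent during fine-tuning.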