Opus-MT: Translation returns <unk> token

(x-posting with StackOverflow)

I’m getting relatively good results with Helsinki-NLP models for translation, except for one thing: some special characters are omitted from the translation. If I decode without skipping the special tokens, I get the following:

<pad> <unk> a fait mal !</s>

<unk> is right where the translation should include a French Ç (expected result “Ça fait mal” from source “That hurts!”). Note:

  • lower case ç works just fine.
  • Exact same issue with È: <pad> APR<unk> S VOUS !</s> (should be “APRÈS VOUS !”)
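
To pin down exactly which characters are being lost, the decoded output can be compared against the expected translation. This is only a minimal sketch: `missing_chars` is a hypothetical helper (not part of transformers), and it assumes the special tokens are spelled `<pad>`, `</s>` and `<unk>` as in the outputs above.

```python
import re
from typing import List

# Special tokens to drop before comparing (assumed spellings, taken from the outputs above).
SPECIAL_TOKENS = ("<pad>", "</s>")


def missing_chars(decoded: str, expected: str) -> List[str]:
    """Return the text in `expected` that sits where `decoded` has <unk>."""
    for token in SPECIAL_TOKENS:
        decoded = decoded.replace(token, "")
    # Split around each <unk> and escape the literal pieces for use in a regex.
    parts = [re.escape(part.strip()) for part in decoded.strip().split("<unk>")]
    # Each <unk> becomes a lazy capture group; whitespace around it is optional
    # because the decoder may insert spaces next to the unknown token.
    pattern = r"\s*(.+?)\s*".join(parts)
    match = re.fullmatch(pattern, expected.strip())
    return list(match.groups()) if match else []


print(missing_chars("<pad> <unk> a fait mal !</s>", "Ça fait mal !"))
print(missing_chars("<pad> APR<unk> S VOUS !</s>", "APRÈS VOUS !"))
```

An empty list means either that no `<unk>` was present or that the two strings could not be aligned.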

It’s definitely not a model issue but a me issue: if I try the OPUS Translate Space (OPUS Translate - a Hugging Face Space by Helsinki-NLP), it works just fine.

I tried using the code verbatim from the model page, to no avail (Helsinki-NLP/opus-mt-tc-big-en-fr · Hugging Face)

My current code is not far from it, and produces exactly the result I posted above:

import torch
from transformers import MarianMTModel, MarianTokenizer

def __init__(self, model_path_or_name: str, source_language: str, target_language: str):
    self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    self.tokenizer = MarianTokenizer.from_pretrained(model_path_or_name)
    self.model = MarianMTModel.from_pretrained(model_path_or_name).to(self.device)

def single_translate(self, text: str) -> str:
    """
    Translate a single sentence and return the translated string only.
    """
    inputs = self.tokenizer([text], return_tensors="pt", padding=True, truncation=True)
    input_ids = inputs.input_ids.to(self.model.device)
    with torch.no_grad():
        outputs = self.model.generate(input_ids=input_ids)
    decoded = self.tokenizer.batch_decode(outputs, skip_special_tokens=False)
    return decoded[0]

Any advice would be greatly appreciated!

It seems to be a model issue:

from transformers import pipeline
pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")
print(pipe("That hurts!")) # [{'translation_text': 'Ça fait mal !'}]
pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-big-en-fr")
print(pipe("That hurts!")) # [{'translation_text': 'a fait mal !'}]
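
One way to check whether the gap is in the vocabulary rather than in the calling code is to look for characters that no vocabulary piece contains. This is only a rough sketch: `uncovered_chars` is a hypothetical helper, the toy vocabulary below is illustrative and not the real one, and it assumes that a character absent from every piece ends up as <unk>. With a real tokenizer, the dict would come from `tokenizer.get_vocab()`.

```python
from typing import Dict, List


def uncovered_chars(text: str, vocab: Dict[str, int]) -> List[str]:
    """Return the non-space characters of `text` that appear in no vocabulary piece."""
    # Collect every character used by any piece in the vocabulary.
    covered = set("".join(vocab))
    return sorted({ch for ch in text if not ch.isspace() and ch not in covered})


# Toy vocabulary for illustration only (NOT the real opus-mt-tc-big-en-fr vocabulary).
toy_vocab = {"▁": 0, "ç": 1, "a": 2, "▁fait": 3, "▁mal": 4, "!": 5}
print(uncovered_chars("Ça fait mal !", toy_vocab))
```

Running the same check against the vocabularies of both checkpoints should show whether Ç and È are missing from only one of them.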

Damn, it never occurred to me that the space could be using a different model in the same family/language. Thanks a lot, you’ve saved me a lot of headaches trying to find what was going wrong. Going to add a comment on the model / community page.
