(x-posting with StackOverflow)
I’m having relatively good results with Helsinki-NLP models for translation, except for one thing: some special characters are omitted from the translation. If I decode without skipping the special tokens, I get the following:
<pad> <unk> a fait mal !</s>
The <unk> is right where the translation should include a French Ç (expected result “Ça fait mal” from source “That hurts!”). Note:
- Lower-case ç works just fine.
- Exact same issue with È:
<pad> APR<unk> S VOUS !</s>
(should be “APRÈS VOUS !”)
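In case it helps narrow this down, here is the kind of check I can run on my side to see where the character gets lost. This is just a minimal sketch, assuming the standard MarianTokenizer API and the model from the card linked below:

from transformers import MarianTokenizer

tok = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-tc-big-en-fr")

# Does the source-side SentencePiece tokenization keep the uppercase Ç?
print(tok.tokenize("Ça fait mal !"))

# Compare the resulting ids against the unknown-token id
print(tok.convert_tokens_to_ids(tok.tokenize("Ça fait mal !")), tok.unk_token_id)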
It’s definitely not a model issue but a me issue: if I try the same sentences on the OPUS Translate Space (OPUS Translate - a Hugging Face Space by Helsinki-NLP), they work just fine.
I tried using the code verbatim from the model page (Helsinki-NLP/opus-mt-tc-big-en-fr · Hugging Face), to no avail.
My current code is not far from it (trimmed down to the relevant parts of my wrapper class), and produces exactly the result I posted above:

import torch
from transformers import MarianMTModel, MarianTokenizer

class EnFrTranslator:  # placeholder name, the actual wrapper class has a bit more to it
    def __init__(self, model_path_or_name: str, source_language: str, target_language: str):
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.tokenizer = MarianTokenizer.from_pretrained(model_path_or_name)
        self.model = MarianMTModel.from_pretrained(model_path_or_name).to(self.device)

    def single_translate(self, text: str) -> str:
        """
        Translate a single sentence and return the translated string only.
        """
        inputs = self.tokenizer([text], return_tensors="pt", padding=True, truncation=True)
        input_ids = inputs.input_ids.to(self.model.device)
        with torch.no_grad():
            outputs = self.model.generate(input_ids=input_ids)
        # skip_special_tokens=False on purpose, so the <pad>/<unk>/</s> tokens stay visible
        decoded = self.tokenizer.batch_decode(outputs, skip_special_tokens=False)
        return decoded[0]
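And this is roughly how it gets called (again, the class name above is just a placeholder), which reproduces the output from the top of the post:

translator = EnFrTranslator("Helsinki-NLP/opus-mt-tc-big-en-fr", "en", "fr")
print(translator.single_translate("That hurts!"))
# prints: <pad> <unk> a fait mal !</s>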
Any advice would be greatly appreciated!