(x-posting with StackOverflow)
I’m having relatively good results with Helsinki-NLP models for translation, except for one thing: some special characters are omitted from the translation. If I decode without skipping the special tokens, I get the following:
<pad> <unk> a fait mal !</s>
The <unk> is right where the translation should include a French Ç (expected result “Ça fait mal” from source “That hurts!”). Note:
- Lower-case ç works just fine.
- Exact same issue with È:
<pad> APR<unk> S VOUS !</s>
(should be “APRÈS VOUS !”)
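In case it helps narrow this down, here is the kind of check I can run on my side to see where the character gets lost. This is just a minimal sketch, assuming the standard MarianTokenizer API and the model from the card linked below:

from transformers import MarianTokenizer

tok = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-tc-big-en-fr")

# Does the source-side SentencePiece tokenization keep the uppercase Ç?
print(tok.tokenize("Ça fait mal !"))

# Compare the resulting ids against the unknown-token id
print(tok.convert_tokens_to_ids(tok.tokenize("Ça fait mal !")), tok.unk_token_id)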
It’s definitely not a model issue but a me issue: if I try the same sentences on the OPUS Translate Space (OPUS Translate - a Hugging Face Space by Helsinki-NLP), they work just fine.
I tried using the code verbatim from the model page (Helsinki-NLP/opus-mt-tc-big-en-fr · Hugging Face), to no avail.
My current code is not far from it (trimmed down to the relevant parts of my wrapper class), and produces exactly the result I posted above:

import torch
from transformers import MarianMTModel, MarianTokenizer

class EnFrTranslator:  # placeholder name, the actual wrapper class has a bit more to it
    def __init__(self, model_path_or_name: str, source_language: str, target_language: str):
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.tokenizer = MarianTokenizer.from_pretrained(model_path_or_name)
        self.model = MarianMTModel.from_pretrained(model_path_or_name).to(self.device)

    def single_translate(self, text: str) -> str:
        """
        Translate a single sentence and return the translated string only.
        """
        inputs = self.tokenizer([text], return_tensors="pt", padding=True, truncation=True)
        input_ids = inputs.input_ids.to(self.model.device)
        with torch.no_grad():
            outputs = self.model.generate(input_ids=input_ids)
        # skip_special_tokens=False on purpose, so the <pad>/<unk>/</s> tokens stay visible
        decoded = self.tokenizer.batch_decode(outputs, skip_special_tokens=False)
        return decoded[0]
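And this is roughly how it gets called (again, the class name above is just a placeholder), which reproduces the output from the top of the post:

translator = EnFrTranslator("Helsinki-NLP/opus-mt-tc-big-en-fr", "en", "fr")
print(translator.single_translate("That hurts!"))
# prints: <pad> <unk> a fait mal !</s>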
Any advice would be greatly appreciated!