MarianMt translation issue

dunfash · December 31, 2020, 10:13pm

I’m trying to work on yo-en translation using MarianMT, since I found a pretrained bilingual model for my needs. However, when I checked the test file linked below, the source-language text seemed to be encoded differently: a lot of the characters were changed. I need help with how to proceed, as I think this would affect performance. Thanks🤍

https://object.pouta.csc.fi/OPUS-MT-models/yo-en/opus-2020-01-16.test.txt

BramVanroy · January 2, 2021, 9:20am

You are probably on Windows, right? That text file contains UTF-8 characters, but Windows (still) defaults to cp1252 or something like that, so it does not display those characters correctly by default. That does not mean the text is incorrect: the bytes in the file are correct, and your computer is just showing them incorrectly. You can check this by downloading the file and opening it in your favourite editor with the encoding set to UTF-8. So if you open this file in Python, for instance, you have to pass the encoding explicitly, with something like

# yourfile is the path to the downloaded test file
with open(yourfile, encoding="utf-8") as fh:
    text = fh.read()
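If you want to convince yourself that the bytes are fine and only the decoding is off, here is a minimal sketch. The sample string, the temporary file name, and the `read_text` helper are my own illustration and are not taken from the OPUS test file. It decodes the same UTF-8 bytes once with cp1252 (which garbles them) and once with UTF-8 (which restores them).

```python
import os
import tempfile

# Illustrative Yoruba text with diacritics (an assumption, not from the test file).
sample = "Ẹ káàárọ̀"

# The bytes as they would be stored on disk in a UTF-8 file.
raw = sample.encode("utf-8")

# Decoding those bytes with the wrong codec gives garbled text.
# errors="replace" is needed because some byte values are undefined in cp1252.
garbled = raw.decode("cp1252", errors="replace")


def read_text(path, encoding="utf-8"):
    """Hypothetical helper: read a whole text file with an explicit encoding."""
    with open(path, encoding=encoding) as fh:
        return fh.read()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    # Write the raw bytes, as a download would.
    with open(path, "wb") as fh:
        fh.write(raw)
    # Reading with encoding="utf-8" recovers the original text.
    restored = read_text(path)

print(repr(garbled))
print(repr(restored))
```

The garbled output should look similar to what you saw in the browser or editor, while the text read with `encoding="utf-8"` matches the original.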

That should help you. However, this is a very general issue and has nothing at all to do with transformers or any other HF libraries. So please use another forum, like Stack Overflow, for general questions like this.
