MarianMT translation issue

I’m trying to work on yo-en (Yoruba to English) translation using MarianMT, since I found a pretrained bilingual model for my needs. However, when I checked the test file linked below, the source-language text seemed to be encoded differently: a lot of the characters were changed. I need help with how to proceed, since I think this would affect performance. Thanks🤍

https://object.pouta.csc.fi/OPUS-MT-models/yo-en/opus-2020-01-16.test.txt

You are probably on Windows, right? That text file is encoded in UTF-8, but Windows (still) defaults to cp1252 or something like that, so it does not display those characters correctly by default. That does not mean that the text is incorrect: the bytes are correct, but your computer is decoding them with the wrong encoding. You can check this by downloading the file and opening it in your favourite editor with UTF-8 encoding. So if you open this file in Python, for instance, you have to use something like

# yourfile is the path to the downloaded test file
with open(yourfile, encoding="utf-8") as fh:
    ...
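To see the effect for yourself, here is a minimal sketch of what happens when correct UTF-8 bytes are decoded as cp1252. The sample string is a made-up illustration, not a line from the test file, and cp1252 is only an assumption about what your system falls back to.

```python
# Hypothetical sample with Yoruba diacritics ("Ẹ ṣé"), written with
# escapes so the code points are unambiguous.
sample = "\u1eb8 \u1e63\u00e9"

# These are the bytes a UTF-8 file would actually contain.
raw = sample.encode("utf-8")

# Decoding with the wrong codec yields one garbled character per byte.
wrong = raw.decode("cp1252", errors="replace")

# Decoding with the right codec recovers the original text.
right = raw.decode("utf-8")

print(repr(wrong))
print(repr(right))
print(right == sample)
```

The bytes are the same in both cases; only the codec used to interpret them differs, which is why passing encoding="utf-8" to open() is enough.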

That should help you. However, this is a very general issue and has nothing at all to do with transformers or any other HF libraries. So please use some other forums for general questions like this, like Stack Overflow.
