I have a bunch of separate strings to translate using MarianMTModel (Helsinki-NLP/opus-mt-pl-en, to be exact). The strings are independent of each other and don't have to be full sentences: one can be a single word, another a phrase, and another several sentences.
If I treat each element as its own batch, and I may have hundreds of them, the translations are very accurate, but the whole process takes far too long. So I try to join as many elements as I can, keeping the cumulative length below the model's max_length. I join them with the symbol <T>, because afterwards I need to split the output again and handle each element separately. So these two:
- text one
- text two
become "text one <T> text two", which then goes into the model.
Well, this doesn't work particularly well for the shorter elements; some of the translations are terrible. I'm aware that the <T> symbol isn't in the model's vocabulary. I tried adding new tokens to the vocabulary, but that requires retraining the model, which I'm not able to do.
Has anyone run into this problem, or does anyone have ideas on how to solve it? Maybe I've been doing something wrong when adding the special symbols.
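
For reference, the joining and splitting step described above can be sketched roughly as follows. This is a simplified illustration, not the exact code: pack, join_group, unpack and the count_tokens callback are made-up names, and a whitespace word count stands in for the real Marian tokenizer so the snippet runs without downloading the model.

```python
from typing import Callable, List

SEP = " <T> "  # separator placed between elements before translation


def pack(texts: List[str], max_length: int,
         count_tokens: Callable[[str], int]) -> List[List[str]]:
    """Greedily group texts so that each joined group fits in max_length.

    count_tokens is a placeholder for a real tokenizer-based length
    function. A single text that is already longer than max_length is
    kept in a group of its own; it is not truncated here.
    """
    groups: List[List[str]] = []
    current: List[str] = []
    for text in texts:
        candidate = SEP.join(current + [text])
        if current and count_tokens(candidate) > max_length:
            # Adding this text would exceed the budget: close the group.
            groups.append(current)
            current = [text]
        else:
            current.append(text)
    if current:
        groups.append(current)
    return groups


def join_group(group: List[str]) -> str:
    """Build the single string that is sent to the model."""
    return SEP.join(group)


def unpack(translated: str) -> List[str]:
    """Split a translated string back into elements.

    This assumes the model copies <T> through unchanged, which is
    exactly the part that is not guaranteed.
    """
    return [part.strip() for part in translated.split("<T>")]


def count_words(text: str) -> int:
    """Stand-in length function: counts whitespace-separated words."""
    return len(text.split())


groups = pack(["text one", "text two", "a b c d e"], 5, count_words)
joined = join_group(groups[0])
```

With the two example elements, joined is the same "text one <T> text two" string shown earlier, and unpack only recovers the elements if <T> survives translation intact.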