Hi,
I have a bunch of separate strings to translate between languages using MarianMTModel, Helsinki-NLP/opus-mt-pl-en to be exact. These strings are independent and don’t have to be full sentences: one can be a single word, another a phrase, and another several sentences.
If I treat each element as its own batch, and I may have hundreds of them, the translations are accurate and good overall, but the process takes far too long. So I try to join as many elements as I can, keeping the cumulative length below the model’s max_length. I join them with the symbol <T> because I later need to split the result back apart to handle each element separately. So these two:
- text one
- text two
become: "text one <T> text two"
which then goes into the model.
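A minimal sketch of the joining and splitting described above. The function names and the `length_of` parameter are illustrative assumptions, not part of any library; the default measures length in characters, while the real limit would be counted in tokens. Splitting also assumes the model reproduces <T> unchanged in its output, which is not guaranteed.

```python
SEPARATOR = " <T> "  # the joining symbol from the post, padded with spaces


def pack_texts(texts, max_length, length_of=len):
    """Greedily join consecutive texts so each chunk stays below max_length.

    length_of is a hypothetical hook: by default it counts characters,
    but it could be swapped for a function that counts tokens.
    A single text that is already too long ends up in a chunk by itself.
    """
    chunks, current = [], []
    for text in texts:
        candidate = SEPARATOR.join(current + [text])
        if current and length_of(candidate) >= max_length:
            # Adding this text would reach the limit: close the current chunk.
            chunks.append(SEPARATOR.join(current))
            current = [text]
        else:
            current.append(text)
    if current:
        chunks.append(SEPARATOR.join(current))
    return chunks


def split_translation(translated):
    """Split a translated chunk back into its elements on the <T> symbol."""
    return [part.strip() for part in translated.split(SEPARATOR.strip())]


if __name__ == "__main__":
    packed = pack_texts(["text one", "text two"], max_length=512)
    print(packed)
    print(split_translation(packed[0]))
```

Each string returned by `pack_texts` is what goes into the model, and `split_translation` is applied to each translated string afterwards.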
This doesn’t work well for the shorter elements; some of their translations are terrible. I’m aware that the <T> symbol isn’t in the model’s dictionary. I tried adding new elements to the dictionary, but that requires rerunning the model’s training, which I’m not able to do.
Has anyone run into this problem, or does anyone have ideas on how to solve it? Maybe I’ve been doing something wrong when adding special symbols.
Thanks