How to join separate strings to translate them together for better speed?

guzikk · October 7, 2023, 2:56pm

Hi,
I have a bunch of separate strings to translate between languages using MarianMTModel, Helsinki-NLP/opus-mt-pl-en to be exact. The strings are independent of each other and don’t have to be full sentences: one can be a single word, another a phrase, and another several sentences.

If I treat each element as its own batch, and there may be hundreds of them, the translations are accurate and good overall, but the process takes far too long. So I try to join as many elements as I can, keeping the cumulative length below the model’s max_length. I join them with the symbol <T>, because afterwards I need to split the result again so that I can address each element separately. So these two:

text one
text two

become: "text one <T> text two", which then goes into the model.
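
A minimal sketch of the join-and-split step described above. The helper names `pack` and `unpack` are hypothetical, and length is measured in characters by default as a stand-in for the token count; a tokenizer-based function can be passed as `length` to compare against the real max_length. It also assumes the model returns the <T> separator unchanged, which is not guaranteed.

```python
from typing import Callable, List

SEP = " <T> "  # separator from the post; it is not in the model's vocabulary


def pack(texts: List[str], max_len: int,
         length: Callable[[str], int] = len) -> List[str]:
    """Greedily join consecutive texts with SEP while the result fits max_len.

    `length` defaults to character count (an assumption for illustration);
    pass a function that counts tokens to respect the model's real limit.
    A single text longer than max_len is kept as its own group.
    """
    groups: List[str] = []
    current: List[str] = []
    for text in texts:
        candidate = SEP.join(current + [text])
        if current and length(candidate) > max_len:
            # Adding this text would exceed the limit: close the group.
            groups.append(SEP.join(current))
            current = [text]
        else:
            current.append(text)
    if current:
        groups.append(SEP.join(current))
    return groups


def unpack(translated: List[str]) -> List[str]:
    """Split translated groups back into elements on the <T> marker.

    Only works if the separator survived translation intact.
    """
    parts: List[str] = []
    for chunk in translated:
        parts.extend(piece.strip() for piece in chunk.split(SEP.strip()))
    return parts
```

Each string returned by `pack` would be sent to the model as one input, and `unpack` would be applied to the model outputs in the same order.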

Well, this doesn’t work particularly well for the shorter elements; some of the translations are terrible. I’m aware that the <T> symbol isn’t in the vocabulary. I tried adding new tokens to the vocabulary, but that requires retraining the model, which I am not able to do.

Has anyone run into this problem, or does anyone have ideas on how to solve it? Maybe I’ve been doing something wrong when adding special symbols.
Thanks
