The translator model produces a translation of only part of the text

Translator model stops in the middle of the text

I’m trying to train a translation model: I take a ready-made (fairly weak) model and fine-tune it on other datasets, which I combine into a single object with interleave_datasets().

The translation quality has improved, but a worse problem has appeared: the model stops translating before the end of the text. It always generates exactly 65 tokens, so it usually breaks off right in the middle of a word.

I tried translating via pipeline() and also feeding tokens manually:

inputs = tokenizer(text, return_tensors='pt')
translated_tokens = model.generate(**inputs)      # !!! always stops at exactly 65 tokens
translation = tokenizer.decode(translated_tokens[0], skip_special_tokens=True)
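A cutoff at a fixed token count usually points to a max_length baked into the checkpoint’s generation config, which generate() uses when no limit is passed per call. A sketch of how to inspect and override it, using a tiny randomly initialized Marian model as a stand-in for the real fine-tuned checkpoint (which would be loaded with MarianMTModel.from_pretrained(path)):

```python
from transformers import MarianConfig, MarianMTModel

# Tiny random config standing in for the real checkpoint (illustration only).
config = MarianConfig(
    vocab_size=100, d_model=16,
    encoder_layers=1, decoder_layers=1,
    encoder_attention_heads=2, decoder_attention_heads=2,
    encoder_ffn_dim=32, decoder_ffn_dim=32,
    max_position_embeddings=64,
    pad_token_id=0, eos_token_id=1, decoder_start_token_id=0,
)
model = MarianMTModel(config)

# generate() stops at generation_config.max_length unless overridden per call.
print(model.generation_config.max_length)

# Raise the limit persistently instead of passing it on every generate() call.
model.generation_config.max_length = 512
```

If the saved config reports a small max_length, raising it (or passing max_new_tokens per call) lifts the cap for every subsequent generation.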

I took the “Helsinki-NLP/opus” models and fine-tuned them on the datasets I could find for each required language pair; after interleave_datasets() I get a dataset of ~1–2 million rows.

Here are my training parameters (I tried to focus on quality and not speed):

args = Seq2SeqTrainingArguments(
    path_dir,
    evaluation_strategy="steps",
    eval_steps=3000,
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    per_gpu_train_batch_size=64,   # deprecated; conflicts with per_device_train_batch_size above
    per_gpu_eval_batch_size=64,    # deprecated; per_device_* is what is actually used
    weight_decay=0.01,
    save_total_limit=4,
    num_train_epochs=10,
    predict_with_generate=True,
    push_to_hub=False,
    warmup_steps=30000,
    adafactor=True,
)
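A note related to the cutoff: with predict_with_generate=True, the generation length used during evaluation is controlled by Seq2SeqTrainingArguments’ generation_max_length, and falls back to the model’s generation config when unset. A hedged sketch of the relevant addition (512 is an assumed value; pick one that fits your longest targets):

```python
args = Seq2SeqTrainingArguments(
    path_dir,
    # ... parameters as above ...
    predict_with_generate=True,
    # Cap used by generate() during evaluation; if left unset, the model's
    # own generation config decides how long eval translations can get.
    generation_max_length=512,
)
```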

Try increasing the “max_length” and “max_new_tokens” parameters of the model’s generation config. You can pass these values as kwargs to the pipeline:

translator = pipeline("translation", model=model, tokenizer=tokenizer)
translator(text, max_new_tokens=150)  # or max_length=150

Reference: Generation


Sorry for replying after so long.

Unfortunately, this didn’t help. Changing max_length and max_new_tokens can only make the translation shorter; even when I set them to 500 or 1000, generation still stops in the same place.

However, when I set max_length=10, a warning did appear in the console saying that my input was longer than 10 and that I should increase max_length. With larger values no such warning appears, even though the output is still cut off in exactly the same way.