Fine-tuned translation model always stops mid-text after exactly 65 tokens
I'm trying to train a translator model: I take a ready-made (fairly weak) model and fine-tune it on other datasets, which I combine into a single object with interleave_datasets().
The translation quality has improved, but a worse problem has appeared: the model stops translating partway through the text. It always generates exactly 65 tokens, so it usually breaks off right in the middle of a word.
I tried both translating via pipeline() and feeding tokens manually:
from transformers import MarianMTModel, MarianTokenizer

tokenizer = MarianTokenizer.from_pretrained(checkpoint_dir)  # path to the fine-tuned checkpoint
model = MarianMTModel.from_pretrained(checkpoint_dir)
inputs = tokenizer(text, return_tensors='pt')
translated_tokens = model.generate(**inputs)  # !!! here it always generates exactly 65 tokens
translation = tokenizer.decode(translated_tokens[0], skip_special_tokens=True)
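For reference, the generation length limit can also live in the model config; this is a standard transformers check (model.generation_config requires a reasonably recent transformers version):

# Where a hard generation limit could come from
print(model.config.max_length)             # model-level default (typically 512 for Marian/OPUS models)
print(model.generation_config.max_length)  # the length budget generate() actually uses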
I took the “Helsinki-NLP/opus” models and fine-tuned them on whatever datasets I could find for each required language pair; after interleave_datasets() I end up with a dataset of roughly 1–2 million rows.
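Roughly how I build the combined dataset (a simplified sketch; the corpus names below are placeholders, not the actual datasets):

from datasets import load_dataset, interleave_datasets

# Placeholder corpora for one language pair
ds_a = load_dataset("placeholder/parallel_corpus_a", split="train")
ds_b = load_dataset("placeholder/parallel_corpus_b", split="train")

# Alternate between the sources; by default it stops once the smallest one is exhausted
dataset = interleave_datasets([ds_a, ds_b], probabilities=[0.5, 0.5], seed=42)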
Here are my training arguments (I tried to optimize for quality rather than speed):
args = Seq2SeqTrainingArguments(
    output_dir=path_dir,
    evaluation_strategy="steps",
    eval_steps=3000,
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    weight_decay=0.01,
    save_total_limit=4,
    num_train_epochs=10,
    predict_with_generate=True,
    push_to_hub=False,
    warmup_steps=30000,
    adafactor=True,
)
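The trainer itself is wired up in the standard way (a sketch; tokenized_train / tokenized_eval stand in for my preprocessed splits):

from transformers import DataCollatorForSeq2Seq, Seq2SeqTrainer

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)  # pads inputs and labels per batch

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_eval,
    data_collator=data_collator,
    tokenizer=tokenizer,
)
trainer.train()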