Using t5-small to fine-tune an English↔German bidirectional translation model

I am using t5-small to train a model for English↔German bidirectional translation, but I have run into some problems.
I train this model on the WMT14 dataset, but the training loss drops very low and I don't know why.

    import random

    def preprocess_data(examples):
        inputs = []
        targets = []

        for translation in examples['translation']:
            en_text = translation['en']
            de_text = translation['de']

            # randomly pick a direction for each pair so the model sees both
            if random.random() < 0.5:
                inputs.append(f"translate English to German: {en_text}")
                targets.append(de_text)
            else:
                inputs.append(f"translate German to English: {de_text}")
                targets.append(en_text)

        return {'input_text': inputs, 'target_text': targets}

    def tokenize_data(examples):
        # tokenize the prefixed source sentences
        model_inputs = tokenizer(examples['input_text'], max_length=128,
                                 truncation=True, padding='max_length', return_tensors='pt')

        # tokenize the target sentences as labels
        with tokenizer.as_target_tokenizer():
            labels = tokenizer(examples['target_text'], max_length=128,
                               truncation=True, padding='max_length', return_tensors='pt')

        model_inputs['labels'] = labels['input_ids']
        return model_inputs
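
For reference, a minimal sketch of how these two functions could be applied to WMT14 with the datasets library. The de-en config, the tokenizer setup, and the train_dataset / test_dataset names (matching the trainer arguments below) are assumptions, not something shown in the original post:

    # Assumed setup: load WMT14 de-en and build the datasets the trainer expects.
    from datasets import load_dataset
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("t5-small")

    raw = load_dataset("wmt14", "de-en")  # splits: train / validation / test

    train_dataset = (
        raw["train"]
        .map(preprocess_data, batched=True, remove_columns=raw["train"].column_names)
        .map(tokenize_data, batched=True, remove_columns=["input_text", "target_text"])
    )
    test_dataset = (
        raw["test"]
        .map(preprocess_data, batched=True, remove_columns=raw["test"].column_names)
        .map(tokenize_data, batched=True, remove_columns=["input_text", "target_text"])
    )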

    import torch
    from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer

    model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(f"Using device: {device}")
    model.to(device)

    training_args = Seq2SeqTrainingArguments(
        output_dir='./results2',
        num_train_epochs=20,
        per_device_train_batch_size=128,
        per_device_eval_batch_size=128,
        save_steps=2000,
        save_total_limit=10,
        eval_strategy='steps',
        eval_steps=200,
        warmup_steps=200,
        weight_decay=0.01,
        logging_dir='./logs',
        logging_steps=10,
        predict_with_generate=True,
    )

    trainer = Seq2SeqTrainer(
        model=model,
        args=training_args,
        tokenizer=tokenizer,
        train_dataset=train_dataset,
        eval_dataset=test_dataset,
    )
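
One thing the setup above does not show: with predict_with_generate=True, the trainer only reports BLEU during evaluation if a compute_metrics hook is passed in. A minimal sketch, assuming the evaluate library with the sacrebleu metric is installed (this hook is not part of the original code):

    # Hypothetical compute_metrics hook; pass compute_metrics=compute_metrics
    # to Seq2SeqTrainer to get a BLEU score at each evaluation step.
    import numpy as np
    import evaluate

    bleu = evaluate.load("sacrebleu")

    def compute_metrics(eval_preds):
        preds, labels = eval_preds
        if isinstance(preds, tuple):
            preds = preds[0]
        # replace -100 (ignored label positions, if used) before decoding
        labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
        decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
        decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
        result = bleu.compute(predictions=decoded_preds,
                              references=[[label] for label in decoded_labels])
        return {"bleu": result["score"]}

Note that sacrebleu reports scores on a 0–100 scale.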


According to Hugging Chat,


When dealing with a low training loss in a sequence-to-sequence model, especially for a bidirectional translation task like English to German using the T5-small model and the WMT14 dataset, there are several potential areas to investigate and adjust. Here are some suggestions based on the information provided and the sources referenced:

  1. Batch Size and Gradient Accumulation:
  • Your current batch size of 128 per device is quite large. If your GPU memory is not sufficient, this can cause problems during training. Consider reducing the batch size and increasing the gradient accumulation steps to compensate [1].
  • For example, you could try per_device_train_batch_size=16 with gradient_accumulation_steps=8 (see the sketch after this list).
  2. Optimizer and Learning Rate:
  • The default learning rate might not be optimal for your specific task. Experiment with different learning rates and watch how the training loss responds [2]; an example value is included in the sketch after this list.
  • Additionally, consider an optimizer such as 8-bit Adam (from the bitsandbytes library), which improves memory efficiency and can help convergence [2].
  3. Data Preprocessing:
  • Ensure that your preprocessing steps are correct and that the data reaches the model in the expected format. The random direction sampling in your preprocess_data function looks fine, but double-check that the tokenizer handles both the English and German texts correctly, and that padding tokens in the labels are not counted in the loss (see the note in the sketch after this list).
  4. Model Configuration:
  • A small model like T5-small may not have enough capacity to handle bidirectional translation well. Consider a larger model if possible.
  • Ensure that the tokenizer covers both languages. You might need a multilingual tokenizer or separate handling for each language.
  5. Training Strategy:
  • Make sure your evaluation setup is correct. eval_strategy='steps' is fine, but ensure that eval_dataset is properly defined and representative of the task.
  • Monitor the evaluation metrics closely to see whether the model improves on the validation set even while the training loss is low.
  6. Debugging:
  • Print a few samples from the dataset and the tokenized inputs to confirm the data is being processed correctly.
  • Check the training logs for warnings or errors that might point to problems with the data or the model configuration.
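
A rough sketch of points 1–3, with hypothetical values (batch size 16, accumulation steps 8, learning rate 3e-4) meant as illustrative starting points rather than tuned settings:

    from transformers import Seq2SeqTrainingArguments, DataCollatorForSeq2Seq

    # Points 1 and 2: smaller per-device batch with gradient accumulation and an
    # explicit learning rate. 16 * 8 keeps the effective batch size at 128.
    training_args = Seq2SeqTrainingArguments(
        output_dir='./results2',
        num_train_epochs=20,
        per_device_train_batch_size=16,
        gradient_accumulation_steps=8,
        per_device_eval_batch_size=16,
        learning_rate=3e-4,   # illustrative; values around 1e-4 to 3e-4 are common starting points for T5
        warmup_steps=200,
        weight_decay=0.01,
        eval_strategy='steps',
        eval_steps=200,
        save_steps=2000,
        save_total_limit=10,
        logging_dir='./logs',
        logging_steps=10,
        predict_with_generate=True,
    )

    # Point 3: if the labels are padded with real pad tokens, the loss is computed
    # on the padding too, which can make the training loss drop quickly without the
    # translations improving. Dropping padding='max_length' in tokenize_data and
    # letting DataCollatorForSeq2Seq pad dynamically fills the label padding with
    # -100, which the loss ignores.
    data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, label_pad_token_id=-100)
    # ...then pass data_collator=data_collator to Seq2SeqTrainer.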

By adjusting these parameters and strategies, you should be able to diagnose and potentially resolve the issue with the low training loss. Remember that training sequence-to-sequence models can be complex and may require some experimentation to find the right configuration [4].

I use an H100 80GB to train this model; setting the batch size to 128 shortens the training time, otherwise I would have to spend over 12 hours to train 5 epochs. I don't know how to use multi-GPU training, so I always use only one GPU. After training 10 epochs the test BLEU is 0.29. I want to make it higher, but more epochs don't help, so I would like to know what other tips can improve the BLEU.


I wonder if there are any problems specific to T5…? :thinking:
Or maybe the learning rate is too low.
Or maybe something is wrong and the model is not actually learning anything. (I saw a case like that a few days ago, but I can't find it…)
There was also a post that mentioned BLEU.