T5-small performance degradation with larger dataset: seeking advice

Hello everyone,

I’m new to machine learning and this is my first post here. I’m working on a small project that uses google/t5-small for address correction (separating irrelevant information from the address itself). I found a training configuration that worked well with my initial dataset of about 1600 records. The dataset has since grown to about 2400 records, but with the same configuration, the model’s address-correction performance after training on the larger dataset is worse than before.

Here’s the current training configuration for the model:

from datetime import datetime

from transformers import Trainer, TrainingArguments

epochs = 30
batch_size = 10
current_timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
run_name = f"b_{batch_size}_e_{epochs}_{current_timestamp}"

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=epochs,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    warmup_steps=1000,
    weight_decay=0.005,
    logging_dir=f'logs/{run_name}',
    logging_steps=50,
    eval_strategy="steps",          # evaluate every eval_steps steps
    eval_steps=50,
    report_to="tensorboard",
    run_name=run_name,
    load_best_model_at_end=True,    # restore the checkpoint with the lowest eval_loss
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)
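
For context, here is a rough back-of-envelope calculation of how the training schedule shifts between the two dataset sizes with this configuration (the numbers are approximate, since I’m ignoring the validation split):

# Rough estimate of how the schedule changes as the dataset grows,
# using the batch_size, epochs, and warmup_steps from the config above.
batch_size = 10
epochs = 30
warmup_steps = 1000

for n_records in (1600, 2400):
    steps_per_epoch = n_records // batch_size
    total_steps = steps_per_epoch * epochs
    warmup_fraction = warmup_steps / total_steps
    print(f"{n_records} records: {steps_per_epoch} steps/epoch, "
          f"{total_steps} total steps, warmup covers {warmup_fraction:.1%}")

So with 1600 records the fixed 1000 warmup steps cover roughly 21% of training, while with 2400 records they cover roughly 14%, and the model also sees many more optimizer steps overall. I’m not sure how much this matters in practice.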

What changes could I try to improve the model’s performance with this larger dataset?

Thanks for any suggestions.