Validation vs. Test with the Transformers Trainer

Hi, newbie here. I’m fine-tuning RoBERTa-base with the code attached below and have some questions:

  1. Have I understood correctly that the training process used here will contaminate the dataset used for evaluation? Or could the validation data here be considered test data, so that I could simply do an 80/20 split? I’ve read that I need a validation set when doing hyperparameter tuning; is such tuning done behind the scenes (when calculating the loss)?
  2. When I run trainer.evaluate, will it automatically use the evaluation dataset? For final testing, should I specify the last part of the dataset, in this case split='train[90%:]'?

A lot of tutorials call the evaluation dataset “test data”, which confused me a bit, and few tutorials go through the process of first validating, then testing.

import datasets
from transformers import TrainingArguments, Trainer

# slice the same shuffled CSV: first 80% for training, next 10% for validation
# (the text column still needs to be tokenized, e.g. with .map, before training)
train_data = datasets.load_dataset('csv', data_files = 'datasets/all_shuffled.csv', split='train[:80%]')
vali_data = datasets.load_dataset('csv', data_files = 'datasets/all_shuffled.csv', split='train[80%:90%]')

training_args = TrainingArguments(
    output_dir = 'roberta',
    num_train_epochs=4,
    per_device_train_batch_size = 4,
    gradient_accumulation_steps = 16,    
    per_device_eval_batch_size= 8,
    evaluation_strategy = 'no',
    save_strategy = 'no',
    disable_tqdm = False,
    # with both strategies set to 'no', no checkpoints are saved during training,
    # so load_best_model_at_end has nothing to restore here
    load_best_model_at_end=True,
    warmup_steps=500,
    weight_decay=0.01,
    logging_steps = 8,
    fp16 = False,
    logging_dir='roberta/logs',
    dataloader_num_workers = 8,
    run_name = 'roberta-classification'
)

trainer = Trainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_data,
    eval_dataset=vali_data
)

trainer.train()
trainer.evaluate()  # with no argument, this reuses eval_dataset (the validation split)

Hello,
Just like you, all of these questions came to my mind when I was trying to fine-tune my first model. I was using the test dataset as a validation dataset, but that is wrong: you have to split the data into three datasets (train, validation, test) and make sure the test dataset stays unseen during training and tuning, to guarantee there is no data leak.

So, let’s make sure that the validation dataset is not the same as the test dataset.

The evaluate method takes the tokenized test dataset as a parameter, so you can run it once at the very end on data the model has never seen. I tried this with BERT, but it should work the same way for your model.
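To make that concrete, here is a minimal sketch building on the code in your post. The train[90%:] slice and the assumption that the test rows are tokenized the same way as the others are mine; the point is just that the test split is only touched once, at the very end.

# held-out 10% that the Trainer never sees during training or tuning
test_data = datasets.load_dataset('csv', data_files = 'datasets/all_shuffled.csv', split='train[90%:]')

# trainer is the Trainer instance from your post (same tokenization assumed for all splits)
val_metrics = trainer.evaluate()            # no argument: reuses eval_dataset (validation split)
test_metrics = trainer.evaluate(test_data)  # final, one-time test score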

I hope that answers your questions.


Thanks for your input, this makes sense!