Technical clarification on the validation data vs. the training data in the Trainer API

Hi,

I am using the Trainer API to fine-tune my models, but I realized I wanted to clarify something about how the training and evaluation datasets are used, as they appear in:

from transformers import Trainer

trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['test'],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
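
For reference, I am assuming an args along these lines (the values are purely illustrative, and the parameter may be called eval_strategy rather than evaluation_strategy in newer transformers versions):

from transformers import TrainingArguments

# Illustrative values only; evaluation_strategy="epoch" is what I believe
# triggers an evaluation pass over eval_dataset at the end of each epoch.
args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    num_train_epochs=3,
    per_device_train_batch_size=16,
)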

My understanding is the following:

For each batch of data from the training set:

  1. The loss is computed ONLY for that batch. Then gradient descent (or another optimization algorithm) tweaks the current parameters to make the loss smaller at the next iteration (batch).
  2. The loop moves on to the next batch of the training data.
  3. At the end of the epoch, the current model (with the weights updated by steps 1 and 2 over the whole epoch) is applied to the full eval_dataset, predictions are computed, and metrics (say “accuracy” or “precision”) are printed to the console (see the sketch after the next paragraph).

In other words, the eval_dataset is NEVER used for training. Its only purpose is to provide (at the cost of “consuming” some of the data) a rough measure of the out-of-sample error rate. Training stops only when the specified number of epochs has been completed.
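
To make my mental model concrete, here is a minimal sketch in plain PyTorch of what I believe is happening (this is NOT the actual Trainer internals; model, optimizer, train_loader, eval_loader, and num_epochs are assumed to be defined elsewhere):

import torch

for epoch in range(num_epochs):
    # Training: gradients and weight updates come from train batches only.
    model.train()
    for batch in train_loader:
        outputs = model(**batch)
        outputs.loss.backward()   # loss and gradients from THIS batch only
        optimizer.step()          # tweak parameters to reduce the loss
        optimizer.zero_grad()

    # Evaluation: forward passes only, no gradients, no weight updates.
    model.eval()
    with torch.no_grad():
        for batch in eval_loader:
            outputs = model(**batch)  # predictions used only for metrics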

Is that 100% correct?
Thanks!

Of course, I know that usually the validation data is not used for training. I just want to be sure that this is the case here as well. Using the Trainer API is a bit more opaque than using my own splits…
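
For comparison, this is what I mean by my own splits; a sketch assuming tokenized_datasets['train'] is a datasets.Dataset (the test_size and seed values are arbitrary):

# Make the train/validation split explicit before handing anything to Trainer.
split = tokenized_datasets['train'].train_test_split(test_size=0.1, seed=42)

trainer = Trainer(
    model,
    args,
    train_dataset=split['train'],   # used for gradient updates
    eval_dataset=split['test'],     # used only for metrics
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)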

Any clarification would be greatly welcome! Thanks