Trainer epoch does not go through all training data?

Hello

I’m training a model with the transformers Trainer, but when I set the number of epochs to e.g. 1000, the training seems to do only 1000 steps, whereas an epoch is normally one full pass of the model over the entire dataset. How can I use the Trainer so that each epoch goes through the full training dataset (and so that I can see the progression of these epochs)?

Thanks!


Hi there!

Please post the command/code you are executing as we can’t really help without that.

Sure, sorry! I thought it was not a code-specific question but rather one about the parameters of the Trainer class :slight_smile:

Here is the code I use:

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir='./out',
    num_train_epochs=1000,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=training_data,
    eval_dataset=validation_data,
)

trainer.train()

I would expect 1000 epochs, each going through the full train_dataset, but the output runs very fast and prints “epoch 1/1000, epoch 2/1000, epoch 3/1000, …”, which gives the impression that an epoch is simply a single training step rather than an actual epoch. With 20k training instances and a batch size of 2, I would expect roughly 10k steps per epoch, which normally takes some time.

I wanted to check that your num_train_epochs wasn’t being overridden by another parameter like max_steps. The code looks correct, and the logs do indicate you are going through the epochs. Double-check the length of your dataset to make sure it hasn’t been reduced to something small.
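
A quick way to see what the Trainer will actually iterate over is to print a few numbers right before trainer.train(). This is just a sketch using the training_data and training_args objects from your snippet (with gradient accumulation or multiple GPUs the per-epoch step count would be smaller):

# Size of the dataset as the Trainer sees it
print(len(training_data))   # should be around 20000

# Expected optimization steps per epoch
print(len(training_data) // training_args.per_device_train_batch_size)   # ~10000 with batch size 2

# max_steps defaults to -1; any positive value overrides num_train_epochs
print(training_args.max_steps)

If the first print already gives a small number, the problem is in how training_data is built rather than in the TrainingArguments.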

My dataset seems to have the right size. Looking at the Wandb logs, it does indeed seem that only 1 step is performed in each epoch… I don’t know what could be wrong.
My code basically follows this post.
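
As far as I understand, the dataset in that post is a plain torch Dataset and mine follows the same pattern. Roughly like this (a simplified sketch, not my exact code; the names are placeholders):

import torch

class ReviewsDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings   # dict of tensors from the tokenizer
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        # the Trainer uses this to size an epoch, so it must be the number
        # of examples, not e.g. the number of keys in the encodings dict
        return len(self.labels)

If __len__ ended up returning something tiny, I suppose that would explain the single step per epoch, so I will double-check that as well.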
