Hello,
I am fine-tuning BERT for a token classification task, and I've decided to use the HF Trainer to facilitate the process. I've created training and testing datasets, a data collator, training arguments, and a compute_metrics function.
The progress bar shows up at the beginning of training and for the first evaluation, but then it stops updating. Information about further evaluations is still printed and training finishes normally; I also have access to all statistics via trainer.state.log_history after training.
I would like to see all of the information from the progress bar, and also the summary table: it shows up, but no new rows are appended after the first evaluation.
These are my TrainingArguments and some of the information printed by the Trainer:
args = TrainingArguments(
    output_dir="/dir",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    logging_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=10,
    weight_decay=0.01,
    push_to_hub=False,
    report_to="all",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_data,
    eval_dataset=test_data,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

trainer.train()
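In the meantime, I can at least capture every logged metrics dict with a small callback. This is only a minimal sketch: in practice it would subclass transformers.TrainerCallback and be passed via Trainer(callbacks=[...]); it is written as a plain class here so it runs standalone, with the on_log signature mirroring the real hook.

```python
# Minimal sketch of a logging callback. In a real run this would subclass
# transformers.TrainerCallback and be passed to Trainer(callbacks=[...]);
# it is a plain class here so the sketch runs without transformers installed.

class PrintLogsCallback:
    """Collects every metrics dict the Trainer logs, so the numbers
    survive even if the tqdm progress bar stops updating."""

    def __init__(self):
        self.history = []

    # Signature mirrors TrainerCallback.on_log(args, state, control, logs=None, **kwargs)
    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs is not None:
            self.history.append(dict(logs))
            print(logs)

# Standalone demonstration with dummy arguments:
cb = PrintLogsCallback()
cb.on_log(None, None, None, logs={"eval_loss": 0.309, "epoch": 1.0})
```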
Information printed by the Trainer during training:
***** Running training *****
Num examples = 1300
Num Epochs = 10
Instantaneous batch size per device = 8
Total train batch size (w. parallel, distributed & accumulation) = 64
Gradient Accumulation steps = 1
Total optimization steps = 210
Number of trainable parameters = 94111492
***** Running Evaluation *****
Num examples = 450
Batch size = 64
Saving model checkpoint to /dir
Configuration saved in /dir/config.json
Model weights saved in /dir/pytorch_model.bin
***** Running Evaluation *****
Num examples = 450
Batch size = 64
Saving model checkpoint to /dir
Configuration saved in /dir/config.json
Model weights saved in /dir/pytorch_model.bin
***** Running Evaluation *****
Num examples = 450
Batch size = 64
Saving model checkpoint to /dir
Configuration saved in /dir/config.json
Model weights saved in /dir/pytorch_model.bin
As you can see, evaluation is still being executed, but the progress bar stops updating after the first evaluation epoch: it reaches 6/8 during the first evaluation and stays there.
Here is the output of trainer.state.log_history:
[{'loss': 0.0798, 'learning_rate': 1.8e-05, 'epoch': 1.0, 'step': 21},
{'eval_loss': 0.30890047550201416,
'eval_macro_precision': 0.91,
'eval_macro_recall': 0.85,
'eval_macro_f1': 0.88,
'eval_runtime': 4.4862,
'eval_samples_per_second': 100.307,
'eval_steps_per_second': 1.783,
'epoch': 1.0,
'step': 21},
{'loss': 0.0262,
'learning_rate': 1.6000000000000003e-05,
'epoch': 2.0,
'step': 42},
{'eval_loss': 0.3189994990825653,
'eval_macro_precision': 0.91,
'eval_macro_recall': 0.87,
'eval_macro_f1': 0.89,
'eval_runtime': 4.6567,
'eval_samples_per_second': 96.634,
'eval_steps_per_second': 1.718,
'epoch': 2.0,
'step': 42},
{'loss': 0.0161, 'learning_rate': 1.4e-05, 'epoch': 3.0, 'step': 63},
{'eval_loss': 0.3352892994880676,
...
'train_steps_per_second': 1.044,
'total_flos': 849229814784000.0,
'train_loss': 0.01752198324317024,
'epoch': 10.0,
'step': 210}]
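For completeness, the per-epoch rows that the missing summary table would contain can be rebuilt from log_history directly. A plain-Python sketch (the input below is a truncated copy of the entries above; training entries carry "loss", evaluation entries carry "eval_*" keys):

```python
# Rebuild a per-epoch summary table from trainer.state.log_history.
# Values are copied (truncated) from the run above for illustration.
log_history = [
    {"loss": 0.0798, "learning_rate": 1.8e-05, "epoch": 1.0, "step": 21},
    {"eval_loss": 0.3089, "eval_macro_f1": 0.88, "epoch": 1.0, "step": 21},
    {"loss": 0.0262, "learning_rate": 1.6e-05, "epoch": 2.0, "step": 42},
    {"eval_loss": 0.3190, "eval_macro_f1": 0.89, "epoch": 2.0, "step": 42},
]

# Merge the training and evaluation entries that belong to the same epoch.
rows = {}
for entry in log_history:
    rows.setdefault(entry["epoch"], {}).update(entry)

header = ["epoch", "loss", "eval_loss", "eval_macro_f1"]
print("  ".join(f"{h:>14}" for h in header))
for epoch in sorted(rows):
    row = rows[epoch]
    print("  ".join(f"{row.get(h, float('nan')):>14.4f}" for h in header))
```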