Hello,
I am fine-tuning BERT for a token classification task, and I've decided to use the HF Trainer to facilitate the process. I've created training and testing datasets, a data collator, training arguments, and a compute_metrics function.
The progress bar shows up at the beginning of training and for the first evaluation, but then it stops updating. Information about further evaluations is still printed and training finishes normally; I also have access to all statistics via trainer.state.log_history after training.
I would like to see all of the information from the progress bar, and also the summary table: it shows up, but no new rows are appended after the first evaluation.
These are my TrainingArguments and some of the information printed by the Trainer:
args = TrainingArguments(
    output_dir="/dir",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    logging_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=10,
    weight_decay=0.01,
    push_to_hub=False,
    report_to="all",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_data,
    eval_dataset=test_data,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

trainer.train()
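In the meantime, I can at least capture every logged metrics dict with a small callback. This is only a minimal sketch: in practice it would subclass transformers.TrainerCallback and be passed via Trainer(callbacks=[...]); it is written as a plain class here so it runs standalone, with the on_log signature mirroring the real hook.

```python
# Minimal sketch of a logging callback. In a real run this would subclass
# transformers.TrainerCallback and be passed to Trainer(callbacks=[...]);
# it is a plain class here so the sketch runs without transformers installed.

class PrintLogsCallback:
    """Collects every metrics dict the Trainer logs, so the numbers
    survive even if the tqdm progress bar stops updating."""

    def __init__(self):
        self.history = []

    # Signature mirrors TrainerCallback.on_log(args, state, control, logs=None, **kwargs)
    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs is not None:
            self.history.append(dict(logs))
            print(logs)

# Standalone demonstration with dummy arguments:
cb = PrintLogsCallback()
cb.on_log(None, None, None, logs={"eval_loss": 0.309, "epoch": 1.0})
```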
Information printed by the Trainer during training:
***** Running training *****
Num examples = 1300
Num Epochs = 10
Instantaneous batch size per device = 8
Total train batch size (w. parallel, distributed & accumulation) = 64
Gradient Accumulation steps = 1
Total optimization steps = 210
Number of trainable parameters = 94111492
***** Running Evaluation *****
Num examples = 450
Batch size = 64
Saving model checkpoint to /dir
Configuration saved in /dir/config.json
Model weights saved in /dir/pytorch_model.bin
***** Running Evaluation *****
Num examples = 450
Batch size = 64
Saving model checkpoint to /dir
Configuration saved in /dir/config.json
Model weights saved in /dir/pytorch_model.bin
***** Running Evaluation *****
Num examples = 450
Batch size = 64
Saving model checkpoint to /dir
Configuration saved in /dir/config.json
Model weights saved in /dir/pytorch_model.bin
As you can see, evaluation is still being executed, but the progress bar stops updating after the first evaluation epoch: it reaches 6/8 during the first evaluation and stays there.
Here is the output of trainer.state.log_history:
[{'loss': 0.0798, 'learning_rate': 1.8e-05, 'epoch': 1.0, 'step': 21},
{'eval_loss': 0.30890047550201416,
'eval_macro_precision': 0.91,
'eval_macro_recall': 0.85,
'eval_macro_f1': 0.88,
'eval_runtime': 4.4862,
'eval_samples_per_second': 100.307,
'eval_steps_per_second': 1.783,
'epoch': 1.0,
'step': 21},
{'loss': 0.0262,
'learning_rate': 1.6000000000000003e-05,
'epoch': 2.0,
'step': 42},
{'eval_loss': 0.3189994990825653,
'eval_macro_precision': 0.91,
'eval_macro_recall': 0.87,
'eval_macro_f1': 0.89,
'eval_runtime': 4.6567,
'eval_samples_per_second': 96.634,
'eval_steps_per_second': 1.718,
'epoch': 2.0,
'step': 42},
{'loss': 0.0161, 'learning_rate': 1.4e-05, 'epoch': 3.0, 'step': 63},
{'eval_loss': 0.3352892994880676,
...
'train_steps_per_second': 1.044,
'total_flos': 849229814784000.0,
'train_loss': 0.01752198324317024,
'epoch': 10.0,
'step': 210}]
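For completeness, the per-epoch rows that the missing summary table would contain can be rebuilt from log_history directly. A plain-Python sketch (the input below is a truncated copy of the entries above; training entries carry "loss", evaluation entries carry "eval_*" keys):

```python
# Rebuild a per-epoch summary table from trainer.state.log_history.
# Values are copied (truncated) from the run above for illustration.
log_history = [
    {"loss": 0.0798, "learning_rate": 1.8e-05, "epoch": 1.0, "step": 21},
    {"eval_loss": 0.3089, "eval_macro_f1": 0.88, "epoch": 1.0, "step": 21},
    {"loss": 0.0262, "learning_rate": 1.6e-05, "epoch": 2.0, "step": 42},
    {"eval_loss": 0.3190, "eval_macro_f1": 0.89, "epoch": 2.0, "step": 42},
]

# Merge the training and evaluation entries that belong to the same epoch.
rows = {}
for entry in log_history:
    rows.setdefault(entry["epoch"], {}).update(entry)

header = ["epoch", "loss", "eval_loss", "eval_macro_f1"]
print("  ".join(f"{h:>14}" for h in header))
for epoch in sorted(rows):
    row = rows[epoch]
    print("  ".join(f"{row.get(h, float('nan')):>14.4f}" for h in header))
```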