Good evening.
I'm a beginner and Trainer shows a behavior I don't fully understand: after a fixed number of steps it gets stuck for a long time and training does not proceed.
Here are the code and the output at the point where it gets stuck.
code snippet:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir=f"{'dataset.coco'.replace(' ', '-')}-finetune",
    num_train_epochs=20,
    max_grad_norm=0.1,
    learning_rate=5e-5,
    # warmup_steps=300,
    per_device_train_batch_size=1,
    # gradient_accumulation_steps=4,
    dataloader_num_workers=0,
    metric_for_best_model="eval_map",
    greater_is_better=True,
    load_best_model_at_end=True,
    eval_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=2,
    remove_unused_columns=False,
    eval_do_concat_batches=False,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=pytorch_dataset_train,
    eval_dataset=pytorch_dataset_valid,
    tokenizer=processor,
    data_collator=collate_fn,
    compute_metrics=eval_compute_metrics_fn,
)

trainer.train()
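For reference, collate_fn and eval_compute_metrics_fn are defined elsewhere; the collate function is essentially the usual DETR-style one for object detection. A minimal sketch (not necessarily identical to the actual one, assuming each dataset item provides "pixel_values" and a "labels" dict) would be:

import torch

def collate_fn(batch):
    # Hypothetical sketch: stack image tensors into one batch tensor and keep
    # the labels as a list of per-image annotation dicts, which is the input
    # format DETR-style detection models expect.
    pixel_values = torch.stack([item["pixel_values"] for item in batch])
    labels = [item["labels"] for item in batch]
    return {"pixel_values": pixel_values, "labels": labels}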
output:
  5%|▌         | 500/10320 [02:01<46:19,  3.53it/s]{'loss': 56.0595, 'grad_norm': 163.28176879882812, 'learning_rate': 4.757751937984497e-05, 'epoch': 0.97}
  5%|▌         | 516/10320 [02:06<45:03,  3.63it/s]
  0%|          | 0/19 [00:00<?, ?it/s]
 11%|█         | 2/19 [00:00<00:03,  4.37it/s]
 16%|█▌        | 3/19 [00:00<00:05,  3.12it/s]
 21%|██        | 4/19 [00:01<00:05,  2.73it/s]
 26%|██▋       | 5/19 [00:01<00:05,  2.53it/s]
 32%|███▏      | 6/19 [00:02<00:05,  2.41it/s]
 37%|███▋      | 7/19 [00:02<00:05,  2.35it/s]
 42%|████▏     | 8/19 [00:03<00:04,  2.32it/s]
 47%|████▋     | 9/19 [00:03<00:04,  2.29it/s]
 53%|█████▎    | 10/19 [00:04<00:03,  2.28it/s]
 58%|█████▊    | 11/19 [00:04<00:03,  2.27it/s]
 63%|██████▎   | 12/19 [00:04<00:03,  2.25it/s]
 68%|██████▊   | 13/19 [00:05<00:02,  2.24it/s]
 74%|███████▎  | 14/19 [00:05<00:02,  2.24it/s]
 79%|███████▉  | 15/19 [00:06<00:01,  2.24it/s]
 84%|████████▍ | 16/19 [00:06<00:01,  2.24it/s]
 89%|████████▉ | 17/19 [00:07<00:00,  2.23it/s]
 95%|█████████▍| 18/19 [00:07<00:00,  2.36it/s]
100%|██████████| 19/19 [00:07<00:00,  2.89it/s]
Thank you all in advance.