I’m experimenting with Whisper fine-tuning and am running into unreasonably long wait times after the evaluation phase finishes and the checkpoint has been written.
During this period, I don’t see any meaningful activity in Windows Task Manager: no GPU CUDA/copy activity, CPU at ~13%, and no disk activity (only the core running the main thread shows somewhat higher usage).
In the following example, I use an augmented Common Voice language set with eval_steps=save_steps=600. I’m using Trainer.train() on GPU.
Here are the durations on an i7-8900K & RTX 3090 for those 600 steps:
Train Phase: 0:21:43
Eval Phase: 0:08:28
Extra Wait: 0:13:15
After the extra wait, training continues normally. I’m using the following settings, in case they are relevant:
per_device_train_batch_size = 64,
gradient_accumulation_steps = 1,
per_device_eval_batch_size = 16,
eval_accumulation_steps = 1,
optim = "adamw_torch",
tf32 = True,
fp16 = True,
gradient_checkpointing = True,
predict_with_generate = True,
...etc...
Where should I look? What could be the culprit?
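To narrow this down, here is a minimal sketch of how the main thread could be inspected during the wait, using only Python's standard `faulthandler` module. The helper names `watch_for_stalls` and `stop_watching` are hypothetical, and the 60-second interval is an arbitrary assumption. The idea is to dump every thread's stack periodically, so that the function names printed during the idle period show where the time goes.

```python
import faulthandler
import sys


def watch_for_stalls(interval_s=60.0, out=None):
    """Dump the stack of all threads every `interval_s` seconds.

    `out` must be a real file object (it needs a file descriptor);
    it defaults to sys.stderr. The 60 s default is an arbitrary choice.
    """
    target = sys.stderr if out is None else out
    # repeat=True re-arms the timer after each dump until it is cancelled.
    faulthandler.dump_traceback_later(interval_s, repeat=True, file=target)


def stop_watching():
    """Cancel the periodic dump started by watch_for_stalls()."""
    faulthandler.cancel_dump_traceback_later()


# Intended use around the existing training call (not executed here):
#   watch_for_stalls(60)
#   trainer.train()
#   stop_watching()
```

If the same frames keep showing up in the dumps taken during the extra wait, that call chain is probably the place to look. The dumps are written by a watchdog thread, so they should appear even while the main thread is busy.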