Long wait time between evaluate and save (checkpoint creation)

I’m experimenting with Whisper fine-tuning and am running into unreasonably long wait times between the end of the evaluation phase and the creation of the checkpoint.

During this period, I don’t see any meaningful activity in Windows Task Manager: no GPU CUDA/copy activity, CPU at ~13%, and no disk activity. Only the core running the main thread shows slightly higher usage.
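Since only the main thread's core is busy, one thing I could try is dumping the Python stack periodically during the wait, using the standard-library faulthandler module. This is a minimal sketch, not something I have run on the real job: the helper names and the 60-second interval are my own choices, and I assume the call goes right before Trainer.train().

```python
import faulthandler
import sys


def start_stack_dumps(interval_s: float = 60.0, file=sys.stderr) -> None:
    """Write the traceback of every thread to `file` every `interval_s` seconds."""
    # repeat=True re-arms the timer after each dump, so dumps keep coming
    # until cancel_dump_traceback_later() is called.
    faulthandler.dump_traceback_later(interval_s, repeat=True, file=file)


def stop_stack_dumps() -> None:
    """Cancel the periodic dumps started by start_stack_dumps()."""
    faulthandler.cancel_dump_traceback_later()
```

If the same frames show up in every dump during the wait, that should point at the code the main thread is stuck in.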

In the following example, I use an augmented Common Voice language set with eval_steps=save_steps=600. I’m using Trainer.train() on GPU.

Here are the durations on an i7-8900K and an RTX 3090 for those 600 steps:

Train Phase: 0:21:43
Eval Phase: 0:08:28
Extra Wait: 0:13:15

After the wait, the process continues normally. In case it is relevant, I’m using the following settings:

    per_device_train_batch_size = 64,
    gradient_accumulation_steps = 1,
    per_device_eval_batch_size = 16,
    eval_accumulation_steps = 1,

    optim = "adamw_torch",
    tf32 = True,
    fp16 = True,

    gradient_checkpointing = True,
    predict_with_generate = True,

...etc...

Where should I look? What can be the culprit?
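One candidate I can think of is the metric function, which the Trainer calls on the CPU once the evaluation loop has collected all predictions. That is only a guess on my part. To check it, I could wrap the function in a small timer like the sketch below. The `timed` decorator is my own helper, and the `compute_metrics` body is a placeholder, not my real metric code.

```python
import functools
import time


def timed(fn):
    """Wrap `fn` so that every call prints its wall-clock duration."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            elapsed = time.perf_counter() - start
            print(f"{fn.__name__} took {elapsed:.1f} s", flush=True)
    return wrapper


@timed
def compute_metrics(pred):
    # Placeholder body: the real function would decode the predictions
    # and compute the metric; the value returned here is a dummy.
    return {"wer": 0.0}
```

The wrapped function would then be passed to the trainer as compute_metrics=compute_metrics. If the printed duration is close to the 13 minutes of extra wait, the metric computation is where the time goes.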

Here is the CPU usage during the wait period. Besides the Python process, I have browsers and VS Code open.

image