Long wait time between evaluate and save (checkpoint creation)

I’m experimenting with Whisper fine-tuning and encounter unreasonably long wait times between the end of the evaluation phase and the creation of the checkpoint.

During this period, I don’t see any meaningful activity in Windows Task Manager: no GPU CUDA/copy work, CPU at ~13%, and no disk activity (only the core running the main thread shows slightly higher usage).

In the following example, I use an augmented Common Voice dataset for one language with eval_steps = save_steps = 600, training with Trainer.train() on the GPU.

Here are the durations on an i7-8900K & RTX 3090 for those 600 steps:

Train phase: 0:21:43
Eval phase: 0:08:28
Extra wait: 0:13:15

After the wait, the process continues normally. These are the relevant settings I’m using:

    per_device_train_batch_size = 64,
    gradient_accumulation_steps = 1,
    per_device_eval_batch_size = 16,
    eval_accumulation_steps = 1,

    optim = "adamw_torch",
    tf32 = True,
    fp16 = True,

    gradient_checkpointing = True,
    predict_with_generate = True,

...etc...
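For reference, these are passed to Seq2SeqTrainingArguments roughly like this (a minimal sketch; the output_dir and anything not listed above are placeholders, not my exact values):

    from transformers import Seq2SeqTrainingArguments

    # Sketch of the arguments listed above; output_dir is a placeholder.
    training_args = Seq2SeqTrainingArguments(
        output_dir="./whisper-finetune",
        per_device_train_batch_size=64,
        gradient_accumulation_steps=1,
        per_device_eval_batch_size=16,
        eval_accumulation_steps=1,
        optim="adamw_torch",
        tf32=True,
        fp16=True,
        gradient_checkpointing=True,
        predict_with_generate=True,
        evaluation_strategy="steps",
        eval_steps=600,
        save_steps=600,
    )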

Where should I look? What could be the culprit?

Here is the CPU usage during the wait period; besides the Python process, I have browsers and VS Code open.

[Screenshot: Task Manager CPU usage during the wait period]

I’m having the same problem. Did you solve it somehow?

Thanks a lot in advance.

No, I lived with it. But I haven’t trained anything for a while, so newer versions may be better, I hope.

When posting this, I forgot that this period is the backpropagation duration, which seems to be single-core (I think it runs on the last logical core). I wouldn’t have expected it to take that long, though.

Just a guess, without looking into details…

It did not get better, I’m afraid.
I opened another topic, hoping someone more experienced will chime in.

I solved it by implementing the training with PyTorch Lightning.

The example pushes the model to the Hub, but you can easily get a .ckpt in the classic torch format to reload.

@FDM1, wrong thread?

No, no. If you use PyTorch Lightning for training, it is faster at both training and saving the model.
The link is an example implementation for a custom dataset.
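The general shape is roughly this (a minimal sketch, not the code from the linked example; the model name, learning rate, and batch keys are assumptions, and the data collator/dataloaders are omitted):

    import pytorch_lightning as pl
    import torch
    from transformers import WhisperForConditionalGeneration

    class WhisperFineTuner(pl.LightningModule):
        # Batches are expected as dicts with "input_features" and "labels",
        # i.e. the output of a Whisper data collator (not shown here).
        def __init__(self, model_name="openai/whisper-small", lr=1e-5):
            super().__init__()
            self.save_hyperparameters()
            self.model = WhisperForConditionalGeneration.from_pretrained(model_name)
            self.lr = lr

        def training_step(self, batch, batch_idx):
            out = self.model(input_features=batch["input_features"], labels=batch["labels"])
            self.log("train_loss", out.loss)
            return out.loss

        def validation_step(self, batch, batch_idx):
            out = self.model(input_features=batch["input_features"], labels=batch["labels"])
            self.log("val_loss", out.loss)

        def configure_optimizers(self):
            return torch.optim.AdamW(self.parameters(), lr=self.lr)

    # Checkpoints are then ordinary Lightning .ckpt files:
    # trainer = pl.Trainer(precision="16-mixed", val_check_interval=600)
    # trainer.fit(WhisperFineTuner(), train_dataloader, val_dataloader)
    # model = WhisperFineTuner.load_from_checkpoint("path/to/last.ckpt")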

Oh, I’ve never used it; I’ll take a look. Thank you for sharing.

Hi @FDM1, can you provide a resource I can use to learn how to switch to it?
I am trying to fine-tune Whisper, and saving checkpoints takes 2 hours!
How do I implement this?

Thanks in advance.