I’m experimenting with Whisper fine-tuning and I’m running into unreasonably long wait times after the evaluation phase finishes and the checkpoint is generated.
During this period, I don’t see any meaningful activity in Windows Task Manager: no GPU CUDA/copy work, CPU at ~13%, and no disk activity (only the core running the main thread shows slightly higher usage).
In the following example, I use an augmented Common Voice language subset with eval_steps=save_steps=600, training on GPU via Trainer.train().
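For context, the relevant part of the setup looks roughly like this (a minimal sketch: only eval_steps=save_steps=600 and the Trainer.train() call match my run; the model name, paths, batch size, and datasets are placeholder assumptions):

```python
from transformers import (
    WhisperForConditionalGeneration,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

# Placeholder checkpoint; any Whisper size behaves the same way here.
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

train_dataset = ...  # prepared (augmented) Common Voice training split
eval_dataset = ...   # held-out evaluation split, omitted for brevity

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-cv-augmented",  # placeholder path
    per_device_train_batch_size=16,       # assumption, not my exact value
    eval_strategy="steps",                # `evaluation_strategy` on older releases
    eval_steps=600,                       # evaluate every 600 steps
    save_steps=600,                       # checkpoint right after eval
    fp16=True,                            # assumption for an RTX 3090
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
trainer.train()
```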
Here are the durations on an i7-8900K and an RTX 3090 for those 600 steps:
Train phase: 0:21:43
Eval phase: 0:08:28
Extra wait: 0:13:15
After the waiting part, the process continues. I’m using the following setup, in case it’s relevant:
No, I lived with it. But I haven’t trained anything for a while, so newer versions may be better - I hope.
When posting this, I forgot that this period is the backpropagation duration, which seems to run single-core (I think on the last logical core). I wouldn’t have expected it to take that long, though.
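If anyone wants to verify where the time actually goes, a small callback that timestamps the trainer’s hooks should narrow it down (a sketch against the transformers TrainerCallback API; the class name is my own):

```python
import time
from transformers import TrainerCallback

class PhaseTimer(TrainerCallback):
    """Print wall-clock timestamps around evaluation and checkpointing
    so the unexplained wait can be attributed to a specific phase."""

    def on_evaluate(self, args, state, control, **kwargs):
        print(f"[{time.strftime('%H:%M:%S')}] eval finished at step {state.global_step}")

    def on_save(self, args, state, control, **kwargs):
        print(f"[{time.strftime('%H:%M:%S')}] checkpoint written at step {state.global_step}")
```

Passing callbacks=[PhaseTimer()] to the trainer and comparing these stamps against the step logs should show whether the extra wait falls before, between, or after these two events.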
Hi @FDM1, can you provide a resource I can use to learn how to switch to it?
I am trying to fine-tune Whisper, and saving checkpoints takes 2 hours!
How do I implement this?
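In case it helps with the slow saves: most of a checkpoint’s size is optimizer state (Adam keeps extra per-parameter tensors, usually several times the model’s own size), so trimming what gets written can cut save time substantially. Here is a sketch of the relevant TrainingArguments flags; note that save_only_model only exists in fairly recent transformers releases, so check it against your installed version:

```python
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="./whisper-finetuned",  # placeholder path
    save_steps=600,
    save_total_limit=2,    # keep only the two newest checkpoints on disk
    save_only_model=True,  # skip optimizer/scheduler state; saves are much
                           # smaller, but you can't resume training exactly
)
```

It’s also worth checking that output_dir points at a local SSD rather than a synced or network folder; a slow target disk is a common cause of multi-hour saves.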