Long wait time between evaluate and save (checkpoint creation)

bozden · May 30, 2023, 12:06am

I’m experimenting with Whisper fine-tuning and I’m seeing unreasonably long wait times between the end of the evaluation phase and the creation of the checkpoint.

During this period, I don’t see any meaningful activity in Windows Task Manager: no GPU CUDA or copy activity, CPU at ~13%, and no disk activity (only the core where the main thread is running shows somewhat higher usage).

In the following example, I use an augmented Common Voice language set with eval_steps=save_steps=600. I’m using Trainer.train() on GPU.

Here are the durations on an i7-8900K and an RTX 3090 for those 600 steps:

Train Phase: 0:21:43
Eval Phase: 0:08:28
Extra Wait : 0:13:15

After the wait, the process continues. I’m using the following settings, in case they are relevant:

    per_device_train_batch_size = 64,
    gradient_accumulation_steps = 1,
    per_device_eval_batch_size = 16,
    eval_accumulation_steps = 1,

    optim = "adamw_torch",
    tf32 = True,
    fp16 = True,

    gradient_checkpointing = True,
    predict_with_generate = True,

...etc...

Where should I look? What can be the culprit?
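
One place to look, as a hedged suggestion: as far as I know, the Trainer calls `compute_metrics` only after the last evaluation batch has finished, so decoding the predictions and computing the metric happens in the main thread after the evaluation progress bar completes. The post does not say whether a `compute_metrics` function is used, so that is an assumption here. A minimal sketch for measuring it is a timing decorator; the `timed` helper and the stand-in `compute_metrics` body below are hypothetical.

```python
import functools
import time


def timed(label):
    """Return a decorator that prints how long each call to the wrapped function takes."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                elapsed = time.perf_counter() - start
                print(f"[{label}] {fn.__name__} took {elapsed:.1f}s", flush=True)
        return wrapper
    return decorator


# Hypothetical stand-in for the metric function passed to the Trainer.
# A real one would decode the predictions and compute a metric such as WER.
@timed("eval")
def compute_metrics(pred):
    return {"wer": 0.0}
```

Passing the wrapped function as `compute_metrics=` to the trainer prints one line per evaluation. If the reported time is close to the extra wait, the metric computation is the bottleneck rather than the checkpoint save.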

bozden · May 30, 2023, 7:35pm

Here is the CPU usage during the wait period. Besides the Python process, I have browsers and VS Code open.

FDM1 · September 2, 2024, 2:01pm

I’m having the same problem. Did you solve it in some way?

Thanks a lot in advance.

bozden · September 2, 2024, 9:33pm

No, I lived with it. But I have not trained anything for a while, so newer versions may be better, I hope.

When posting this, I forgot that this period is the backpropagation duration, which seems to run on a single core (I think it is the last logical core). I wouldn’t have expected it to take so long, though.

Just a guess, without looking into details…
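
A guess like this can be checked without a profiler. The sketch below uses Python's standard `faulthandler` module to write the stack of every thread to a file at a fixed interval, so the dumps taken during the stall show which function the main thread is in. The helper names `start_stall_watch` and `stop_stall_watch` and the log path are hypothetical, not part of any library.

```python
import faulthandler


def start_stall_watch(log_path, interval_s=60.0):
    """Dump the stack of every thread to log_path every interval_s seconds."""
    # The file must stay open until the watch is cancelled.
    log_file = open(log_path, "w")
    faulthandler.dump_traceback_later(interval_s, repeat=True, file=log_file)
    return log_file


def stop_stall_watch(log_file):
    """Cancel the periodic dumps and close the log file."""
    faulthandler.cancel_dump_traceback_later()
    log_file.close()
```

Calling `start_stall_watch("stall.log")` before `Trainer.train()` and reading the entries written during the wait should show whether the time goes to metric computation, checkpoint saving, or something else.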

FDM1 · September 2, 2024, 11:32pm

It did not get better, I am afraid.
I opened another topic, hoping someone more experienced can explain.

FDM1 · September 3, 2024, 4:35pm

I solved it by implementing the training with PyTorch Lightning.

This example pushes the model to the Hub, but you can easily get a classic torch-format .ckpt file to reload instead.

github.com

NielsRogge/Transformers-Tutorials/blob/master/DETR/Fine_tuning_DetrForObjectDetection_on_custom_dataset_(balloon).ipynb

bozden · September 3, 2024, 6:12pm

@FDM1, wrong thread?

FDM1 · September 4, 2024, 3:07pm

No, no. If you use PyTorch Lightning for training, it is faster at both training and saving the model.
The link is an example implementation for a custom dataset.

bozden · September 4, 2024, 5:52pm

Oh, never used it, I’ll look at it. Thank you for sharing.

ameenmohammad · September 16, 2024, 2:08pm

Hi @FDM1, can you point me to a resource I can use to learn how to switch to it?
I am trying to fine-tune Whisper and saving checkpoints takes 2 hours!!!
How do I implement this?

Thanks in advance.
