Help needed with issues while trying to fine-tune Whisper

I’m fairly new to HF and am attempting some intermediate-level work, but the problems I’m running into are very basic.

I’m trying to fine-tune the multilingual Whisper models on Common Voice v13.0 language datasets to see how far I can go (it will be a round-up). I have the datasets on disk with custom splits, so I don’t use the default HF-provided datasets. I mainly followed this tutorial.

My machine has a 6-core/12-thread CPU, Windows 10, an RTX 3090, 48 GB RAM, and NVMe and SSD drives plus large mechanical data disks.

I checked a lot of other documentation but could not find a reasonable solution to the following problems:

  1. Whenever I cast/convert the Audio field, RAM quickly fills up and huge pagefiles get created. With my custom splits (~65k training examples) these grew to 120+ GB, and of course performance drops a lot because of the disk access, even on NVMe/SSD. Is this normal? What are my options for Whisper and Audio?
  2. After the conversion I save the dataset so that I can load it in subsequent runs. The Arrow files are huge, again 100 GB or so. Is there a way to create a compressed version? (I checked: it can be compressed to 8% of the original size.)
  3. During Trainer.train, VRAM fills up with 16 GB of data (out of the available 24 GB) at a batch size of 64. On the other hand, the CUDA cores are barely used (with FP16). Windows Task Manager shows some spikes every 10 seconds or so, but otherwise the GPU is virtually idle. After a couple of steps it reports ~35 sec/iteration and an estimate of 38+ hours to complete with a 31k train/10k dev split, which is way too slow. What could be the reason?

Please advise where to look so that I can overcome these…
Thank you in advance.

Training arguments:

    overwrite_output_dir = True, 

    do_train  = True,
    do_eval  = True,
    evaluation_strategy = "steps",

    per_device_train_batch_size = 64,
    gradient_accumulation_steps = 4,
    per_device_eval_batch_size = 64,
    eval_accumulation_steps = 4,

    learning_rate = 1e-5,
    warmup_steps = 500,

    optim = "adamw_torch",

    save_steps = 1000,
    eval_steps = 1000,
    logging_steps = 25,
    max_steps = 4000,

    no_cuda = False,
    fp16 = True,

    gradient_checkpointing = True,
    predict_with_generate = True,
    generation_max_length = 225,

    load_best_model_at_end = True,
    metric_for_best_model = "wer",
    greater_is_better = False,

    report_to = "none",
    push_to_hub = False,

For those who have similar problems, here are some solutions I found after I posted:

  1. Very large swap: Yes, this happens. My solution was as follows:
  • I sorted the train dataset (the large one) by audio duration before the conversion.
  • I calculated a “shard_size” (10-15 GB) and a “shard_count”.
  • I converted in batches and saved them under train/shard_NN in a subroutine.
  • That way only part of the converted dataset is kept in memory, so the swapping issue is gone.
  2. Large disk space used: I could not find a way to compress the files programmatically. Yes, I could use tarfile at the expense of a lot of processing time, but I did not want that: during fine-tuning the data would have to be decompressed after each epoch, which is too much. My large drive also has a large block size, so Windows’ standard compression did not work either. So:
  • I found CompactGUI and used it on the “converted” directory, which dropped the size to ~8-12% of the original. It is a one-time compression, and it works transparently at the OS level.
  3. Low CUDA usage: I was trying the tiny model, and I figured out that the GPU was actually starving. Depending on the model and your config, you cannot fully utilize the CPU, GPU and disk all at once. E.g. with a larger model (small), it is the CPU that is not fully used. In any case, the following helped a lot to feed the GPU more effectively:
  • dataloader_num_workers = psutil.cpu_count(logical=True) (12 threads in my case)
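
For anyone who wants to try the sharded conversion from point 1, here is a minimal sketch of the idea, not the exact code I ran. It assumes `dataset` is a Hugging Face `datasets.Dataset`, `prepare_fn` is your own per-example conversion function (as in the tutorial), `durations` is a list of clip lengths in seconds that you computed yourself, and `bytes_per_example` is your own estimate of the size of one converted row. The helper names `plan_shards` and `convert_in_shards` are made up for illustration.

```python
import math
import os


def plan_shards(durations, bytes_per_example, shard_bytes):
    """Sort row indices by duration and cut them into shards.

    bytes_per_example is a caller-supplied estimate (an assumption),
    shard_bytes is the target shard size, e.g. 10-15 GB.
    """
    if not durations:
        return []
    per_shard = max(1, shard_bytes // bytes_per_example)
    shard_count = math.ceil(len(durations) / per_shard)
    # Indices of the rows, shortest clip first.
    order = sorted(range(len(durations)), key=lambda i: durations[i])
    return [order[k * per_shard:(k + 1) * per_shard] for k in range(shard_count)]


def convert_in_shards(dataset, prepare_fn, out_dir, durations,
                      bytes_per_example, shard_bytes):
    """Convert and save one shard at a time under out_dir/shard_NN."""
    paths = []
    shards = plan_shards(durations, bytes_per_example, shard_bytes)
    for n, indices in enumerate(shards):
        path = os.path.join(out_dir, f"shard_{n:02d}")
        # Only the rows of this shard are converted before saving.
        shard = dataset.select(indices).map(
            prepare_fn, remove_columns=dataset.column_names
        )
        shard.save_to_disk(path)
        paths.append(path)
    return paths
```

To train on the result, the saved shards can be read back with `datasets.load_from_disk` and joined with `datasets.concatenate_datasets`. Whether 10-15 GB is the right shard size will depend on how much RAM you have free.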