I’m fairly new to HF and trying some intermediate stuff, but my problems probably fall under very basic.
I’m trying to fine-tune Whisper multilingual models with the Common Voice v13.0 language datasets to see how far I can get; the results will be a round-up across languages. I have the datasets on disk with custom splits, so I don’t use the default HF-provided datasets. I mainly followed this tutorial.
My machine is a 6-core/12-thread Windows 10 x64 box with an RTX 3090, 48 GB RAM, NVMe & SSD drives, plus large mechanical data disks.
I checked a lot of other documentation, but I could not find a reasonable solution to the following problems:
- Whenever I cast/convert the Audio column, RAM quickly fills up and huge pagefiles get created: 120+ GB with my custom splits (~65k-example training set). Of course this drops performance a lot due to disk access, even on NVMe/SSD. Is this normal? What are my options for Whisper and the Audio feature? (The first sketch after this list shows roughly what I run.)
- After the conversion I save the dataset so that I can just load it on consecutive runs. The Arrow files are huge, again 100 GB or so… Is there a way to create a compressed version? (I checked; compressed it can be ~8% of the original. The second sketch after this list shows what I am considering.)
- During Trainer.train, VRAM fills with ~16 GB of data (out of the available 24 GB) at batch size 64. On the other hand, the CUDA cores are barely used at all (with FP16): Windows Task Manager shows a spike every ~10 seconds, but otherwise the GPU sits virtually idle. After a couple of steps it reports ~35 s/iteration and an ETA of 38+ hours with 31k train / 10k dev examples, which is way too slow. What can be the reason?
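
In case the exact code matters: this is roughly the preprocessing I run, following the tutorial. The model size, language, and dataset path are placeholders, and `sentence` is the Common Voice transcript column:

```python
from datasets import load_from_disk, Audio
from transformers import WhisperFeatureExtractor, WhisperTokenizer

# placeholders: model size, language, and path
feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")
tokenizer = WhisperTokenizer.from_pretrained(
    "openai/whisper-small", language="Turkish", task="transcribe"
)

dataset = load_from_disk("D:/datasets/cv13_custom_splits")  # placeholder path
# resample to Whisper's expected 16 kHz; decoding happens on access
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))

def prepare_dataset(batch):
    audio = batch["audio"]
    # log-Mel spectrogram input features for the encoder
    batch["input_features"] = feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    # token ids of the transcription for the decoder labels
    batch["labels"] = tokenizer(batch["sentence"]).input_ids
    return batch

# this map() is where RAM/pagefile usage explodes for me
dataset = dataset.map(
    prepare_dataset, remove_columns=dataset.column_names["train"]
)
```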
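As for the size, what I was experimenting with when I checked that ~8% figure (not sure it is the intended way) is exporting the processed splits to Parquet with compression instead of plain `save_to_disk`; as far as I understand, the extra kwargs are forwarded to `pyarrow.parquet.ParquetWriter`:

```python
from datasets import load_dataset

# write each split as a zstd-compressed Parquet file instead of raw Arrow
for split in dataset:
    dataset[split].to_parquet(f"cv13_processed_{split}.parquet", compression="zstd")

# reload later with the generic parquet loader
processed = load_dataset(
    "parquet",
    data_files={split: f"cv13_processed_{split}.parquet" for split in dataset},
)
```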
Please advise where to look so that I can overcome these…
Thank you in advance.
Training arguments:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir = "./whisper-finetune",  # placeholder path
    overwrite_output_dir = True,
    do_train = True,
    do_eval = True,
    evaluation_strategy = "steps",
    per_device_train_batch_size = 64,
    gradient_accumulation_steps = 4,
    per_device_eval_batch_size = 64,
    eval_accumulation_steps = 4,
    learning_rate = 1e-5,
    warmup_steps = 500,
    optim = "adamw_torch",
    save_steps = 1000,
    eval_steps = 1000,
    logging_steps = 25,
    max_steps = 4000,
    no_cuda = False,
    fp16 = True,
    gradient_checkpointing = True,
    predict_with_generate = True,
    generation_max_length = 225,
    load_best_model_at_end = True,
    metric_for_best_model = "wer",
    greater_is_better = False,
    report_to = "none",
    push_to_hub = False,
)
```
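
For completeness, the trainer itself is wired up as in the tutorial. Here `data_collator`, `compute_metrics`, and `processor` are the usual tutorial definitions (the padding speech-to-text collator and WER via `evaluate`), and the model size and split names again match my placeholders above:

```python
from transformers import Seq2SeqTrainer, WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["dev"],
    data_collator=data_collator,      # padding collator from the tutorial
    compute_metrics=compute_metrics,  # WER metric
    tokenizer=processor.feature_extractor,
)
trainer.train()
```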