I'm experimenting with Whisper fine-tuning on custom splits of Common Voice, with augmentation and similar transforms, using datasets stored on my disk. As I'll reuse the same augmented & converted dataset across multiple training sessions, and conversion takes 1-2 hours, I save the results to disk with a custom convert.py script to keep the runs comparable.
I have two deal-breakers:
- Once converted, the datasets become huge and no longer fit in RAM, and page files of some 200 GB get created. That wears down the NVMe/SSD disks, and the paging degrades performance.
- At some point (>100k recordings), while finalizing, it tries to allocate 34 GB on top of an already-full RAM (48 GB) and crashes.
I've read the whole documentation and many posts on this forum but couldn't figure out how to approach the problem.
- Presumably I need code that runs the preprocessor (tokenizer & feature_extractor) on one batch of records at a time and writes the output to the final destination. But how?
- Or should I divide the dataset into multiple datasets?
- Should I bucket similar-length audio together and save the buckets separately (this is already advised for training)?
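To make the second option concrete, here is a minimal pure-Python sketch of the shard bookkeeping I have in mind: split the index range into fixed-size shards, process one shard at a time, and save each shard before loading the next, so only one shard's worth of data is ever in RAM. The shard size, output directory, and the `shard_NNNNN` naming are placeholders of my own, not anything from the `datasets` library:

```python
from pathlib import Path

def shard_ranges(total: int, shard_size: int):
    """Yield (start, end) index pairs covering 0..total in shard_size chunks."""
    for start in range(0, total, shard_size):
        yield start, min(start + shard_size, total)

def plan_shards(total: int, shard_size: int, out_dir: str) -> list[str]:
    """Sketch of the loop: each shard would be mapped and saved here
    (e.g. select the index range, run the preprocessor, save to disk)
    before the next shard is touched."""
    paths = []
    for i, (start, end) in enumerate(shard_ranges(total, shard_size)):
        # Real pipeline would do something like:
        #   shard = full_dataset.select(range(start, end)).map(preprocessor, ...)
        #   shard.save_to_disk(path)
        path = str(Path(out_dir) / f"shard_{i:05d}")
        paths.append(path)
    return paths
```

Whether the saved shards are later reloaded as one concatenated dataset or trained on sequentially is a separate question, but the bookkeeping itself is this simple.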
Here is the caller code:
```python
batch_size: int = calc_batch_size(recs_total)
g.CV = g.CV.map(
    preprocessor,
    # batched=True,
    # batch_size=batch_size,
    remove_columns=["audio", "sentence"],
    num_proc=g.MAX_NUM_PROCS,
    desc="preprocess",
)
```
```python
def preprocessor(batch):
    audio = batch["audio"]
    batch["input_features"] = feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features
    input_str = normalizer(batch["sentence"]).strip()
    batch["labels"] = tokenizer(input_str).input_ids
    return batch
```
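For reference, if `batched=True` were enabled in the `map` call, the function would receive lists instead of single examples. Below is a hedged sketch of how the preprocessor's shape would change; the `feature_extractor`, `tokenizer`, and `normalizer` defined here are toy stand-ins only so the sketch runs on its own, not the real Whisper objects:

```python
# Toy stand-ins for the real Whisper processor objects (placeholders,
# kept only so this sketch is self-contained and runnable).
def feature_extractor(arrays, sampling_rate):
    class Out:
        pass
    out = Out()
    out.input_features = [[len(a)] for a in arrays]  # placeholder "features"
    return out

def tokenizer(texts):
    class Out:
        pass
    out = Out()
    out.input_ids = [[ord(c) for c in t] for t in texts]  # placeholder ids
    return out

def normalizer(text):
    return text.lower()

def preprocessor_batched(batch):
    """Batched variant: every field in `batch` is a list of examples."""
    arrays = [a["array"] for a in batch["audio"]]
    rates = {a["sampling_rate"] for a in batch["audio"]}
    assert len(rates) == 1, "expected a uniform sampling rate within a batch"
    batch["input_features"] = feature_extractor(
        arrays, sampling_rate=rates.pop()
    ).input_features
    texts = [normalizer(s).strip() for s in batch["sentence"]]
    batch["labels"] = tokenizer(texts).input_ids
    return batch
```

The point of the sketch is only the list-in/list-out shape; whether batching the real extractor/tokenizer calls like this actually reduces peak memory is exactly what I'm asking about.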
Thank you in advance for any ideas & pointers…