I'm experimenting with Whisper fine-tuning on custom splits of Common Voice, with augmentation and so on, built from datasets on my disk. Because I'll reuse the same augmented & converted dataset for multiple training sessions, and conversion takes 1-2 hours, I save it to disk with a custom convert.py script, which also keeps runs comparable.
I have two deal-breakers:
- Once converted, the datasets become huge and no longer fit in RAM, and page files of around 200 GB get created. That wears down the NVMe/SSD drives, and the paging hurts performance.
- At some point (>100k recordings), while finalizing, the process tries to allocate 34 GB on top of the already full 48 GB of RAM and crashes.
I have read the whole documentation and many posts on this forum, but I could not figure out how to approach the problem.
- I need a way to run the preprocessor (tokenizer & feature_extractor) on batches of records and write the results straight to their final destination. But how?
- Or should I divide the dataset into multiple smaller datasets (see the sketch after this list)?
- Should I group audio of similar length and save those groups separately (this is already advised for training)?
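To make the second option concrete, here is a minimal sketch of what I have in mind, not yet tested at scale: process the dataset in shards so only one shard is materialized at a time, then save each shard to its own directory. NUM_SHARDS and OUT_DIR are placeholders I made up; g.CV and preprocessor are the objects from the code below.

from pathlib import Path
from datasets import concatenate_datasets, load_from_disk

NUM_SHARDS = 16               # placeholder: tune so one shard fits in RAM
OUT_DIR = Path("converted")   # placeholder output directory

for i in range(NUM_SHARDS):
    shard = g.CV.shard(num_shards=NUM_SHARDS, index=i, contiguous=True)
    shard = shard.map(
        preprocessor,
        remove_columns=["audio", "sentence"],
        num_proc=g.MAX_NUM_PROCS,
        writer_batch_size=100,   # flush to the Arrow cache more often to keep RAM low
        desc=f"preprocess shard {i}",
    )
    shard.save_to_disk(str(OUT_DIR / f"shard_{i:03d}"))

# later, for training, the shards could be reloaded and concatenated
train_ds = concatenate_datasets(
    [load_from_disk(str(OUT_DIR / f"shard_{i:03d}")) for i in range(NUM_SHARDS)]
)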
For reference, here is the current caller code:
batch_size: int = calc_batch_size(recs_total)

g.CV = g.CV.map(
    preprocessor,
    # batched=True,
    # batch_size=batch_size,
    remove_columns=["audio", "sentence"],
    num_proc=g.MAX_NUM_PROCS,
    desc="preprocess",
)
And the preprocessor:
def preprocessor(batch):
    audio = batch["audio"]
    # compute log-mel input features for a single clip
    batch["input_features"] = feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    # normalize the transcript and tokenize it as labels
    input_str = normalizer(batch["sentence"]).strip()
    batch["labels"] = tokenizer(input_str).input_ids
    return batch
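For the first bullet, this is roughly how I imagine a batched variant of the preprocessor, if I re-enable batched=True / batch_size in the map call above; it is untested, and it assumes all clips in a batch share the same sampling rate:

def preprocessor_batched(batch):
    # with batched=True, batch["audio"] is a list of decoded audio dicts
    # and batch["sentence"] a list of strings
    features = feature_extractor(
        [a["array"] for a in batch["audio"]],
        sampling_rate=batch["audio"][0]["sampling_rate"],
    ).input_features
    texts = [normalizer(s).strip() for s in batch["sentence"]]
    return {
        "input_features": features,
        "labels": tokenizer(texts).input_ids,
    }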
Thank you in advance for any ideas & pointers…