How to fit a custom audio dataset during pre-processing? Batch? Stream? Shard?

I am experimenting with Whisper fine-tuning on custom splits of Common Voice, with augmentation and such, working from datasets on my disk. Since I’ll use the same augmented & converted dataset for multiple training sessions, and converting takes 1-2 hours, I save it to disk with a custom convert.py script, which also keeps the runs comparable.

I have two deal-breakers:

  • Once converted, the datasets become huge and no longer fit in RAM, so page files of around 200 GB get created. That wears down the NVMe/SSD drives, and the paging hurts performance.
  • At some point (>100k recordings), while finalizing, it tries to allocate another 34 GB from already-full RAM (48 GB) and crashes.

I have read the whole documentation and many posts on this forum but could not figure out how to approach the problem.

  • I should be able to write code that runs the preprocessor (tokenizer & feature_extractor) on a batch of records and writes them out to the final destination. But how? (See the untested batched sketch after my current code below.)
  • Or should I divide the dataset into multiple smaller datasets (as sketched right after this list)?
  • Should I group similar-length recordings together and save those groups separately (this is already advised for training)?
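
To make the second bullet concrete, by "divide into multiple datasets" I mean something like this untested sketch, applied per split (the part count and paths are made up):

    NUM_PARTS = 10  # made-up number of pieces

    for i in range(NUM_PARTS):
        # contiguous=True keeps neighbouring records together in each piece
        part = g.CV["train"].shard(num_shards=NUM_PARTS, index=i, contiguous=True)
        part.save_to_disk(f"raw_parts/train/part_{i:02d}")  # hypothetical path

Each piece could then be converted and saved on its own, so only one piece has to be live at a time.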

Here is the caller code:

    batch_size: int = calc_batch_size(recs_total)
    g.CV = g.CV.map(
        preprocessor,
        # batched=True,          # currently disabled; the preprocessor below is per-record
        # batch_size=batch_size,
        remove_columns=["audio", "sentence"],
        num_proc=g.MAX_NUM_PROCS,
        desc="preprocess",
    )

And the preprocessor:

    def preprocessor(batch):
        # despite the argument name, this runs on a single record (map is not batched)
        audio = batch["audio"]
        batch["input_features"] = feature_extractor(
            audio["array"], sampling_rate=audio["sampling_rate"]
        ).input_features[0]
        input_str = normalizer(batch["sentence"]).strip()
        batch["labels"] = tokenizer(input_str).input_ids
        return batch
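
If I went the batched=True route instead, I assume the preprocessor would have to work on lists, roughly like this (untested sketch):

    def preprocessor_batched(batch):
        # with batched=True, batch["audio"] is a list of decoded audio dicts
        arrays = [a["array"] for a in batch["audio"]]
        sampling_rate = batch["audio"][0]["sampling_rate"]
        batch["input_features"] = feature_extractor(
            arrays, sampling_rate=sampling_rate
        ).input_features
        input_strs = [normalizer(s).strip() for s in batch["sentence"]]
        batch["labels"] = tokenizer(input_strs).input_ids
        return batch

and the map() call would then pass batched=True and batch_size=batch_size. But I still do not see how that alone keeps the finished result from piling up in RAM at the end.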

Thank you in advance for any ideas & pointers…

After some more reading, I introduced sharding into the workflow. With some testing, a shard-count calculation function, explicit garbage collection, and multi-core processing, I can now balance CPU, RAM, and disk usage with no swapping…
I lost my DatasetDict metadata along the way, though… a problem for another day…
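
For reference, the loop now looks roughly like this (simplified; calc_num_shards and the paths are placeholders, and I process one split of the DatasetDict at a time):

    import gc
    from datasets import concatenate_datasets, load_from_disk

    num_shards = calc_num_shards(recs_total)  # placeholder for my sizing heuristic

    for i in range(num_shards):
        shard = g.CV["train"].shard(num_shards=num_shards, index=i, contiguous=True)
        shard = shard.map(
            preprocessor,
            remove_columns=["audio", "sentence"],
            num_proc=g.MAX_NUM_PROCS,
            desc=f"preprocess shard {i + 1}/{num_shards}",
        )
        shard.save_to_disk(f"converted/train/shard_{i:03d}")  # placeholder path
        del shard
        gc.collect()  # free the processed shard before decoding the next one

    # later, before training: reassembling with concatenate_datasets gives back a
    # plain Dataset, which is probably where the DatasetDict-level info gets lost
    train = concatenate_datasets(
        [load_from_disk(f"converted/train/shard_{i:03d}") for i in range(num_shards)]
    )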