I'm experimenting with Whisper fine-tuning on custom splits of Common Voice, with augmentation and similar transforms, using datasets stored on my disk. As I'll reuse the same augmented & converted dataset across multiple training sessions, and conversion takes 1-2 hours, I save the results to disk with a custom convert.py script to keep the runs comparable.
I have two deal-breakers:
- Once converted, the datasets become huge and no longer fit in RAM, and page files of some 200 GB get created. That wears down the NVMe/SSD disks, and the paging degrades performance.
- At some point (>100k recordings), while finalizing, it tries to allocate 34 GB on top of an already-full RAM (48 GB) and crashes.
I've read the whole documentation and many posts on this forum but couldn't figure out how to approach the problem.
- Presumably I need code that runs the preprocessor (tokenizer & feature_extractor) on one batch of records at a time and writes the output to the final destination. But how?
- Or should I divide the dataset into multiple datasets?
- Should I bucket similar-length audio together and save the buckets separately (this is already advised for training)?
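To make the second option concrete, here is a minimal pure-Python sketch of the shard bookkeeping I have in mind: split the index range into fixed-size shards, process one shard at a time, and save each shard before loading the next, so only one shard's worth of data is ever in RAM. The shard size, output directory, and the `shard_NNNNN` naming are placeholders of my own, not anything from the `datasets` library:

```python
from pathlib import Path

def shard_ranges(total: int, shard_size: int):
    """Yield (start, end) index pairs covering 0..total in shard_size chunks."""
    for start in range(0, total, shard_size):
        yield start, min(start + shard_size, total)

def plan_shards(total: int, shard_size: int, out_dir: str) -> list[str]:
    """Sketch of the loop: each shard would be mapped and saved here
    (e.g. select the index range, run the preprocessor, save to disk)
    before the next shard is touched."""
    paths = []
    for i, (start, end) in enumerate(shard_ranges(total, shard_size)):
        # Real pipeline would do something like:
        #   shard = full_dataset.select(range(start, end)).map(preprocessor, ...)
        #   shard.save_to_disk(path)
        path = str(Path(out_dir) / f"shard_{i:05d}")
        paths.append(path)
    return paths
```

Whether the saved shards are later reloaded as one concatenated dataset or trained on sequentially is a separate question, but the bookkeeping itself is this simple.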
Here is the caller code:
```python
batch_size: int = calc_batch_size(recs_total)
g.CV = g.CV.map(
    preprocessor,
    # batched=True,
    # batch_size=batch_size,
    remove_columns=["audio", "sentence"],
    num_proc=g.MAX_NUM_PROCS,
    desc="preprocess",
)
```
```python
def preprocessor(batch):
    audio = batch["audio"]
    batch["input_features"] = feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features
    input_str = normalizer(batch["sentence"]).strip()
    batch["labels"] = tokenizer(input_str).input_ids
    return batch
```
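For reference, if `batched=True` were enabled in the `map` call, the function would receive lists instead of single examples. Below is a hedged sketch of how the preprocessor's shape would change; the `feature_extractor`, `tokenizer`, and `normalizer` defined here are toy stand-ins only so the sketch runs on its own, not the real Whisper objects:

```python
# Toy stand-ins for the real Whisper processor objects (placeholders,
# kept only so this sketch is self-contained and runnable).
def feature_extractor(arrays, sampling_rate):
    class Out:
        pass
    out = Out()
    out.input_features = [[len(a)] for a in arrays]  # placeholder "features"
    return out

def tokenizer(texts):
    class Out:
        pass
    out = Out()
    out.input_ids = [[ord(c) for c in t] for t in texts]  # placeholder ids
    return out

def normalizer(text):
    return text.lower()

def preprocessor_batched(batch):
    """Batched variant: every field in `batch` is a list of examples."""
    arrays = [a["array"] for a in batch["audio"]]
    rates = {a["sampling_rate"] for a in batch["audio"]}
    assert len(rates) == 1, "expected a uniform sampling rate within a batch"
    batch["input_features"] = feature_extractor(
        arrays, sampling_rate=rates.pop()
    ).input_features
    texts = [normalizer(s).strip() for s in batch["sentence"]]
    batch["labels"] = tokenizer(texts).input_ids
    return batch
```

The point of the sketch is only the list-in/list-out shape; whether batching the real extractor/tokenizer calls like this actually reduces peak memory is exactly what I'm asking about.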
Thank you in advance for any ideas & pointers…