I am training a model on Common Voice data. The dataset is sharded, as the streaming load shows:
common_voice = load_dataset("mozilla-foundation/common_voice_17_0", "ca", streaming=True)
IterableDatasetDict({
    train: IterableDataset({
        features: ['client_id', 'path', 'audio', 'sentence', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment', 'variant'],
        n_shards: 29
    })
    validation: IterableDataset({
        features: ['client_id', 'path', 'audio', 'sentence', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment', 'variant'],
        n_shards: 1
    })
    test: IterableDataset({
        features: ['client_id', 'path', 'audio', 'sentence', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment', 'variant'],
        n_shards: 1
    })
    other: IterableDataset({
        features: ['client_id', 'path', 'audio', 'sentence', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment', 'variant'],
        n_shards: 13
    })
    invalidated: IterableDataset({
        features: ['client_id', 'path', 'audio', 'sentence', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment', 'variant'],
        n_shards: 3
    })
    validated: IterableDataset({
        features: ['client_id', 'path', 'audio', 'sentence', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment', 'variant'],
        n_shards: 46
    })
})
I tried unsharding this data, but it fills up RAM quickly. What is the best way to (1) process the audio data, including resampling the audio and filtering for records shorter than 30 s, as in the tutorial, and (2) feed the train and validation splits into a model, as in the same tutorial?
I read that one way is to transform the data on the fly using set_transform, but I haven’t seen examples of how to incorporate this into the code, i.e. whether it goes before passing the data to the Seq2Seq trainer or inside that function call.