I am training a model on Common Voice data. The dataset is sharded, as the streaming load shows:
common_voice = load_dataset("mozilla-foundation/common_voice_17_0", "ca", streaming=True)
IterableDatasetDict({
    train: IterableDataset({
        features: ['client_id', 'path', 'audio', 'sentence', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment', 'variant'],
        n_shards: 29
    })
    validation: IterableDataset({
        features: ['client_id', 'path', 'audio', 'sentence', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment', 'variant'],
        n_shards: 1
    })
    test: IterableDataset({
        features: ['client_id', 'path', 'audio', 'sentence', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment', 'variant'],
        n_shards: 1
    })
    other: IterableDataset({
        features: ['client_id', 'path', 'audio', 'sentence', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment', 'variant'],
        n_shards: 13
    })
    invalidated: IterableDataset({
        features: ['client_id', 'path', 'audio', 'sentence', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment', 'variant'],
        n_shards: 3
    })
    validated: IterableDataset({
        features: ['client_id', 'path', 'audio', 'sentence', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment', 'variant'],
        n_shards: 46
    })
})
I tried unsharding this data, but it fills up RAM quickly. What is the best way to (1) process the audio data, including resampling the audio and filtering for records shorter than 30 s, as in the tutorial, and (2) feed the train and validation splits into a model, as in the same tutorial?
I read that one way is to transform the data on the fly using set_transform, but I haven’t seen examples of how to incorporate this into the code, i.e. whether it goes before passing the data to the Seq2Seq trainer or inside that function call.