Hi!
I have an audio dataset, and I’m using the standard load_dataset pattern to get the dataset into memory:
dataset = load_dataset("audiofolder", data_dir="../data", streaming=streaming)
And then I use map with batched=True to do the preprocessing. The problem is that the preprocessing explodes the size of the dataset, because the whole batch is built up at once. Is it possible to pass a generator to map instead of a function? That would fix the problem, since a single sample would be emitted whenever next is called instead of the whole batch being produced in one go.
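For context, the preprocessing currently looks roughly like this (the chunking is just a stand-in for whatever expands the data; the column and function names are illustrative):

import numpy as np

def preprocess(batch):
    # Hypothetical example: split each clip into ~1 s chunks, so a single
    # input row can turn into many output rows.
    chunks = []
    for audio in batch["audio"]:
        arr = audio["array"]
        n_chunks = max(1, len(arr) // 16000)
        chunks.extend(np.array_split(arr, n_chunks))
    # The entire expanded batch is materialized here before map writes it out,
    # which is what blows up memory.
    return {"input_values": chunks}

processed = dataset["train"].map(
    preprocess,
    batched=True,
    batch_size=32,
    remove_columns=["audio", "label"],  # whatever columns audiofolder produced
)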
I know that it’s possible to create an IterableDataset from a generator, and I can do that as a last resort (sketched below), but load_dataset is quite nice and saves me from doing a bunch of manual librosa stuff.
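For reference, the last-resort version I have in mind looks something like this, where I’d give up audiofolder and do the decoding myself (paths, sample rate, and the preprocessing step are placeholders):

import glob
import librosa
from datasets import IterableDataset

def sample_generator(data_dir="../data", sr=16000):
    # Manual replacement for audiofolder: walk the directory, decode each
    # file with librosa, and yield one preprocessed sample at a time.
    for path in glob.glob(f"{data_dir}/**/*.wav", recursive=True):
        array, _ = librosa.load(path, sr=sr)
        yield {"input_values": array, "path": path}  # real preprocessing would go here

lazy_dataset = IterableDataset.from_generator(sample_generator)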
Thanks in advance for any tips!