I have a huge audio dataset, which I process with code like this:
from datasets import load_dataset, Audio
dataset = load_dataset(load_script_path, configuration, cache_dir=self.cache_dir, split=split)
dataset = dataset.cast_column('audio', Audio(sampling_rate=self.sample_rate))
dataset = dataset.map(self.prepare_dataset, remove_columns=dataset.column_names, num_proc=self.num_proc)
dataset.save_to_disk(save_path)
My question: when I use save_to_disk after map, is there a way to load the saved dataset in streaming mode? I ask because the map step takes a very long time. Thanks a lot.