Question about streaming

I have a huge audio dataset, which I process using code like this:

dataset = load_dataset(load_script_path, configuration, cache_dir=self.cache_dir, split=split)
dataset = dataset.cast_column('audio', Audio(sampling_rate=self.sample_rate))
dataset = dataset.map(self.prepare_dataset, remove_columns=dataset.column_names, num_proc=self.num_proc)
dataset.save_to_disk(save_path)

My question is: after calling save_to_disk on the mapped dataset, is there a way to load the saved dataset in streaming mode? The map step takes a very long time, so I don't want to run it again. Thanks a lot.

I also noticed that save_to_disk splits the dataset into many shards, so I think there may be a way to stream it, but I don't know how.

You can use load_from_disk to load a saved dataset, but it doesn't have a streaming argument (yet?). If you want an iterable dataset, you can do:

idataset = load_from_disk(...).to_iterable_dataset()

Yes, I do it that way, thanks.