Load iterable dataset from disk

Hello, I’ve made a custom dataset and saved it to disk using Dataset.save_to_disk. I would like to shuffle it together with common_voice, but I don’t currently have the disk space for common_voice.

I can load common_voice with streaming=True and it becomes an iterable, which works fine.

So, my question is, can I somehow load my saved dataset as an iterable too so I can combine them using interleave_datasets? datasets.load_from_disk doesn’t seem to have that option.

thanks!
Jonathan

Hi! load_from_disk doesn’t support streaming at the moment, but you can save the dataset to JSON for instance and stream it from there with:

from datasets import load_dataset

custom_ds.to_json("path/to/data/file")
custom_ds_iter = load_dataset("json", data_files="path/to/data/file", split="train", streaming=True, features=custom_ds.features)

Thank you – I finally got this to work on my low-resource machine.