Stream Audio Dataset that Can't be moved to Hub

I have a large audio dataset that might be much easier to process if it could be streamed. The problem is that I cannot upload it to the Hub because of license restrictions; I can only use it locally on our cluster. The only way to stream seems to be to upload it to the Hub. Is that correct (in which case, I am out of luck…)?

Thanks
Michael Picheny

Hi! You can definitely stream locally, for example using load_dataset(..., streaming=True) on local files.

I thought you could only use save_to_disk and load_from_disk to create local files, and that load_from_disk does not support streaming…

Thanks
Michael

Yes indeed, save_to_disk creates local files and load_from_disk doesn't support streaming from those local files (yet?).

Only load_dataset does right now.

Thanks. So unless the data is uploaded, there are no options for streaming. Is that correct?

Best
Michael

You can save the dataset to parquet locally using to_parquet, and then reload the parquet data in streaming mode using load_dataset :slight_smile:

I have a similar problem… @lhoestq Could you give us a bigger code example of the idea you suggested?

Sure, here you go :slight_smile:

without writing new files:

ids = ds.to_iterable_dataset()  # optionally pass num_shards= if you want to shuffle later or for parallel loading with a dataloader

by writing dataset to parquet and stream it later:

ds.to_parquet("parquet_dir/train.parquet")
# later
ids = load_dataset("parquet_dir", streaming=True, split="train")

and if your dataset is big you can even save it in shards:

num_shards = 16
for index in range(num_shards):
    shard = ds.shard(num_shards=num_shards, index=index, contiguous=True)
    shard.to_parquet(f"parquet_dir/train-{index:05d}-of-{num_shards:05d}.parquet")
# later
ids = load_dataset("parquet_dir", streaming=True, split="train")