Stream an Audio Dataset That Can't Be Moved to the Hub

I have a large audio dataset that would be much easier to process if it could be streamed. The problem is that I cannot upload it to the Hub because of license restrictions; I can only use it locally on our cluster. The only way to stream seems to be to upload it to the Hub. Is that correct (in which case, I am out of luck…)?

Michael Picheny

Hi! You can definitely stream locally, for example by using load_dataset(..., streaming=True) on local files.

I thought you could only use save_to_disk and load_from_disk to create local files, and that load_from_disk does not support streaming…


Yes indeed, save_to_disk creates local files and load_from_disk doesn't support streaming from those local files (yet?).

Only load_dataset does right now.

Thanks. So unless the data is uploaded, there are no options for streaming. Is that correct?


You can save the dataset to parquet locally using to_parquet, and then reload the parquet data in streaming mode using load_dataset 🙂

I have a similar problem… @lhoestq Could you give us a bigger code example of the idea you suggested?

Sure, here you go 🙂

without writing new files:

ids = ds.to_iterable_dataset()  # optionally pass num_shards=... if you want to shuffle later or for parallel loading with a DataLoader

by writing the dataset to parquet and streaming it later:

ds.to_parquet("parquet_dir/data.parquet")
# later
ids = load_dataset("parquet_dir", streaming=True, split="train")

and if your dataset is big you can even save it in shards:

num_shards = 16
for index in range(num_shards):
    shard = ds.shard(num_shards=num_shards, index=index, contiguous=True)
    shard.to_parquet(f"parquet_dir/data-{index:05d}.parquet")
# later
ids = load_dataset("parquet_dir", streaming=True, split="train")