Stream Audio Dataset that Can't be moved to Hub

I have a large audio dataset that might be much easier to process if it could be streamed. The problem is that I cannot upload it to the Hub because of license restrictions; I can only use it locally on our cluster. The only way to stream seems to be to upload it to the Hub. Is that correct (in which case, I am out of luck…)?

Thanks
Michael Picheny

Hi! You can definitely stream locally, for example using load_dataset(..., streaming=True) on local files.

I thought you could only use save_to_disk and load_from_disk to create local files, and that load_from_disk does not support streaming…

Thanks
Michael

Yes indeed, save_to_disk creates local files and load_from_disk doesn't support streaming from those local files (yet?).

Only load_dataset does right now.

Thanks. So unless the data is uploaded, there are no options for streaming. Is that correct?

Best
Michael

You can save the dataset to parquet locally using to_parquet, and then reload the parquet data in streaming mode using load_dataset :slight_smile:

I have a similar problem… @lhoestq Could you give us a bigger code example of the idea you suggested?

Sure, here you go :slight_smile:

without writing new files:

ids = ds.to_iterable_dataset()  # optionally pass num_shards= if you want to shuffle later or for parallel loading with a dataloader

by writing dataset to parquet and stream it later:

ds.to_parquet("parquet_dir/train.parquet")
# later
ids = load_dataset("parquet_dir", streaming=True, split="train")

and if your dataset is big you can even save it in shards:

num_shards = 16
for index in range(num_shards):
    shard = ds.shard(num_shards=num_shards, index=index, contiguous=True)
    shard.to_parquet(f"parquet_dir/train-{index:05d}-of-{num_shards:05d}.parquet")
# later
ids = load_dataset("parquet_dir", streaming=True, split="train")