Stream an Audio Dataset That Can't Be Moved to the Hub

I have a large audio dataset that would be much easier to process if it could be streamed. The problem is that I cannot upload it to the Hub because of license restrictions; I can only use it locally on our cluster. The only way to stream seems to be to upload it to the Hub. Is that correct (in which case, I am out of luck…)?

Michael Picheny

Hi! You can definitely stream locally, for example by using load_dataset(..., streaming=True) on local files.

I thought you could only use save_to_disk and load_from_disk to create local files, and that load_from_disk does not support streaming…


Yes indeed, save_to_disk creates local files and load_from_disk doesn't support streaming from those local files (yet?).

Only load_dataset does right now.

Thanks. So unless the data is uploaded, there are no options for streaming. Is that correct?


You can save the dataset to parquet locally using to_parquet, and then reload the parquet data in streaming mode using load_dataset 🙂

I have a similar problem… @lhoestq Could you give us a bigger code example of the idea you suggested?

Sure, here you go 🙂

without writing new files:

ids = ds.to_iterable_dataset()  # optionally pass num_shards=... if you want to shuffle later or for parallel loading with a DataLoader

by writing the dataset to parquet and streaming it later:

ds.to_parquet("parquet_dir/data.parquet")
# later
ids = load_dataset("parquet_dir", streaming=True, split="train")

and if your dataset is big you can even save it in shards:

num_shards = 16
for index in range(num_shards):
    shard = ds.shard(num_shards=num_shards, index=index, contiguous=True)
    shard.to_parquet(f"parquet_dir/data-{index:05d}.parquet")
# later
ids = load_dataset("parquet_dir", streaming=True, split="train")