How do you save an IterableDataset to disk?

Hey all,

I’m working on an example of AI training with datasets that are bigger than both RAM and the system disk. This means I’m loading datasets with streaming=True. I have S3 mounted as a filesystem at /home/ubuntu/infinity, which lets me store data that is bigger than the system disk.

from datasets import load_dataset

dataset_name = "Some/LargeDataset"

# Load the train split from the Hugging Face Hub in streaming mode
dataset = load_dataset(dataset_name, split="train", streaming=True)

The problem is that when I do this, the save_to_disk method is no longer usable:

> dataset.save_to_disk(f"/home/ubuntu/infinity/raw/{dataset_name}")
AttributeError: 'IterableDataset' object has no attribute 'save_to_disk'

How do I save the data to disk? Wouldn’t it be reasonable to iterate over chunks of entries in the dataset and then save those to disk in a loop?
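One constant-memory way to do that — a sketch in plain Python, not a built-in `datasets` API; the helper name and the JSON-lines layout are my own assumptions — is to iterate the stream and append each record to a file as you go, so only one record is ever held in memory:

```python
import json
import tempfile
from pathlib import Path

def stream_to_jsonl(records, out_path):
    """Append each record from an iterator to a JSON-lines file,
    holding only one record in memory at a time."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
            count += 1
    return count

# Demo with fake records; with a streaming dataset you would
# pass iter(dataset) instead.
fake_stream = ({"id": i, "text": f"example {i}"} for i in range(5))
out = Path(tempfile.mkdtemp()) / "train.jsonl"
n = stream_to_jsonl(fake_stream, out)
print(n)  # → 5
```

Pointing `out_path` at the S3-backed mount would keep the local disk out of the picture entirely.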


It seems that it is necessary to convert it to a normal Dataset first.

That seems kind of unsatisfying for this use case. Would I have to shard it or something? I.e., collect up to n (where n >= 50,000) records into an in-memory Dataset and then save/load those?
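That sharding idea can be sketched in plain Python (the `shard_size`, the file naming, and the JSON-lines format here are illustrative assumptions; each shard could instead be turned into an in-memory Dataset with `Dataset.from_list(shard)` and saved):

```python
import itertools
import json
import tempfile
from pathlib import Path

def write_shards(records, out_dir, shard_size):
    """Group an iterator into shards of at most `shard_size` records
    and write each shard as its own JSON-lines file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    it = iter(records)
    paths = []
    for shard_idx in itertools.count():
        shard = list(itertools.islice(it, shard_size))
        if not shard:
            break
        path = out_dir / f"shard-{shard_idx:05d}.jsonl"
        with open(path, "w", encoding="utf-8") as f:
            for record in shard:
                f.write(json.dumps(record) + "\n")
        paths.append(path)
    return paths

# Demo: 12 fake records in shards of 5 → 3 shard files (5, 5, 2 records).
fake_stream = ({"id": i} for i in range(12))
shards = write_shards(fake_stream, tempfile.mkdtemp(), shard_size=5)
print(len(shards))  # → 3
```

Only one shard is ever in memory at a time, and the resulting files should be reloadable with the JSON loader, e.g. load_dataset("json", data_files=[...]).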


Hmm, it’s quite a problem when the dataset is large…