Hey all,
I’m trying to put together an example of AI training on a dataset that is bigger than both RAM and the system disk, so I’m loading the dataset with streaming=True. I have S3 configured as a filesystem at /home/ubuntu/infinity, which lets me store data that is bigger than the system disk.
from datasets import load_dataset

dataset_name = "Some/LargeDataset"
# Load the train split from the Hugging Face Hub in streaming mode
dataset = load_dataset(dataset_name, split="train", streaming=True)
The problem is that when I do this, the save_to_disk method is no longer available:
> dataset.save_to_disk(f"/home/ubuntu/infinity/raw/{dataset_name}")
AttributeError: 'IterableDataset' object has no attribute 'save_to_disk'
How do I save the data to disk? Wouldn’t it be reasonable to iterate over chunks of entries in the dataset and then save those to disk in a loop?
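Roughly, this is what I have in mind (untested sketch; the chunk size, the shard naming, and writing parquet shards via Dataset.from_list / to_parquet are just my guesses, not something I’ve confirmed is the recommended approach):

import itertools
import os

from datasets import Dataset, load_dataset

dataset_name = "Some/LargeDataset"
out_dir = f"/home/ubuntu/infinity/raw/{dataset_name}"
os.makedirs(out_dir, exist_ok=True)

dataset = load_dataset(dataset_name, split="train", streaming=True)

chunk_size = 50_000  # rows per shard; placeholder, would need tuning to fit in RAM
stream = iter(dataset)

for shard_idx in itertools.count():
    # Pull the next chunk_size examples out of the stream
    buffer = list(itertools.islice(stream, chunk_size))
    if not buffer:
        break
    # Materialize the chunk as a regular in-memory Dataset and write it as one parquet shard
    Dataset.from_list(buffer).to_parquet(f"{out_dir}/shard-{shard_idx:05d}.parquet")

Is that a sensible approach, or is there a built-in way to dump a streaming dataset to disk that I’m missing?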