How do you save an IterableDataset to disk?

Hey all,

I’m working on an example of AI training with datasets that are bigger than both RAM and the system disk. This means I’m loading datasets with streaming=True. I have S3 mounted as a filesystem at /home/ubuntu/infinity, which lets me store data that is bigger than the system disk.

from datasets import load_dataset

dataset_name = "Some/LargeDataset"

# Load the train split from the Hugging Face Hub in streaming mode
dataset = load_dataset(dataset_name, split="train", streaming=True)

The problem is that when I do this, the save_to_disk method is no longer usable:

> dataset.save_to_disk(f"/home/ubuntu/infinity/raw/{dataset_name}")
AttributeError: 'IterableDataset' object has no attribute 'save_to_disk'

How do I save the data to disk? Wouldn’t it be reasonable to iterate over chunks of entries in the dataset and then save those to disk in a loop?
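One constant-memory way to do that — a sketch in plain Python, not a built-in `datasets` API; the helper name and the JSON-lines layout are my own assumptions — is to iterate the stream and append each record to a file as you go, so only one record is ever held in memory:

```python
import json
import tempfile
from pathlib import Path

def stream_to_jsonl(records, out_path):
    """Append each record from an iterator to a JSON-lines file,
    holding only one record in memory at a time."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
            count += 1
    return count

# Demo with fake records; with a streaming dataset you would
# pass iter(dataset) instead.
fake_stream = ({"id": i, "text": f"example {i}"} for i in range(5))
out = Path(tempfile.mkdtemp()) / "train.jsonl"
n = stream_to_jsonl(fake_stream, out)
print(n)  # → 5
```

Pointing `out_path` at the S3-backed mount would keep the local disk out of the picture entirely.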


It seems that it is necessary to convert it to a normal Dataset first.

That seems kind of unsatisfying for this use case. Would I have to shard it or something? I.e., collect up to n (where n >= 50,000) records into an in-memory Dataset and then save/load those?
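That sharding idea can be sketched in plain Python (the `shard_size`, the file naming, and the JSON-lines format here are illustrative assumptions; each shard could instead be turned into an in-memory Dataset with `Dataset.from_list(shard)` and saved):

```python
import itertools
import json
import tempfile
from pathlib import Path

def write_shards(records, out_dir, shard_size):
    """Group an iterator into shards of at most `shard_size` records
    and write each shard as its own JSON-lines file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    it = iter(records)
    paths = []
    for shard_idx in itertools.count():
        shard = list(itertools.islice(it, shard_size))
        if not shard:
            break
        path = out_dir / f"shard-{shard_idx:05d}.jsonl"
        with open(path, "w", encoding="utf-8") as f:
            for record in shard:
                f.write(json.dumps(record) + "\n")
        paths.append(path)
    return paths

# Demo: 12 fake records in shards of 5 → 3 shard files (5, 5, 2 records).
fake_stream = ({"id": i} for i in range(12))
shards = write_shards(fake_stream, tempfile.mkdtemp(), shard_size=5)
print(len(shards))  # → 3
```

Only one shard is ever in memory at a time, and the resulting files should be reloadable with the JSON loader, e.g. load_dataset("json", data_files=[...]).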


Hmm, it’s quite a problem when the dataset is large…