Problem:
I’m trying to download a dataset that is larger than my local machine’s storage capacity. I have access to HDFS, which has enough space to hold the entire dataset, but I’m having trouble using the save_to_disk method without first downloading the entire dataset locally.
Code Attempt:
from datasets import load_dataset, load_dataset_builder
import fsspec
from tqdm import tqdm

db = load_dataset_builder("...")
parquets = db.config.data_files['train']
hdfs = fsspec.filesystem('hdfs', host='...')

# Load and save one parquet file at a time to stay within local storage
for parq in tqdm(parquets, total=len(parquets)):
    ds = load_dataset("...", data_files=parq)
    ds.save_to_disk("...", fs=hdfs)
Error:
Running the above script raises datasets.exceptions.NonMatchingSplitsSizesError. It seems that loading a single file fails the split-size verification, because the recorded split metadata expects the size of the full train split.
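One workaround I’ve been looking at is disabling the split-size verification when each file is loaded. This is just a sketch based on my reading of the datasets docs (the verification_mode argument); I’m not sure it is the intended fix:

from datasets import load_dataset

# Skip split-size verification so a single parquet file can be loaded
# without having to match the recorded size of the full train split
ds = load_dataset("...", data_files=parq, verification_mode="no_checks")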
Questions:
- Is there a way to use save_to_disk without downloading the entire dataset first?
- Are there alternative methods to download a large dataset directly to HDFS? (Rough sketch of what I have in mind below.)
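For the second question, here is roughly what I imagine a direct-to-HDFS save could look like. I’m assuming save_to_disk accepts an fsspec URI plus storage_options (the fs argument appears to be deprecated in newer datasets releases); the namenode host, port, and paths here are placeholders:

from datasets import load_dataset

for parq in parquets:
    ds = load_dataset("...", data_files=parq, verification_mode="no_checks")
    # Write each shard to its own directory so iterations don't overwrite each other
    shard_name = parq.split("/")[-1]
    ds.save_to_disk(
        "hdfs://namenode:8020/datasets/my_dataset/" + shard_name,
        storage_options={"host": "namenode", "port": 8020},
    )

If that isn’t supported, streaming the dataset and writing the files to HDFS myself would be my fallback.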
Any insights or suggestions would be greatly appreciated. Thank you!