Problem:
I’m trying to download a dataset that is larger than my local machine’s storage capacity. I have access to HDFS, which has enough space to hold the entire dataset, but I’m having trouble using the save_to_disk method without first downloading the entire dataset locally.
Code Attempt:
from datasets import load_dataset, load_dataset_builder
import fsspec
from tqdm import tqdm

db = load_dataset_builder("...")
parquets = db.config.data_files['train']
hdfs = fsspec.filesystem('hdfs', host='...')

# Load and save one parquet file at a time to stay within local storage
for parq in tqdm(parquets, total=len(parquets)):
    ds = load_dataset("...", data_files=parq)
    ds.save_to_disk("...", fs=hdfs)
Error:
Running the above script raises datasets.exceptions.NonMatchingSplitsSizesError. It seems that loading a single file fails the split-size verification, because the recorded split metadata expects the size of the full train split.
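One workaround I’ve been looking at is disabling the split-size verification when each file is loaded. This is just a sketch based on my reading of the datasets docs (the verification_mode argument); I’m not sure it is the intended fix:

from datasets import load_dataset

# Skip split-size verification so a single parquet file can be loaded
# without having to match the recorded size of the full train split
ds = load_dataset("...", data_files=parq, verification_mode="no_checks")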
Questions:
- Is there a way to use save_to_disk without downloading the entire dataset first?
- Are there alternative methods to download a large dataset directly to HDFS? (Rough sketch of what I have in mind below.)
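For the second question, here is roughly what I imagine a direct-to-HDFS save could look like. I’m assuming save_to_disk accepts an fsspec URI plus storage_options (the fs argument appears to be deprecated in newer datasets releases); the namenode host, port, and paths here are placeholders:

from datasets import load_dataset

for parq in parquets:
    ds = load_dataset("...", data_files=parq, verification_mode="no_checks")
    # Write each shard to its own directory so iterations don't overwrite each other
    shard_name = parq.split("/")[-1]
    ds.save_to_disk(
        "hdfs://namenode:8020/datasets/my_dataset/" + shard_name,
        storage_options={"host": "namenode", "port": 8020},
    )

If that isn’t supported, streaming the dataset and writing the files to HDFS myself would be my fallback.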
Any insights or suggestions would be greatly appreciated. Thank you!