Downloading Large Dataset to HDFS: Issues with save_to_disk Method

Problem:
I’m trying to download a dataset that is larger than my local machine’s storage capacity. I have access to HDFS, which has enough space to hold the entire dataset, but I can’t see how to use the save_to_disk method without first downloading the whole dataset to local disk.

Code Attempt:

import fsspec
from datasets import load_dataset, load_dataset_builder
from tqdm import tqdm

db = load_dataset_builder("...")
parquets = db.config.data_files['train']  # list of parquet shards in the train split
hdfs = fsspec.filesystem('hdfs', host='...')
for parq in tqdm(parquets, total=len(parquets)):
    # load one shard at a time, then write it straight out to HDFS
    ds = load_dataset("...", data_files=parq)
    ds.save_to_disk("...", fs=hdfs)
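
(For what it’s worth, I believe recent datasets releases deprecate the fs= argument of save_to_disk in favor of a URI plus storage_options, so perhaps the save call should instead be spelled something like the following; I haven’t verified this, and the host/paths are elided as above.)

# untested on my end; storage_options are taken from the fsspec filesystem
ds.save_to_disk("hdfs://...", storage_options=hdfs.storage_options)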

Error:
Running the above script raises datasets.exceptions.NonMatchingSplitsSizesError. As far as I can tell, the error comes from load_dataset itself: because I pass only a single file via data_files, the loaded split doesn’t match the split size recorded in the dataset’s metadata, and the verification step fails.
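
One workaround I’ve been considering, though I’m not sure it’s the right fix, is to skip split verification when loading each shard (as I understand it, verification_mode replaced the older ignore_verifications flag):

# my assumption: with verification disabled, a single shard can load
# without matching the split size recorded in the dataset metadata
ds = load_dataset("...", data_files=parq, verification_mode="no_checks")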

Questions:

  1. Is there a way to use save_to_disk without downloading the entire dataset first?
  2. Are there alternative methods to download a large dataset directly to HDFS? (I sketch one idea I’ve considered just below.)
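
To make question 2 concrete, the closest alternative I’ve come up with is streaming the dataset and writing parquet shards to HDFS myself, roughly as below. The host, repo id, part-file naming, and batch size are all placeholders of mine, and I haven’t verified this end to end:

import fsspec
import pyarrow as pa
import pyarrow.parquet as pq
from datasets import load_dataset

hdfs = fsspec.filesystem('hdfs', host='...')
# streaming=True iterates over the remote files without saving them locally
stream = load_dataset("...", split='train', streaming=True)

for i, batch in enumerate(stream.iter(batch_size=10_000)):
    # each batch is a dict mapping column name -> list of values
    table = pa.Table.from_pydict(batch)
    with hdfs.open(f".../part-{i:05d}.parquet", 'wb') as f:
        pq.write_table(table, f)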

Any insights or suggestions would be greatly appreciated. Thank you!