"FileNotFoundError: [Errno 2] No such file or directory" when loading custom split dataset from hub

Hi, I made a custom train/valid/test split of the VCTK dataset after downsampling it to 16 kHz and keeping only the first 44,455 files (the mic1 files), as follows:

from datasets import load_dataset

vctk = load_dataset("vctk")
mic1_vctk = vctk.filter(lambda e, i: i < 44455, with_indices=True)

# downsample dataset sampling rate to 16khz
from datasets import Audio

mic1_vctk = mic1_vctk.cast_column("audio", Audio(sampling_rate=16_000))

# split dataset into train/valid/test 80/10/10
from datasets import DatasetDict

# first, 80% train, 20% test + valid
train_test_vctk = mic1_vctk["train"].train_test_split(shuffle=True, seed=200, test_size=0.2)
# split 20% test + valid in half
test_valid_vctk = train_test_vctk['test'].train_test_split(shuffle=True, seed=200, test_size=0.50)
# gather for single DatasetDict
train_test_val_dataset_vctk = DatasetDict({
    'train': train_test_vctk['train'],
    'test': test_valid_vctk['test'],
    'dev': test_valid_vctk['train']})
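
Printing the resulting DatasetDict is a quick sanity check that the 80/10/10 split came out as expected before pushing:

# each split should report its own number of rows
print(train_test_val_dataset_vctk)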

I then pushed train_test_val_dataset_vctk to a private dataset repo on the hub.

train_test_val_dataset_vctk.push_to_hub("REPO_NAME")

Three Parquet files, one per split, were uploaded to the hub, along with a dataset_infos.json file. However, when I try to reload the pushed dataset and preview it, I get a "FileNotFoundError: [Errno 2] No such file or directory" for the .flac files:

dataset = load_dataset("seraphina/REPO_NAME", use_auth_token=True)

dataset["train"][0]

FileNotFoundError: [Errno 2] No such file or directory: '/root/.cache/huggingface/datasets/downloads/extracted/11c6712f5ee425ed0d423d179eab7ae93e392fe0e12540a1aa9929325a31ac71/wav48_silence_trimmed/p247/p247_202_mic1.flac'

How can I split the original VCTK dataset and ensure that all files are saved to the hub for me to reload in the future?

Hi! I tested this code with the most recent release of datasets, and it works as expected. Can you please update your installation of datasets (pip install -U datasets) and try again?
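
After upgrading, you can confirm which version your runtime actually picks up before retrying, e.g.:

import datasets
print(datasets.__version__)  # should show the latest release after restarting the runtime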

Hi! I've tried again, but now I am unable to push the dataset to the hub because my Google Colab session crashes after using all available RAM during the push. Is there a recommended way to avoid exhausting RAM while running push_to_hub?

Each shard is loaded into memory before being uploaded. You can try reducing max_shard_size in push_to_hub(); the default is 500 MB.
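
For example, something along these lines (the 200MB value is just an illustration; pick whatever fits your Colab's RAM):

# smaller shards mean less data held in memory at once during the upload
train_test_val_dataset_vctk.push_to_hub("REPO_NAME", max_shard_size="200MB")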

I trained the model in Colab and used pickle.dump to save the model file. The next day, I tried to load the model file and got the same error, even though os.getcwd() shows the file exists. I don't know why Python goes through a temporary folder, '/root/.cache/huggingface/datasets/csv/outpatient-edea352f28a0221a/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1/csv-train.arrow', which is referenced in the error message. It seems that when the runtime is restarted, some temporary folders and files are lost, and Python then tries to access them.

Here is the error message:
'/root/.cache/huggingface/datasets/csv/outpatient-edea352f28a0221a/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1/csv-train.arrow'

If I retrain the model and save it again, pickle.load works without any error.
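
A possible workaround (a minimal sketch; the /content/drive path and the dataset name are only illustrative and assume a mounted Google Drive) is to write the processed dataset to persistent storage with save_to_disk and reload it with load_from_disk after a restart, instead of relying on the ephemeral cache:

from datasets import load_from_disk

# save to a location that survives a runtime restart (path is illustrative)
dataset.save_to_disk("/content/drive/MyDrive/outpatient_dataset")

# after restarting the runtime, reload without touching the cache
dataset = load_from_disk("/content/drive/MyDrive/outpatient_dataset")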