"FileNotFoundError: [Errno 2] No such file or directory" when loading custom split dataset from hub

Hi, I made a custom train/valid/test split of the VCTK dataset after downsampling it to 16 kHz and taking only the first 44,455 files (the mic1 files), as follows:

from datasets import load_dataset

vctk = load_dataset("vctk")
mic1_vctk = vctk.filter(lambda e, i: i<44455, with_indices=True)

# downsample the dataset's sampling rate to 16 kHz
from datasets import Audio

mic1_vctk = mic1_vctk.cast_column("audio", Audio(sampling_rate=16_000))

# split dataset into train/valid/test 80/10/10
from datasets import DatasetDict

# first, 80% train, 20% test + valid
train_test_vctk = mic1_vctk["train"].train_test_split(shuffle=True, seed=200, test_size=0.2)
# split the 20% test + valid in half
test_valid_vctk = train_test_vctk["test"].train_test_split(shuffle=True, seed=200, test_size=0.5)
# gather for single DatasetDict
train_test_val_dataset_vctk = DatasetDict({
    'train': train_test_vctk['train'],
    'test': test_valid_vctk['test'],
    'dev': test_valid_vctk['train']})
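As a sanity check on the 80/10/10 arithmetic above (pure Python, no datasets dependency; the library's internal rounding may differ by an example or so):

```python
n = 44455                    # mic1 files kept from VCTK
n_test_valid = round(n * 0.2)  # 20% held out -> 8891
n_train = n - n_test_valid     # 80% train -> 35564
n_test = n_test_valid // 2     # half of the holdout -> 4445
n_dev = n_test_valid - n_test  # remaining half -> 4446
print(n_train, n_test, n_dev)  # 35564 4445 4446
```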

I then pushed train_test_val_dataset_vctk to a private dataset repo on the hub.


Three Parquet files, one per split, were uploaded to the hub, along with a dataset_infos.json file. However, when I try to reload the dataset that was pushed to the hub, I get a “FileNotFoundError: [Errno 2] No such file or directory” for the .flac files:

dataset = load_dataset("seraphina/REPO_NAME", use_auth_token=True)


FileNotFoundError: [Errno 2] No such file or directory: '/root/.cache/huggingface/datasets/downloads/extracted/11c6712f5ee425ed0d423d179eab7ae93e392fe0e12540a1aa9929325a31ac71/wav48_silence_trimmed/p247/p247_202_mic1.flac'

How can I split the original VCTK dataset and ensure that all files are saved to the hub for me to reload in the future?

Hi! I tested this code with the most recent release of datasets, and it works as expected. Can you please update your installation of datasets (pip install -U datasets) and try again?

Hi! I’ve tried again, but now I am unable to push the dataset to the hub: my Google Colab session crashes after using all available RAM during the push. Is there a recommended way to avoid exhausting RAM while running push_to_hub?

Each shard is brought into memory before uploading. You can try reducing max_shard_size in push_to_hub(); the default is 500MB.
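In other words, a call like the one commented out below (200MB is just an illustrative value). The small helper is not the library's code, only a rough sketch of how the shard count, and hence the peak per-shard memory, scales with max_shard_size:

```python
import math

def num_shards(dataset_bytes: int, max_shard_bytes: int) -> int:
    """Rough shard count: each shard holds at most max_shard_bytes."""
    return max(1, math.ceil(dataset_bytes / max_shard_bytes))

# With the 500MB default, a ~2GB dataset uploads in 4 shards;
# lowering max_shard_size to 200MB gives 10 smaller shards,
# each needing less RAM while it is built and uploaded.
print(num_shards(2_000_000_000, 500_000_000))  # 4
print(num_shards(2_000_000_000, 200_000_000))  # 10

# The actual call (needs the dataset and an auth token, so commented out):
# train_test_val_dataset_vctk.push_to_hub("seraphina/REPO_NAME", max_shard_size="200MB")
```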