Hi, I made a custom train/valid/test split of the VCTK dataset after downsampling it to 16 kHz and keeping only the first 44,455 files (the mic1 files), as follows:
from datasets import load_dataset
vctk = load_dataset("vctk")
mic1_vctk = vctk.filter(lambda e, i: i<44455, with_indices=True)
# downsample the audio column to a 16 kHz sampling rate
from datasets import Audio
mic1_vctk = mic1_vctk.cast_column("audio", Audio(sampling_rate=16_000))
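As a quick sanity check (not part of the pipeline, just to confirm the cast), indexing a single example decodes and resamples it on the fly, so it should report the new rate:
# decoding one example should now show the resampled rate
print(mic1_vctk["train"][0]["audio"]["sampling_rate"])  # expected: 16000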
# split dataset into train/valid/test 80/10/10
from datasets import DatasetDict
# first, 80% train, 20% test + valid
train_test_vctk = mic1_vctk["train"].train_test_split(shuffle=True, seed=200, test_size=0.2)
# split 20% test + valid in half
test_valid_vctk = train_test_vctk['test'].train_test_split(shuffle=True, seed=200, test_size=0.50)
# gather for single DatasetDict
train_test_val_dataset_vctk = DatasetDict({
    'train': train_test_vctk['train'],
    'test': test_valid_vctk['test'],
    'dev': test_valid_vctk['train']})
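Printing the resulting DatasetDict (just to double-check the 80/10/10 proportions before uploading) lists the three splits with their row counts:
# sanity check: shows 'train', 'test' and 'dev' with their num_rows
print(train_test_val_dataset_vctk)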
I then pushed train_test_val_dataset_vctk to a private dataset repo on the Hub:
train_test_val_dataset_vctk.push_to_hub("REPO_NAME")
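(The repo itself is private; in case it matters, I believe this is equivalent to creating it as private in the same call via the private flag:)
# equivalent, I think, to pushing and marking the repo private at once
train_test_val_dataset_vctk.push_to_hub("REPO_NAME", private=True)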
Three Parquet files, one per split, were uploaded to the Hub, along with a dataset_infos.json file. However, when I try to reload the dataset that was pushed to the Hub, I get a "FileNotFoundError: [Errno 2] No such file or directory" for the .flac files when trying to preview the dataset:
dataset = load_dataset("seraphina/REPO_NAME", use_auth_token=True)
dataset["train"][0]
FileNotFoundError: [Errno 2] No such file or directory: '/root/.cache/huggingface/datasets/downloads/extracted/11c6712f5ee425ed0d423d179eab7ae93e392fe0e12540a1aa9929325a31ac71/wav48_silence_trimmed/p247/p247_202_mic1.flac'
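For what it's worth, the path in the error points into my local datasets cache rather than the Hub repo. One way to inspect what was actually stored in the audio column of the uploaded dataset, without triggering decoding (and hence the error), should be to disable decoding on the column, e.g.:
from datasets import Audio
# with decode=False, indexing returns the raw {"path": ..., "bytes": ...} dict
# instead of trying to open and decode the .flac file
inspect = dataset["train"].cast_column("audio", Audio(decode=False))
print(inspect[0]["audio"])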
How can I split the original VCTK dataset and make sure that all of the audio files are saved to the Hub so that I can reload the dataset in the future?