Create own dataset of train and test in separate folders

Hi, a couple of questions:
1- I have a training folder with thousands of mp3 files and a mapping.csv that lists each file's path and transcription. I also have another folder called test with thousands of files and its own mapping CSV that lists the path and transcription.

I’m creating a dataset from local files, but I want to specify that the train data is for training and the test data is for testing when I fine-tune with the newly created dataset.

The documentation doesn’t make clear how the splits are separated, or how to label one folder as train and the other as test.

from datasets import Audio, Dataset

audio_dataset = Dataset.from_dict({"audio": ["path/to/audio_1", "path/to/audio_2", ..., "path/to/audio_n"]}).cast_column("audio", Audio())
audio_dataset[0]["audio"]
{'array': array([ 0.        ,  0.00024414, -0.00024414, ..., -0.00024414,
         0.        ,  0.        ], dtype=float32),
 'path': 'path/to/audio_1',
 'sampling_rate': 16000}

Please help here


Hi! This sounds related to the thread "Misunderstanding around creating audio datasets from Local files" (see reply #2 by lhoestq).
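
For anyone landing here, below is a minimal sketch of one way to get named train and test splits from two folders. It is not taken from the linked thread. It assumes each mapping CSV has a header row with the columns `path` and `transcription` (the real column names are not given above, so adjust them), and that the paths in the CSV resolve from the current working directory. The helper names `read_mapping` and `build_splits` are made up for illustration. The idea is to build one `Dataset` per CSV and put both in a `DatasetDict`, whose keys act as the split names.

```python
import csv


def read_mapping(csv_path):
    """Read a mapping CSV into column lists suitable for Dataset.from_dict."""
    columns = {"audio": [], "transcription": []}
    with open(csv_path, newline="", encoding="utf-8") as f:
        # Assumed header row: path,transcription (hypothetical column names)
        for row in csv.DictReader(f):
            columns["audio"].append(row["path"])
            columns["transcription"].append(row["transcription"])
    return columns


def build_splits(train_csv, test_csv, sampling_rate=16000):
    """Build a DatasetDict with a "train" and a "test" split."""
    # Imported here so read_mapping can be used without the datasets library.
    from datasets import Audio, Dataset, DatasetDict

    splits = DatasetDict({
        "train": Dataset.from_dict(read_mapping(train_csv)),
        "test": Dataset.from_dict(read_mapping(test_csv)),
    })
    # Treat the path strings as audio; files are decoded (and resampled
    # to sampling_rate) when a row is accessed.
    return splits.cast_column("audio", Audio(sampling_rate=sampling_rate))
```

With hypothetical paths, `splits = build_splits("train/mapping.csv", "test/mapping.csv")` would give `splits["train"]` and `splits["test"]`, which can then be passed separately as the training and evaluation sets during fine-tuning. The 16000 Hz default is only an assumption matching the sampling rate shown in the output above.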