Create own dataset of train and test in separate folders

asennoussi · January 25, 2023, 4:48pm

Hi, a couple of questions:
1- I have a training folder containing thousands of mp3 files, plus a mapping.csv that lists each file's path and its transcription. I also have a second folder called test, with thousands of files and a mapping CSV that likewise lists the path and the transcription.

I’m creating a dataset from local files, but I want to specify that the train data is used for training and the test data for testing when I fine-tune with the newly created dataset.

The documentation doesn’t make clear how the splits are separated, or how to label one folder as train and the other as test.

from datasets import Audio, Dataset

audio_dataset = Dataset.from_dict({"audio": ["path/to/audio_1", "path/to/audio_2", ..., "path/to/audio_n"]}).cast_column("audio", Audio())
audio_dataset[0]["audio"]
{'array': array([ 0.        ,  0.00024414, -0.00024414, ..., -0.00024414,
         0.        ,  0.        ], dtype=float32),
 'path': 'path/to/audio_1',
 'sampling_rate': 16000}

Please help here

lhoestq · January 26, 2023, 9:56am

Hi! This sounds related to the topic "Misunderstanding around creating audio datasets from Local files" (see reply #2 by lhoestq).
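
For the setup described in the question, here is a minimal sketch of one possible approach, not necessarily the one recommended in the linked topic. It assumes each folder contains a mapping.csv with a header row and columns named `path` and `transcription` (the column names, the folder names and the helpers `read_mapping` and `build_splits` are all hypothetical). Each CSV is read into column lists, turned into a `Dataset`, and the two are grouped in a `DatasetDict` under the keys `train` and `test`.

```python
import csv
from pathlib import Path


def read_mapping(csv_path, audio_dir):
    """Read one mapping CSV into column lists.

    Assumes a header row with `path` and `transcription` columns
    (hypothetical names). Relative paths are resolved against
    `audio_dir`; absolute paths are kept as they are.
    """
    columns = {"audio": [], "transcription": []}
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            columns["audio"].append(str(Path(audio_dir) / row["path"]))
            columns["transcription"].append(row["transcription"])
    return columns


def build_splits(train_dir, test_dir, mapping_name="mapping.csv"):
    """Build a DatasetDict with one split per folder."""
    # Imported here so read_mapping works without `datasets` installed.
    from datasets import Audio, Dataset, DatasetDict

    splits = {}
    for split, folder in (("train", train_dir), ("test", test_dir)):
        columns = read_mapping(Path(folder) / mapping_name, folder)
        # cast_column makes the "audio" paths decode lazily on access.
        splits[split] = Dataset.from_dict(columns).cast_column("audio", Audio())
    return DatasetDict(splits)


# Hypothetical usage, with folders named "train" and "test":
# dataset = build_splits("train", "test")
# dataset["train"]  # examples for fine-tuning
# dataset["test"]   # examples for evaluation
```

The split names are simply the keys of the `DatasetDict`, so the folder a file came from is what decides whether it ends up in `dataset["train"]` or `dataset["test"]`.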
