If I have audio clips already split into train and test sets, what is the easiest way to create a dataset from local files? I have looked at "Misunderstanding around creating audio datasets", but with every approach I have tried the labels are lost. Currently I have:
- all audio files in a single /project/data folder
- a test.csv file in /project with columns audio_file and label, containing the metadata for the test split
- a train.csv, formatted and located as for test.csv
I then tried:
from datasets import load_dataset
project_dataset = load_dataset("/project", data_dir="data")
project_dataset
DatasetDict({
    train: Dataset({
        features: ['audio'],
        num_rows: 1000
    })
    test: Dataset({
        features: ['audio'],
        num_rows: 369
    })
})
project_dataset['train'][0]
{'audio': {'path': 'C:\\project\\data\\train_0_50_1.wav',
           'array': array([-0.01461792, -0.0177002 , -0.009552 , ..., -0.03012085,
                           -0.01861572, -0.01263428]),
           'sampling_rate': 16000}}
Superficially this works: it creates separate splits with the correct files in them, so it must be reading the metadata files, but the labels from the metadata are lost. Where are they being lost, and how can I retain them? Thanks for your help.
Simon
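
One possible route, sketched here rather than confirmed for this setup: the audiofolder loader in datasets picks up extra columns from a metadata.csv that sits next to the audio files and has a file_name column. The stdlib-only sketch below rearranges the files described above into that layout. The helper name build_audiofolder_layout, the output folder, and the assumption that audio_file holds bare file names found in /project/data are illustrative and not taken from the original post.

```python
import csv
import shutil
from pathlib import Path


def build_audiofolder_layout(project_dir, out_dir, splits=("train", "test")):
    """Copy clips into <out_dir>/<split>/ and write a metadata.csv beside them.

    Hypothetical helper. Assumes <project_dir>/<split>.csv has the columns
    audio_file and label, and that audio_file is a bare file name that
    exists in <project_dir>/data.
    """
    project_dir, out_dir = Path(project_dir), Path(out_dir)
    for split in splits:
        split_dir = out_dir / split
        split_dir.mkdir(parents=True, exist_ok=True)

        # Read the existing per-split metadata (audio_file, label).
        with open(project_dir / f"{split}.csv", newline="") as src:
            rows = list(csv.DictReader(src))

        with open(split_dir / "metadata.csv", "w", newline="") as dst:
            writer = csv.writer(dst)
            # The audiofolder loader matches rows to files via "file_name".
            writer.writerow(["file_name", "label"])
            for row in rows:
                name = Path(row["audio_file"]).name
                shutil.copy2(project_dir / "data" / name, split_dir / name)
                writer.writerow([name, row["label"]])
    return out_dir
```

After running it with, say, build_audiofolder_layout("/project", "/project/audiofolder"), a call such as load_dataset("audiofolder", data_dir="/project/audiofolder") should expose a label column alongside audio. That last call is untested here.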