How to create an audio dataset from local files already split into train and test without losing labels

If I have audio clips already split into train and test sets, what is the easiest way to create a dataset from local files? I have looked at "Misunderstanding around creating audio datasets", but with every approach I have tried the labels are lost. Currently I have:

  • audio files all in one ‘/project/data’ folder
  • a test.csv file with headings audio_file and label containing the metadata for the test split in /project
  • a train.csv, formatted and located as for test.csv

I then tried:

from datasets import load_dataset

project_dataset = load_dataset("/project", data_dir="data")
project_dataset
DatasetDict({
    train: Dataset({
        features: ['audio'],
        num_rows: 1000
    })
    test: Dataset({
        features: ['audio'],
        num_rows: 369
    })
})
project_dataset['train'][0]
{'audio': {'path': 'C:\\project\\data\\train_0_50_1.wav',
  'array': array([-0.01461792, -0.0177002 , -0.009552  , ..., -0.03012085,
         -0.01861572, -0.01263428]),
  'sampling_rate': 16000}}

Superficially this works: it creates the separate splits with the correct files in them, so it must be reading the metadata files, but the labels from the metadata are lost. Where are they being lost, and how can I retain them? Thanks for any help.

Simon

I eventually got this to work with all audio in a single 'data' folder and a single metadata.csv file in the /project folder. The metadata.csv contained both test and train samples, with the column headings file_name and label (not audio_file as recommended elsewhere). Then:

load_dataset("audiofolder", data_dir="/project")
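
For anyone starting from separate train.csv and test.csv files as in the question, here is a minimal sketch of how the single metadata.csv could be produced. It uses only the standard library. The function name merge_split_csvs and the audio_subdir parameter are my own inventions for illustration, and I am assuming that file_name should be a path relative to the folder holding metadata.csv, hence the "data/" prefix. If your CSVs already contain relative paths, pass audio_subdir="".

```python
import csv
from pathlib import Path


def merge_split_csvs(project_dir, split_csvs=("train.csv", "test.csv"), audio_subdir="data"):
    """Combine per-split CSVs (audio_file, label) into one metadata.csv (file_name, label).

    Hypothetical helper: the name and the audio_subdir parameter are assumptions,
    not part of the datasets library.
    """
    project_dir = Path(project_dir)
    rows = []
    for csv_name in split_csvs:
        with open(project_dir / csv_name, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                # Rename audio_file -> file_name and make the path relative
                # to the folder that will hold metadata.csv.
                file_name = (Path(audio_subdir) / row["audio_file"]).as_posix()
                rows.append({"file_name": file_name, "label": row["label"]})

    out_path = project_dir / "metadata.csv"
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["file_name", "label"])
        writer.writeheader()
        writer.writerows(rows)
    return out_path
```

Calling merge_split_csvs("/project") would write /project/metadata.csv, after which the load_dataset("audiofolder", ...) call above can be pointed at /project.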
