How to create an audio dataset from local files already split into train and test without losing labels

SimonGillings · March 17, 2024, 1:17pm

If I have audio clips already split into train and test sets, what is the easiest way to create a dataset from local files? I have looked at Misunderstanding around creating audio datasets, but with every approach I have tried the labels are lost. Currently I have:

audio files all in one ‘/project/data’ folder
a test.csv file with headings audio_file and label containing the metadata for the test split in /project
a train.csv, formatted and located as for test.csv

I then tried:

project_dataset = load_dataset("/project", data_dir="data")
project_dataset
DatasetDict({
    train: Dataset({
        features: ['audio'],
        num_rows: 1000
    })
    test: Dataset({
        features: ['audio'],
        num_rows: 369
    })
})
project_dataset['train'][0]
{'audio': {'path': 'C:\\project\\data\\train_0_50_1.wav',
  'array': array([-0.01461792, -0.0177002 , -0.009552  , ..., -0.03012085,
         -0.01861572, -0.01263428]),
  'sampling_rate': 16000}}

Superficially this works: it creates the separate splits with the correct files in them, so it must be reading the metadata files, but the labels from the metadata are lost. Where are they being lost, and how can I retain them? Thanks for any help.

Simon

SimonGillings · March 17, 2024, 9:50pm

I eventually got this to work with all audio in a single ‘data’ folder and a single metadata.csv file in the /project folder. The metadata.csv contained both test and train samples, with the column headings file_name and label (not audio_file as recommended elsewhere). Then:

load_dataset("audiofolder", data_dir="/project")
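
For anyone starting from the two-CSV layout in the first post, here is a minimal sketch of how the single metadata.csv could be produced. It is not the exact script used above. The function name build_metadata and its parameters are hypothetical, and it assumes the audio_file column holds bare file names, so the ‘data’ sub-folder is prepended to make each path relative to metadata.csv.

```python
import csv
from pathlib import Path


def build_metadata(project_dir, split_csvs=("train.csv", "test.csv"),
                   audio_subdir="data", source_column="audio_file"):
    """Merge per-split CSVs into one metadata.csv with file_name and label columns.

    Hypothetical helper: names and defaults are illustrative assumptions.
    """
    project = Path(project_dir)
    rows = []
    for name in split_csvs:
        with open(project / name, newline="", encoding="utf-8") as handle:
            for record in csv.DictReader(handle):
                # Assumption: the CSV stores bare file names, so the audio
                # sub-folder is prepended to give a path relative to metadata.csv.
                relative = Path(audio_subdir, record[source_column]).as_posix()
                rows.append({"file_name": relative, "label": record["label"]})

    # Write the merged file next to the split CSVs, in the project folder.
    out_path = project / "metadata.csv"
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["file_name", "label"])
        writer.writeheader()
        writer.writerows(rows)
    return out_path
```

Calling build_metadata("/project") once should leave a metadata.csv in /project that the load_dataset call above can pick up. The merged file has no split column. In the first attempt the train/test splits appear to have come from the train_ and test_ prefixes in the file names rather than from the CSVs, so this sketch relies on the same behaviour; treat that as an assumption and check the resulting DatasetDict.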

system · March 18, 2024, 9:51am

This topic was automatically closed 12 hours after the last reply. New replies are no longer allowed.
