Loading train and test splits with `audiofolder`

lewtun · September 2, 2022, 9:49am

Hi folks,

I’m using the new audiofolder feature (docs) to load audio files from this Kaggle dataset. My folder structure has the following form:

.
├── test
│   ├── 01.ogg
│   ├── 02.ogg
│   ├── ...
│   └── metadata.csv
└── train
    ├── 01.ogg
    ├── 02.ogg
    ├── ...
    └── metadata.csv

In particular, the train and test splits have different features. Now, when I try to load both splits at once with:

from datasets import load_dataset

ds = load_dataset("audiofolder", data_dir="path/to/dir")

I get the following error:

ValueError: Metadata files /Users/lewtun/Downloads/kaggle-pog-series-s01e02/test/metadata.csv and /Users/lewtun/Downloads/kaggle-pog-series-s01e02/train/metadata.csv have different features: ('/Users/lewtun/Downloads/kaggle-pog-series-s01e02/train/metadata.csv', {'song_id': Value(dtype='int64', id=None), 'file_name': Value(dtype='string', id=None), 'filepath': Value(dtype='string', id=None), 'genre_id': Value(dtype='int64', id=None), 'genre': Value(dtype='string', id=None)}) != {'song_id': Value(dtype='int64', id=None), 'file_name': Value(dtype='string', id=None), 'filepath': Value(dtype='string', id=None)}

This suggests that audiofolder is trying to combine the train and test splits as a single dataset - is this correct?

I am able to load each split separately, but my question is: can audiofolder load multiple splits as a DatasetDict object?

nielsr · September 2, 2022, 9:56am

For ImageFolder, it works as follows:

dataset = load_dataset("imagefolder", data_files={"train": ["path/to/file1", "path/to/file2"], "test": ["path/to/file3", "path/to/file4"]})

So I assume it’s similar for AudioFolder.

lewtun · September 2, 2022, 10:09am

Thanks @nielsr !

Your suggestion works provided I don’t include metadata.csv in the data_files dictionary, e.g. the following works:

ds = load_dataset("audiofolder", data_files={"train": ["train/01.ogg"], "test": ["test/01.ogg"]})

but the following fails (with the same error as above):

ds = load_dataset("audiofolder", data_files={"train": ["train/metadata.csv", "train/01.ogg"], "test": ["test/metadata.csv", "test/01.ogg"]})

I think the key issue here is that I want to load the audio files with metadata and audiofolder seems to only support loading a single folder at a time.

polinaeterna · September 2, 2022, 7:11pm

Hi @lewtun!
Yes, audiofolder can load multiple splits as a DatasetDict object; the splits are inferred from the directory structure or file names.
The problem here is that if more than one metadata file is found and they contain different sets of features, you get an error with both the data_dir and the data_files approaches. This is intentional behavior. You can check that loading works without metadata by setting drop_metadata=True, which gives you two splits in a DatasetDict object.
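To see which columns differ before calling load_dataset, the metadata headers can be compared directly. The following is a minimal sketch using only the standard library; the helper names metadata_columns and column_differences are hypothetical, and it assumes one metadata.csv per split folder directly under the root, as in the layout above.

```python
import csv
from pathlib import Path


def metadata_columns(metadata_path):
    """Return the header row of a metadata.csv file as a list of column names."""
    with open(metadata_path, newline="", encoding="utf-8") as f:
        return next(csv.reader(f))


def column_differences(root):
    """Map each split folder name to the columns its metadata.csv lacks.

    "Lacks" is measured against the union of columns found in all
    <root>/<split>/metadata.csv files (hypothetical helper, not part of datasets).
    """
    root = Path(root)
    headers = {
        path.parent.name: metadata_columns(path)
        for path in sorted(root.glob("*/metadata.csv"))
    }
    all_columns = set().union(*headers.values())
    return {
        split: sorted(all_columns - set(columns))
        for split, columns in headers.items()
    }
```

Calling column_differences("path/to/dir") returns a dict keyed by split folder name; a non-empty list for a split means its metadata.csv is missing those columns.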

For now, the only option for loading two splits with different sets of features with AudioFolder/ImageFolder is to load them as separate datasets:

ds_train = load_dataset("audiofolder", data_dir="path/to/dir/train")
ds_test = load_dataset("audiofolder", data_files={"test": "path/to/dir/test/**"})

*You need to use data_files if you want to give a split a specific name. By default, if no split structure is found, all the data goes to the “train” split. The relevant code is here: datasets/data_files.py at main · huggingface/datasets · GitHub

lhoestq · September 5, 2022, 8:57am

All the splits of a dataset must have the same features.

If you want, you can add the missing features to your test set and set them to None.
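A rough sketch of that idea, operating on the metadata.csv files rather than on loaded datasets: it rewrites the test metadata so that it has every column of the train metadata, leaving the new cells empty. The helper name add_missing_columns is hypothetical. Whether audiofolder then infers the same dtypes for the all-empty columns (for example string for genre) is not confirmed in this thread, so treat that as an open assumption.

```python
import csv


def add_missing_columns(reference_csv, target_csv):
    """Rewrite target_csv so it has every column of reference_csv.

    Cells for the added columns are left empty. Hypothetical helper,
    not part of the datasets library.
    """
    with open(reference_csv, newline="", encoding="utf-8") as f:
        reference_columns = next(csv.reader(f))
    with open(target_csv, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        target_columns = list(reader.fieldnames or [])
        rows = list(reader)
    # Follow the reference column order, then keep any columns only the target has.
    columns = reference_columns + [
        name for name in target_columns if name not in reference_columns
    ]
    with open(target_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=columns, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return columns
```

For the layout in this thread, the call would be add_missing_columns("path/to/dir/train/metadata.csv", "path/to/dir/test/metadata.csv"). The target file is overwritten in place, so keep a backup if the original matters.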

doof-ferb · February 10, 2024, 11:11am

My workaround is to merge the two metadata.csv files into a single one and put it at the root of the dataset directory.
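A minimal sketch of that merge, under stated assumptions: the helper name merge_metadata is hypothetical, each split folder holds its own metadata.csv with a file_name column, and file_name is assumed to be resolved relative to the folder containing the metadata file, so each entry is prefixed with its split folder once the file sits at the root.

```python
import csv
from pathlib import Path


def merge_metadata(root, splits=("train", "test")):
    """Combine <root>/<split>/metadata.csv files into one <root>/metadata.csv.

    Hypothetical helper, not part of the datasets library.
    """
    root = Path(root)
    columns, rows = [], []
    for split in splits:
        with open(root / split / "metadata.csv", newline="", encoding="utf-8") as f:
            reader = csv.DictReader(f)
            # Collect the union of columns, preserving first-seen order.
            for name in reader.fieldnames or []:
                if name not in columns:
                    columns.append(name)
            for row in reader:
                # Assumption: paths become relative to the root after the merge.
                row["file_name"] = f"{split}/{row['file_name']}"
                rows.append(row)
    merged_path = root / "metadata.csv"
    with open(merged_path, "w", newline="", encoding="utf-8") as f:
        # Columns missing from a split are written as empty cells.
        writer = csv.DictWriter(f, fieldnames=columns, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return merged_path
```

The sketch leaves the per-split metadata.csv files in place; presumably they need to be moved or deleted afterwards so that only the root file is picked up.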
