Loading train and test splits with `audiofolder`

Hi folks,

I’m using the new audiofolder feature (docs) to load audio files from this Kaggle dataset. My folder structure has the following form:

.
├── test
│   ├── 01.ogg
│   ├── 02.ogg
│   ├── ...
│   └── metadata.csv
└── train
    ├── 01.ogg
    ├── 02.ogg
    ├── ...
    └── metadata.csv

In particular, the train and test splits have different features. Now, when I try to load both splits at once with:

from datasets import load_dataset

ds = load_dataset("audiofolder", data_dir="path/to/dir")

I get the following error:

ValueError: Metadata files /Users/lewtun/Downloads/kaggle-pog-series-s01e02/test/metadata.csv and /Users/lewtun/Downloads/kaggle-pog-series-s01e02/train/metadata.csv have different features: ('/Users/lewtun/Downloads/kaggle-pog-series-s01e02/train/metadata.csv', {'song_id': Value(dtype='int64', id=None), 'file_name': Value(dtype='string', id=None), 'filepath': Value(dtype='string', id=None), 'genre_id': Value(dtype='int64', id=None), 'genre': Value(dtype='string', id=None)}) != {'song_id': Value(dtype='int64', id=None), 'file_name': Value(dtype='string', id=None), 'filepath': Value(dtype='string', id=None)}

This suggests that audiofolder is trying to combine the train and test splits as a single dataset - is this correct?

I am able to load each split separately, but my question is: can audiofolder load multiple splits as a DatasetDict object?

For ImageFolder, it works as follows:

dataset = load_dataset("imagefolder", data_files={"train": ["path/to/file1", "path/to/file2"], "test": ["path/to/file3", "path/to/file4"]})

So I assume it’s similar for AudioFolder.


Thanks @nielsr !

Your suggestion works provided I don’t include metadata.csv in the data_files dictionary, e.g. the following works:

ds = load_dataset("audiofolder", data_files={"train": ["train/01.ogg"], "test": ["test/01.ogg"]})

but the following fails (with the same error as above):

ds = load_dataset("audiofolder", data_files={"train": ["train/metadata.csv", "train/01.ogg"], "test": ["test/metadata.csv", "test/01.ogg"]})

I think the key issue here is that I want to load the audio files with metadata and audiofolder seems to only support loading a single folder at a time.

Hi @lewtun!
Yes, audiofolder can load multiple splits as a DatasetDict object. The splits are inferred from the directory structure or file names.
The problem here is that if more than one metadata file is found and the files contain different sets of features, you get an error with both the data_dir and the data_files approach. This is intentional behavior. To check that loading works without metadata, set drop_metadata=True: you will then get two splits in a DatasetDict object.
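To see which columns differ between the splits before calling load_dataset, something like the following rough sketch might help. It only uses the standard library and is not part of datasets; the helper names (metadata_columns, missing_metadata_columns) are made up for illustration, and it assumes each split folder contains a metadata.csv with a header row.

```python
import csv
from pathlib import Path


def metadata_columns(csv_path):
    """Return the header row of a metadata.csv file as a list of column names."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return next(csv.reader(f), [])


def missing_metadata_columns(root, splits=("train", "test")):
    """Map each split to the columns it lacks relative to the union of all splits.

    Hypothetical helper: assumes the layout <root>/<split>/metadata.csv.
    """
    columns = {s: metadata_columns(Path(root) / s / "metadata.csv") for s in splits}

    # Build the union of column names, preserving first-seen order.
    union = []
    for cols in columns.values():
        for name in cols:
            if name not in union:
                union.append(name)

    return {s: [name for name in union if name not in cols] for s, cols in columns.items()}
```

Calling missing_metadata_columns("path/to/dir") returns a dict keyed by split name. A non-empty list for any split means the metadata files do not share the same columns, which is the situation that triggers the ValueError above.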

For now, the only way to load two splits with different sets of features with Audio/ImageFolder is to load them as separate datasets:

ds_train = load_dataset("audiofolder", data_dir="path/to/dir/train")
ds_test = load_dataset("audiofolder", data_files={"test": "path/to/dir/test/**"})

*You need to use data_files if you want to give a split a specific name. By default, if no split structure is found, all the data goes to the “train” split. The relevant code is here: datasets/data_files.py at main · huggingface/datasets · GitHub


All the splits of a dataset must have the same features.

If you want, you can add the missing features (here genre_id and genre) to your test set and set them to None.
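One way to do this at the file level is to pad the test metadata.csv with the columns it lacks, leaving the cells empty. The sketch below is a standard-library illustration only: add_missing_columns is a made-up helper, and it assumes that an empty CSV cell is an acceptable stand-in for None. I have not verified which dtype audiofolder infers for a column that is entirely empty, so the features may still need to be checked after loading.

```python
import csv


def add_missing_columns(reference_csv, target_csv, output_csv):
    """Copy target_csv to output_csv, adding any columns found only in reference_csv.

    Added columns are left empty in every row. Returns the list of added column names.
    Hypothetical helper, not part of the datasets library.
    """
    # Read only the header of the reference file (e.g. train/metadata.csv).
    with open(reference_csv, newline="", encoding="utf-8") as f:
        reference_cols = next(csv.reader(f), [])

    # Read every row of the target file (e.g. test/metadata.csv).
    with open(target_csv, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        rows = list(reader)
        target_cols = list(reader.fieldnames or [])

    missing = [name for name in reference_cols if name not in target_cols]

    # restval fills the columns that the original rows do not contain.
    with open(output_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=target_cols + missing, restval="")
        writer.writeheader()
        writer.writerows(rows)

    return missing
```

For the layout in this thread, the call would look like add_missing_columns("train/metadata.csv", "test/metadata.csv", "test/metadata_padded.csv"), where the output file name is just an example. The padded file would then have to replace test/metadata.csv before loading.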
