lewtun
September 2, 2022, 9:49am
1
Hi folks,
I'm using the new audiofolder feature (docs) to load audio files from this Kaggle dataset. My folder structure has the following form:
.
├── test
│   ├── 01.ogg
│   ├── 02.ogg
│   ├── ...
│   └── metadata.csv
└── train
    ├── 01.ogg
    ├── 02.ogg
    ├── ...
    └── metadata.csv
In particular, the train and test splits have different features. Now, when I try to load both splits at once with:
from datasets import load_dataset
ds = load_dataset("audiofolder", data_dir="path/to/dir")
I get the following error:
ValueError: Metadata files /Users/lewtun/Downloads/kaggle-pog-series-s01e02/test/metadata.csv and /Users/lewtun/Downloads/kaggle-pog-series-s01e02/train/metadata.csv have different features: ('/Users/lewtun/Downloads/kaggle-pog-series-s01e02/train/metadata.csv', {'song_id': Value(dtype='int64', id=None), 'file_name': Value(dtype='string', id=None), 'filepath': Value(dtype='string', id=None), 'genre_id': Value(dtype='int64', id=None), 'genre': Value(dtype='string', id=None)}) != {'song_id': Value(dtype='int64', id=None), 'file_name': Value(dtype='string', id=None), 'filepath': Value(dtype='string', id=None)}
This suggests that audiofolder is trying to combine the train and test splits as a single dataset - is this correct?
I am able to load each split separately, but my question is: can audiofolder load multiple splits as a DatasetDict object?
nielsr
September 2, 2022, 9:56am
2
For ImageFolder, it works as follows:
dataset = load_dataset("imagefolder", data_files={"train": ["path/to/file1", "path/to/file2"], "test": ["path/to/file3", "path/to/file4"]})
So I assume it's similar for AudioFolder.
lewtun
September 2, 2022, 10:09am
3
Thanks @nielsr!
Your suggestion works provided I don't include metadata.csv in the data_files dictionary, e.g. the following works:
ds = load_dataset("audiofolder", data_files={"train": ["train/01.ogg"], "test": ["test/01.ogg"]})
but the following fails (with the same error as above):
ds = load_dataset("audiofolder", data_files={"train": ["train/metadata.csv", "train/01.ogg"], "test": ["test/metadata.csv", "test/01.ogg"]})
I think the key issue here is that I want to load the audio files with metadata, and audiofolder seems to only support loading a single folder at a time.
Hi @lewtun!
Yes, audiofolder can load multiple splits as a DatasetDict object; they will be inferred from the directory structure or file names.
The problem here is that if more than one metadata file is found and they contain different sets of features, you get an error with both the data_dir and the data_files approach. This is intentional behavior. In this case, you can check that it works without metadata by setting drop_metadata=True: you will get two splits in a DatasetDict object.
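A minimal sketch of that drop_metadata check. The helper name load_without_metadata is hypothetical, and the path passed to it is a placeholder; the sketch assumes the datasets library is installed and that data_dir contains the train and test folders shown earlier in the thread.

```python
def load_without_metadata(data_dir):
    """Load every split found under data_dir, ignoring any metadata.csv files."""
    # Imported lazily so that defining this helper does not require the library.
    from datasets import load_dataset

    # drop_metadata=True tells audiofolder to skip the metadata files,
    # so the differing columns in train/ and test/ no longer clash.
    return load_dataset("audiofolder", data_dir=data_dir, drop_metadata=True)
```

Calling load_without_metadata("path/to/dir") should then return both splits, but only with the audio column, so it is a diagnostic rather than a way to keep the metadata.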
For now, the only option to load two splits with different sets of features with Audio/ImageFolder is to load them as separate datasets:
ds_train = load_dataset("audiofolder", data_dir="path/to/dir/train")
ds_test = load_dataset("audiofolder", data_files={"test": "path/to/dir/test/**"})
*You need to use data_files if you want to give a specific name to a split. By default, if no split structure is found, all the data goes to the "train" split. The relevant code is here: datasets/data_files.py at main · huggingface/datasets · GitHub
lhoestq
September 5, 2022, 8:57am
5
All the splits of a dataset must have the same features.
In your test set, you can add the missing features and set them to None if you want.
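One way to add the missing columns is to rewrite the test metadata.csv with the train header. This is only a sketch using the standard csv module: the helper name add_missing_columns and the output path are hypothetical, and it assumes the test columns are a subset of the train columns. It leaves the new cells empty; whether the loader then infers the same dtypes for those all-empty columns as for the train split is not something this sketch verifies.

```python
import csv
from pathlib import Path


def add_missing_columns(train_csv, test_csv, out_csv):
    """Rewrite the test metadata so it has the same columns as the train metadata."""
    with open(train_csv, newline="") as f:
        train_columns = csv.DictReader(f).fieldnames or []
    with open(test_csv, newline="") as f:
        rows = list(csv.DictReader(f))
    with open(out_csv, "w", newline="") as f:
        # restval="" leaves the cells of the missing columns empty.
        writer = csv.DictWriter(f, fieldnames=train_columns, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return train_columns
```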
My workaround is to merge the two metadata.csv files into a single one and put it at the root.
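A rough sketch of that merge, using only the standard library. The helper name merge_metadata is hypothetical, and the sketch assumes that file_name entries in the merged file should be written relative to the root (for example train/01.ogg). Columns that a split lacks are left empty, and the original per-split metadata.csv files are not touched, so they would presumably still need to be removed or moved before loading.

```python
import csv
from pathlib import Path


def merge_metadata(root, splits=("train", "test")):
    """Merge <root>/<split>/metadata.csv files into one <root>/metadata.csv."""
    root = Path(root)
    columns = []  # union of all columns, in first-seen order
    rows = []
    for split in splits:
        with open(root / split / "metadata.csv", newline="") as f:
            reader = csv.DictReader(f)
            for name in reader.fieldnames or []:
                if name not in columns:
                    columns.append(name)
            for row in reader:
                # Make file_name relative to the merged file at the root.
                row["file_name"] = f"{split}/{row['file_name']}"
                rows.append(row)
    out_path = root / "metadata.csv"
    with open(out_path, "w", newline="") as f:
        # restval="" fills columns that a split does not have.
        writer = csv.DictWriter(f, fieldnames=columns, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return out_path
```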