Loading multiple custom splits using `load_dataset('audiofolder', data_dir='/some/path')`

When I load a folder structure containing multiple test sets (test1, test2) and a train set, as shown below, using ds = load_dataset("audiofolder", data_dir="/path/to/directory"), the resulting DatasetDict has a single test set that combines the test1 and test2 sets.

How do I get both sets back separately?

The folder structure:

.
├── test1
│   ├── 01.mp3
│   ├── 02.mp3
│   ├── ...
│   └── metadata.csv
├── test2
│   ├── 01.mp3
│   ├── 02.mp3
│   ├── ...
│   └── metadata.csv
└── train
    ├── 01.mp3
    ├── 02.mp3
    ├── ...
    └── metadata.csv

The resulting DatasetDict with a single “merged” test set:

>>> load_dataset("audiofolder", data_dir="/path/to/directory")
DatasetDict({
    train: Dataset({
        features: ['audio', 'case_id', 'segment_id', 'created_at', 'submitted_at', 'text', 'start_audio_s', 'end_audio_s', 'speaker_label', 'location_label'],
        num_rows: 9984
    })
    test: Dataset({
        features: ['audio', 'case_id', 'segment_id', 'created_at', 'submitted_at', 'text', 'start_audio_s', 'end_audio_s', 'speaker_label', 'location_label'],
        num_rows: 742
    })
})

This may not be possible. The load_dataset method only auto-detects a fixed set of split keywords (train, test, validation). Folder names like test1 and test2 both match the test keyword, so they get merged into a single test split.
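The merging behaviour can be illustrated with a small sketch. This is a hedged re-implementation of the idea, not the datasets library's actual code: split keywords are matched against folder names, with digits and a few punctuation characters treated as separators, which is why test1 and test2 both resolve to test.

```python
import re

# Hypothetical keyword table for illustration only; the real library
# recognises more aliases (e.g. "valid", "dev", "eval").
SPLIT_KEYWORDS = {"train": "train", "test": "test", "validation": "validation"}
SEPARATORS = r"-._ 0-9"  # digits count as separators, so "test1" matches "test"

def detect_split(dirname):
    """Return the canonical split a folder name maps to, or None."""
    for split, keyword in SPLIT_KEYWORDS.items():
        if re.fullmatch(rf"{keyword}[{SEPARATORS}]*", dirname):
            return split
    return None

for d in ("train", "test1", "test2", "custom_eval"):
    print(d, "->", detect_split(d))
```

Under this matching rule, an arbitrary name like custom_eval maps to no split at all, which is why it can't appear as its own key in the DatasetDict.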


If you only have three categories, consider renaming the folders to train, test and validation; each of those is treated as a separate split by load_dataset. Otherwise, fall back on something like:

import os
from datasets import load_dataset

path_to_dir = '/path/to/directory'
load_dataset("audiofolder", data_files={
    folder: f'{path_to_dir}/{folder}/*'
    for folder in os.listdir(path_to_dir)
})

Ah, beautiful, thank you!

Parenthetically, it seems pretty straightforward to generalise the “detection” of subfolder splits to arbitrary names, does it not? And the Trainer is told explicitly which sets are for training and evaluation (and it supports multiple evaluation sets via a dict of datasets), so that seems compatible with such a change.

@panigrah correct, and you can also pass glob patterns:

load_dataset("audiofolder", data_files={
    "train": "path/to/train/*",
    "test1": "path/to/test1/*",
    "test2": "path/to/test2/*",
})

Yes indeed. My guess is it was designed for the specific purpose of supporting only the three “standard” splits, as that may be how it’s used most frequently.