Loading multiple custom splits using `load_dataset('audiofolder', data_dir=/some/path)`

Jaksen93 · November 11, 2023, 6:13pm

When I load a folder structure containing multiple test sets (test1, test2) and a train set like the below, using ds = load_dataset("audiofolder", data_dir="/path/to/directory"), the resulting DatasetDict has a single test set that is the combined test1 and test2 sets.

How do I get both sets back separately?

The folder structure:

.
├── test1
│   ├── 01.mp3
│   ├── 02.mp3
│   ├── ...
│   └── metadata.csv
├── test2
│   ├── 01.mp3
│   ├── 02.mp3
│   ├── ...
│   └── metadata.csv
└── train
    ├── 01.mp3
    ├── 02.mp3
    ├── ...
    └── metadata.csv

The resulting DatasetDict with a single “merged” test set:

>>> load_dataset("audiofolder", data_dir="/path/to/directory")
DatasetDict({
    train: Dataset({
        features: ['audio', 'case_id', 'segment_id', 'created_at', 'submitted_at', 'text', 'start_audio_s', 'end_audio_s', 'speaker_label', 'location_label'],
        num_rows: 9984
    })
    test: Dataset({
        features: ['audio', 'case_id', 'segment_id', 'created_at', 'submitted_at', 'text', 'start_audio_s', 'end_audio_s', 'speaker_label', 'location_label'],
        num_rows: 742
    })
})

panigrah · November 12, 2023, 8:22am

That may not be possible with data_dir. The load_dataset method automatically detects only a fixed set of split keywords, so it merges folder names like test1 and test2 into a single test split.
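
To illustrate the merging, here is a rough sketch of how such keyword detection could behave. It is a simplified stand-in and not the library's actual code: the keyword lists, the separator set and the `guess_split` helper are all assumptions made for the example.

```python
import re

# Assumption: illustrative keyword lists, not copied from the library.
SPLIT_KEYWORDS = {
    "train": ["train", "training"],
    "validation": ["validation", "valid", "dev", "val"],
    "test": ["test", "testing", "eval", "evaluation"],
}


def guess_split(folder_name):
    """Return the standard split a folder name would be mapped to, or None."""
    for split, keywords in SPLIT_KEYWORDS.items():
        for keyword in keywords:
            # A keyword may be followed by separators or digits (assumed set),
            # which is why "test1" and "test2" both collapse into "test".
            if re.fullmatch(rf"{keyword}[-._ 0-9]*", folder_name):
                return split
    return None


for name in ["train", "test1", "test2", "holdout"]:
    print(name, "->", guess_split(name))
```

Under these assumptions, both test1 and test2 resolve to test, while a name such as holdout matches no keyword at all.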


If you only have three categories, consider renaming the folders to test, validation and train. Each of those is treated as a separate split by the loader. Otherwise, fall back on something like

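import os
from datasets import load_dataset
path_to_dir = '/path/to/directory'
# One split per subfolder; the glob covers the audio files and metadata.csv
data_files = {folder: f'{path_to_dir}/{folder}/*' for folder in os.listdir(path_to_dir)}
load_dataset("audiofolder", data_files=data_files)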

Jaksen93 · November 12, 2023, 11:52am

Ah, beautiful, thank you!

Parenthetically, it seems pretty straightforward to generalise the “detection” of subfolder splits to arbitrary names, does it not? The Trainer is told explicitly which sets are for training and evaluation (and supports multiple evaluation sets via a DatasetDict for evaluation), so that seems compatible with such a change.

lhoestq · November 13, 2023, 11:21am

@panigrah correct, and you can also pass glob patterns:

load_dataset("audiofolder", data_files={
    "train": "path/to/train/*",
    "test1": "path/to/test1/*",
    "test2": "path/to/test2/*",
})
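
If you would rather not list every split by hand, the same mapping can be built from the folder names. This is a minimal sketch, assuming every subfolder of the root should become its own split; `build_data_files` and `load_all_splits` are hypothetical helper names, not part of the library.

```python
from pathlib import Path


def build_data_files(root):
    """Map every subfolder of `root` to a glob pattern covering its files."""
    root = Path(root)
    return {
        sub.name: str(sub / "*")
        for sub in sorted(root.iterdir())
        if sub.is_dir()  # skip stray files sitting next to the split folders
    }


def load_all_splits(root):
    # Imported lazily so build_data_files works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("audiofolder", data_files=build_data_files(root))
```

Calling load_all_splits("/path/to/directory") on the folder structure from the first post should then give a DatasetDict with separate train, test1 and test2 entries.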

panigrah · November 13, 2023, 12:01pm

Yes indeed. My guess is that it was designed for the specific purpose of supporting only the three “standard” splits, as that may be how it’s used most frequently.
