Specifying K-fold splits in a dataset

Hi,

We have a dataset where our main evaluation metrics are reported via k-fold cross-validation plus a small, fixed holdout set. For example, 5-10% of the data is hand-selected as a “gold standard” test set, and we do a 5-fold split of the remaining ~90%, since the dataset isn’t that large and we want to train on as many images as possible. What is the best/canonical way to share this sort of split using datasets? We’d like to distribute the exact splits we used, because they’re stratified on certain attributes of the samples and people should be able to replicate our results (while still having the option to define their own splits).
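(For context, our folds are generated roughly like the sketch below with scikit-learn's StratifiedKFold; the filenames and the attribute column are placeholders, not our real schema.)

```python
from sklearn.model_selection import StratifiedKFold

# Placeholder data: one entry per image, plus the attribute we stratify on.
filenames = [f"img_{i:04d}.png" for i in range(1000)]
stratify_attr = [i % 4 for i in range(1000)]  # e.g. a categorical sample attribute

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
folds = []
for fold_id, (train_idx, val_idx) in enumerate(skf.split(filenames, stratify_attr)):
    folds.append({
        "fold": fold_id,
        "train_indices": train_idx.tolist(),
        "val_indices": val_idx.tolist(),
    })
```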

Ideally, we'd like a separation between e.g. labels (which are large) and filename lists (which are small) without having to write a custom loading script. It seems like custom loaders are semi-deprecated due to the risk of malicious code execution (users have to opt in to run them), so a purely config-based setup would be best.

The brute-force solution would seem to be a bunch of label metadata files (train_kfold_x.jsonl and so on), but then we have to duplicate the annotation files 10 or more times.

Perhaps the more general question is: how can one specify that a sample belongs to multiple splits without duplicating its annotation metadata?

It seems like .select() might be a good way to do this, if we provide lists of indices for each split (rough sketch below)? And if so, what's the best way to distribute those indices with the dataset?
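Something like this is what I have in mind ("user/dataset" and the index list are just placeholders):

```python
from datasets import load_dataset

dataset = load_dataset("user/dataset", split="train")  # the full annotated data
fold_indices = [0, 3, 7, 12]                           # hypothetical row positions for one fold
fold_train = dataset.select(fold_indices)              # materialize that fold's training subset
```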

Thanks!

You can indeed use .select() with the train/validation set indices :slight_smile:

You can define one configuration of your dataset that contains the data:

```yaml
configs:
- config_name: default
  data_files: train.jsonl
```

and one config for the indices (it goes under the same configs: list):

```yaml
- config_name: "kfold_indices"
  data_files: indices.jsonl
```

This YAML configuration can be placed in the YAML header at the top of the README.md.

See the documentation on Data Files Configuration here: Manual Configuration
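Putting the two configs together, loading and reconstructing the folds could then look roughly like this (a sketch, assuming "user/dataset" as the repo id and that each row of indices.jsonl holds a fold id plus its train/validation index lists; adapt the field names to whatever you actually store):

```python
from datasets import load_dataset

# Full annotated data (the "default" config defined above).
data = load_dataset("user/dataset", "default", split="train")

# One row per fold, e.g. {"fold": 0, "train_indices": [...], "val_indices": [...]}
# -- this schema is an assumption, not a fixed convention.
kfold = load_dataset("user/dataset", "kfold_indices", split="train")

folds = []
for row in kfold:
    fold_train = data.select(row["train_indices"])  # this fold's training subset
    fold_val = data.select(row["val_indices"])      # this fold's validation subset
    folds.append((fold_train, fold_val))
```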