Specifying K-fold splits in a dataset

Hi,

We have a dataset where our main evaluation metrics are reported via k-fold cross-validation plus a small, fixed holdout set. For example, 5-10% of the data is hand-selected as a “gold standard” test set, and we do a 5-fold split of the remaining ~90%, since the dataset isn’t that large and we want to train on as many images as possible. What is the best/canonical way to share this sort of split using datasets? We’d like to distribute the exact splits we used, because they’re stratified on certain attributes of the samples and people should be able to replicate our results (while still having the option to define their own splits).
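(For context, our folds are generated roughly like the sketch below with scikit-learn's StratifiedKFold; the filenames and the attribute column are placeholders, not our real schema.)

```python
from sklearn.model_selection import StratifiedKFold

# Placeholder data: one entry per image, plus the attribute we stratify on.
filenames = [f"img_{i:04d}.png" for i in range(1000)]
stratify_attr = [i % 4 for i in range(1000)]  # e.g. a categorical sample attribute

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
folds = []
for fold_id, (train_idx, val_idx) in enumerate(skf.split(filenames, stratify_attr)):
    folds.append({
        "fold": fold_id,
        "train_indices": train_idx.tolist(),
        "val_indices": val_idx.tolist(),
    })
```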

Ideally, we'd like a separation between e.g. labels (which are large) and filename lists (which are small) without having to write a custom loading script. It seems like custom loaders are semi-deprecated due to the risk of malicious code execution (users have to opt in to run them), so a purely config-based setup would be best.

The brute-force solution would seem to be a bunch of label metadata files (train_kfold_x.jsonl and so on), but then we have to duplicate the annotation files 10 or more times.

Perhaps the more general question is: how can one specify that a sample belongs to multiple splits without duplicating its annotation metadata?

It seems like .select() might be a good way to do this, if we provide lists of indices for each split (rough sketch below)? And if so, what's the best way to distribute those indices with the dataset?
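Something like this is what I have in mind ("user/dataset" and the index list are just placeholders):

```python
from datasets import load_dataset

dataset = load_dataset("user/dataset", split="train")  # the full annotated data
fold_indices = [0, 3, 7, 12]                           # hypothetical row positions for one fold
fold_train = dataset.select(fold_indices)              # materialize that fold's training subset
```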

Thanks!

You can indeed use .select() with the train/validation set indices :slight_smile:

You can define one configuration of your dataset that contains the data:

```yaml
configs:
- config_name: default
  data_files: train.jsonl
```

and one config for the indices (it goes under the same configs: list):

```yaml
- config_name: "kfold_indices"
  data_files: indices.jsonl
```

This YAML configuration can be placed in the YAML header at the top of the README.md.

See the documentation on Data Files Configuration here: Manual Configuration
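Putting the two configs together, loading and reconstructing the folds could then look roughly like this (a sketch, assuming "user/dataset" as the repo id and that each row of indices.jsonl holds a fold id plus its train/validation index lists; adapt the field names to whatever you actually store):

```python
from datasets import load_dataset

# Full annotated data (the "default" config defined above).
data = load_dataset("user/dataset", "default", split="train")

# One row per fold, e.g. {"fold": 0, "train_indices": [...], "val_indices": [...]}
# -- this schema is an assumption, not a fixed convention.
kfold = load_dataset("user/dataset", "kfold_indices", split="train")

folds = []
for row in kfold:
    fold_train = data.select(row["train_indices"])  # this fold's training subset
    fold_val = data.select(row["val_indices"])      # this fold's validation subset
    folds.append((fold_train, fold_val))
```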