Hi,
We have a dataset whose main evaluation metrics are reported via k-fold cross-validation plus a small, fixed holdout set. For example, 5-10% of the data is hand-selected as a "gold standard" for testing, and we do a 5-fold split of the remaining ~90% (the dataset isn't that large, so this lets us train on more images). What is the best/canonical way to share this sort of split using `datasets`? We'd like to distribute the exact splits we used, because they're stratified on certain attributes of the samples and people should be able to replicate our results (while still having the option to define their own splits).
Ideally, we'd like a separation between e.g. the labels (which are large) and the filename lists (which are small) without having to write a custom loading script. Custom loaders seem to be semi-deprecated due to the risk of malicious code execution (users must opt in to run them), so a purely config-based setup would be best.
The brute-force solution would seem to be a bunch of label metadata files (`train_kfold_x.jsonl`, etc.), but then we'd have to duplicate the annotation files ten or more times.
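For reference, a purely config-based version of that brute-force layout would be the Hub's manual-configuration YAML, where each fold is a named config in the dataset card. A sketch, with made-up config names and file paths:

```yaml
# In the dataset's README.md front matter (names/paths are illustrative).
configs:
  - config_name: fold_0
    data_files:
      - split: train
        path: folds/train_kfold_0.jsonl
      - split: validation
        path: folds/val_kfold_0.jsonl
  - config_name: fold_1
    data_files:
      - split: train
        path: folds/train_kfold_1.jsonl
      - split: validation
        path: folds/val_kfold_1.jsonl
  - config_name: gold
    data_files:
      - split: test
        path: gold/test.jsonl
```

This avoids a loading script, but it still has the duplication problem described above: each per-fold file repeats its samples' annotations.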
Perhaps the more generic question is: "How can one specify that a sample belongs to multiple splits, without duplicating the annotation metadata?"
It seems like the `select` function might be a good fit here, if we provide a list of indices for each split? If so, what's the best way to distribute those index lists with the dataset?
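To illustrate the index-list idea: the fold membership could live in one small JSON file shipped next to the single annotation file, so annotations are never duplicated. A minimal sketch using only the standard library; `stratified_folds` and the file layout are my own invention, and the final comment assumes the `datasets.Dataset.select(indices)` API:

```python
import json
from collections import defaultdict

def stratified_folds(attributes, k=5):
    """Assign each sample index to one of k folds, round-robin within
    each attribute value so every fold keeps the attribute distribution."""
    by_attr = defaultdict(list)
    for idx, attr in enumerate(attributes):
        by_attr[attr].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_attr.values():
        for i, idx in enumerate(indices):
            folds[i % k].append(idx)
    return [sorted(fold) for fold in folds]

# Toy example: 10 samples with a binary attribute to stratify on.
attrs = ["a", "b"] * 5
folds = stratified_folds(attrs, k=5)

# Ship this small file alongside the one annotations file.
payload = {f"fold_{i}": fold for i, fold in enumerate(folds)}
print(json.dumps(payload))

# A consumer would then load the annotations once and slice views of them, e.g.:
#   ds = load_dataset("json", data_files="annotations.jsonl", split="train")
#   val = ds.select(payload["fold_0"])
#   train = ds.select([i for name, f in payload.items() if name != "fold_0" for i in f])
```

Since `select` only stores an indices mapping rather than copying rows, each sample's annotation exists exactly once even when it appears in several folds' training sets.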
Thanks!