Three-way Random Split

simonschoe · March 19, 2021, 7:18am

Hi there,
I am wondering what the most elegant way currently is to perform a three-way random split (into train, validation and test sets). Let’s assume I call load_dataset and get:

Dataset({
    features: ['text'],
    num_rows: 19122
})

Subsequently, I’d like to perform the split. Currently I call dataset.train_test_split() twice and then recombine the three resulting datasets into one DatasetDict. However, I assume this is not the most elegant approach, right? I also experimented with ReadInstruction, but with it I could only split the data deterministically rather than randomly.
Has anyone got a better solution?

sgugger · March 19, 2021, 12:54pm

cc @lhoestq

lhoestq · March 19, 2021, 1:39pm

We plan to add a way to define additional splits, beyond just train and test, in train_test_split.
For now you’d have to use it twice as you mentioned (or use a combination of Dataset.shuffle and Dataset.shard/select).
See the issue about extending train_test_split here
