Three-way Random Split

Hi there,
I am wondering what is currently the most elegant way to perform a three-way random split (into train, val and test sets). Let’s assume I call load_dataset and get:

Dataset({
    features: ['text'],
    num_rows: 19122
})

Subsequently, I’d like to perform the split. Currently I am calling dataset.train_test_split() twice and then recombining the three datasets into one using DatasetDict. However, I assume that this is not the most elegant approach, right? I also experimented with ReadInstructions, but I could only split the data deterministically instead of randomly…
Anyone got a better solution? 🙂

cc @lhoestq

We plan to add a way to define additional splits other than just train and test in train_test_split.
For now you’d have to use it twice as you mentioned (or use a combination of Dataset.shuffle and Dataset.shard/select).
See the issue about extending train_test_split here
