`train_test_split` with IterableDataset

zpn · January 17, 2023, 3:00am

Hi all,

Is it possible to use or add a feature to IterableDatasets to have a train_test_split, similar to the feature here?

Currently, if no train/test split is specified for a dataset (especially a large one), I have to update the dataset script manually and hack it so that the first X% of the files are used as train and the rest as test.

I’m guessing that if it’s not currently possible, it might be tricky to implement?

lhoestq · January 23, 2023, 1:48pm

Yes, it’s not implemented right now, but it should be possible to implement a train_test_split over the dataset shards. Contributions are welcome if you’re interested in helping with this.

For now, I’d suggest defining two separate datasets: one with the train data files and one with the test data files.
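A minimal sketch of that suggestion, assuming the data is already stored as several files (shards). The helper names `split_shards` and `load_streaming_splits`, the `"json"` builder, and the file names in the usage note are hypothetical choices for illustration, not part of the library. Only `load_dataset` with `data_files` and `streaming=True` is the actual `datasets` API.

```python
import random


def split_shards(files, test_fraction=0.1, seed=0):
    """Assign whole files (shards) to train or test, reproducibly."""
    files = sorted(files)  # fixed starting order so the seed gives the same split everywhere
    if len(files) < 2:
        raise ValueError("need at least two files to split by shard")
    random.Random(seed).shuffle(files)
    # Keep at least one file on each side of the split.
    n_test = min(max(1, round(len(files) * test_fraction)), len(files) - 1)
    return {"train": files[n_test:], "test": files[:n_test]}


def load_streaming_splits(files, builder="json", **split_kwargs):
    """Load both splits as streaming datasets (assumes `datasets` is installed)."""
    from datasets import load_dataset  # imported lazily; only needed here

    data_files = split_shards(files, **split_kwargs)
    # With streaming=True and no `split` argument, this returns one
    # iterable dataset per key in `data_files`.
    return load_dataset(builder, data_files=data_files, streaming=True)
```

For example, `load_streaming_splits(["part-00.jsonl", "part-01.jsonl", "part-02.jsonl"], test_fraction=0.34)` would stream two of the files as `train` and one as `test`. The split is at file granularity, so the actual ratio of examples depends on how evenly the shards are sized.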

zpn · January 26, 2023, 4:40am

Makes sense! Currently, the dataset I’ve been using is only one file, but I didn’t want to do 5+ minutes of processing ahead of time, especially if I had to do it on every machine. I ended up adding a dataset loading script to randomly select columns on the fly (a little hacky, but it works for now).
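For the single-file case, one alternative to a custom loading script (not necessarily what was done above) is to split on the fly by example index. This is a rough sketch: `in_test_split` and `streaming_train_test` are hypothetical helpers, and it assumes `dataset` is a streaming `IterableDataset` whose `filter` accepts `with_indices=True`.

```python
def in_test_split(index, test_every=10):
    """Send every `test_every`-th example to test (about 1/test_every of the data)."""
    return index % test_every == 0


def streaming_train_test(dataset, test_every=10):
    """Return (train, test) views over the same streaming dataset.

    Nothing is preprocessed ahead of time; each view filters lazily
    while it is being iterated.
    """
    train = dataset.filter(
        lambda example, idx: not in_test_split(idx, test_every),
        with_indices=True,
    )
    test = dataset.filter(
        lambda example, idx: in_test_split(idx, test_every),
        with_indices=True,
    )
    return train, test
```

Each view still reads the whole file when iterated, so this saves the up-front processing step but not the I/O. The split is only consistent if both views see the examples in the same order, so any shuffling should happen after the split rather than before it.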
