`train_test_split` with IterableDataset

Hi all,

Is it possible to use, or add, a train_test_split feature for IterableDatasets, similar to the feature here?

Currently, if no train/test split is defined for a dataset (especially a large one), I have to edit the dataset script manually and hack it so that the first X% of the files are used as train and the rest as test.

I’m guessing that if it’s not currently possible, it might be tricky to implement?


Yes, it’s not implemented right now, but it should be possible to implement a train_test_split over the dataset shards. Contributions are welcome if you’re interested in helping with this :slight_smile:

For now I’d suggest defining two separate datasets, one with the train data files and one with the test data files.
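As a rough sketch of that file-level approach (not an official API), something like the following could divide a list of shards into train and test files. The helper name `split_data_files`, its parameters, and the shard paths are all hypothetical and only for illustration.

```python
import random


def split_data_files(data_files, test_size=0.2, seed=0):
    """Partition a list of shard paths into (train_files, test_files).

    Hypothetical helper: the split happens at the file level, so no
    example is ever read here.
    """
    if len(data_files) < 2:
        raise ValueError("need at least two files to split at the file level")
    files = sorted(data_files)            # fixed starting order
    random.Random(seed).shuffle(files)    # seeded shuffle, repeatable across runs
    n_test = max(1, round(len(files) * test_size))
    n_test = min(n_test, len(files) - 1)  # always keep at least one train file
    return files[n_test:], files[:n_test]


# Made-up shard names, for illustration only.
shards = [f"data/part-{i:05d}.jsonl" for i in range(10)]
train_files, test_files = split_data_files(shards, test_size=0.2)
```

The two lists could then be passed as `data_files={"train": train_files, "test": test_files}` to `load_dataset(..., streaming=True)`, which gives one iterable dataset per split. Note this only works when the data is spread over more than one file.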

Makes sense! Currently, the dataset I’ve been using is only one file, but I didn’t want to do 5+ minutes of processing ahead of time, especially if I had to do it on every machine. I ended up adding a dataset loading script to randomly select columns on the fly (a little hacky, but it works for now)
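For anyone with a similar single-file setup, here is one possible way to assign examples to a split on the fly without preprocessing. This is not the script mentioned above, just a sketch that assumes the selection is done per example by hashing its position in the stream; the function names and the `test_percent` parameter are made up for illustration.

```python
import hashlib


def is_test_example(key, test_percent=10):
    """Deterministically map a key to the test split for ~test_percent% of keys."""
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket in [0, 100)
    return bucket < test_percent


def stream_split(examples, split, test_percent=10):
    """Lazily yield only the examples that belong to `split` ("train" or "test")."""
    want_test = split == "test"
    for index, example in enumerate(examples):
        if is_test_example(index, test_percent) == want_test:
            yield example


# Toy rows standing in for a streamed file.
rows = [{"id": i} for i in range(1000)]
train_rows = list(stream_split(rows, "train"))
test_rows = list(stream_split(rows, "test"))
```

Because the assignment depends only on the example index, every machine gets the same split as long as the file is read in the same order. The split sizes are only approximately 90/10.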