How to create a train test split for an iterable dataset

Just curious: how do I create a train/test split from a dataset that doesn’t have a length function? I don’t want to download and tokenize the whole dataset before I split it into training and testing.

Hi! I think the only option is to sample the input dataset while iterating over it (e.g., in the training loop) to generate the test split.
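
To make that a bit more concrete, here is a minimal sketch of sampling while iterating, in plain Python rather than any particular library's API. The names `iter_split`, `make_stream`, and `test_fraction` are made up for illustration. It assumes the stream can be re-opened and yields examples in the same order on every pass, so a seeded random generator assigns each example to the same split every time.

```python
import random
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def iter_split(
    make_stream: Callable[[], Iterable[T]],  # hypothetical: re-opens the stream
    split: str,                              # "train" or "test"
    test_fraction: float = 0.1,
    seed: int = 0,
) -> Iterator[T]:
    """Lazily yield only the examples assigned to `split`.

    Nothing is materialized and no length is needed: each example is
    assigned to a split as it streams past.
    """
    if split not in ("train", "test"):
        raise ValueError("split must be 'train' or 'test'")
    # Same seed + same stream order -> same assignment on every pass.
    rng = random.Random(seed)
    for example in make_stream():
        is_test = rng.random() < test_fraction
        if is_test == (split == "test"):
            yield example


# Usage with a stand-in stream (assumption: your real stream is re-iterable).
def make_stream() -> Iterable[int]:
    return iter(range(1000))


train_examples = iter_split(make_stream, "train")
test_examples = iter_split(make_stream, "test")
```

The two iterators are disjoint and together cover the whole stream, but the test split ends up being roughly `test_fraction` of the data rather than an exact count. If the stream is shuffled differently on each pass, this assignment is no longer stable, and you would need to key the decision on something intrinsic to each example (such as a hash of an ID) instead.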
