Splitting a dataset in the dataset loading script

I’m defining my own dataset. To do this I follow the tutorial in the docs and create a dataset loading script. (see the docs)

But I’m facing an issue: my data is located in a single file, and I would like to split it into train and test subsets.

As far as I understand, that isn’t possible: in the _split_generators() method, since I have a single file, I can assign it to only a single SplitGenerator.

As an alternative, I made a single split in my dataset loading script and tried to call train_test_split() deterministically, but even with a fixed random seed it gives different results every time…

PS: I know I could just split my single file beforehand, but unfortunately I don’t have control over that file…

Hi! You can define two SplitGenerator objects, one for train and one for test, pass the same file to both, and implement the splitting in _generate_examples.

The code skeleton you can use:

def _split_generators(self, dl_manager):
    return [
        datasets.SplitGenerator(name="train", gen_kwargs={"data_file": data_file, "split": "train"}),
        datasets.SplitGenerator(name="test", gen_kwargs={"data_file": data_file, "split": "test"}),
    ]

def _generate_examples(self, data_file, split):
    # split data based on the `split` value
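One way to make the split deterministic (independent of any random seed) is to hash each example’s key and bucket it into train or test. Below is a minimal sketch of what the body of _generate_examples could look like, assuming a plain text file with one example per line; the `assign_split` helper, the `"text"` field, and the `test_fraction` parameter are illustrative choices, not part of the datasets API:

```python
import hashlib

def assign_split(key, test_fraction=0.2):
    """Deterministically assign an example key to "train" or "test".

    Hashing the key gives the same assignment on every run, on every
    machine, with no dependence on a random seed or shuffle order.
    """
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # map the hash to 0..99
    return "test" if bucket < int(test_fraction * 100) else "train"

def generate_examples(data_file, split, test_fraction=0.2):
    """Yield only the examples from `data_file` that belong to `split`."""
    with open(data_file, encoding="utf-8") as f:
        for idx, line in enumerate(f):
            if assign_split(idx, test_fraction) == split:
                yield idx, {"text": line.strip()}
```

Since both SplitGenerator objects receive the same file, each call filters the full file down to its own subset; every example lands in exactly one split, and rerunning the script reproduces the same partition.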