Splitting Dataset in the dataset loading script

Hi! You can define two SplitGenerator objects, one for train and one for test, pass the same file to both of them, and implement the actual splitting in _generate_examples.

Here is a code skeleton you can use:

def _split_generators(self, dl_manager):
    ...
    return [
        datasets.SplitGenerator(name="train", gen_kwargs={"data_file": data_file, "split":"train"}),
        datasets.SplitGenerator(name="test", gen_kwargs={"data_file": data_file, "split":"test"})
    ]

def _generate_examples(self, data_file, split):
    # split data based on the `split` value
    ...
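
To make the last step more concrete, here is one possible way to do the split inside _generate_examples. This is only a sketch under a few assumptions that are not in the question: the file is plain text with one example per line, each example has a single "text" field, and about 20% of the rows should go to test. The helper split_for_key and the test_percent parameter are names made up for illustration. It is written as a standalone function so you can try it without the datasets library; inside a builder you would move the body into _generate_examples(self, data_file, split).

```python
import hashlib


def split_for_key(key, test_percent=20):
    """Deterministically map a row key to "train" or "test" (hypothetical helper)."""
    # Hash the key so the assignment is stable across runs and machines.
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # bucket in 0..99
    return "test" if bucket < test_percent else "train"


def generate_examples(data_file, split, test_percent=20):
    """Yield (key, example) pairs for the rows that belong to `split`.

    Assumes one example per line and a single "text" field (illustrative only).
    """
    with open(data_file, encoding="utf-8") as f:
        for idx, line in enumerate(f):
            text = line.rstrip("\n")
            if not text:
                continue  # skip blank lines
            # Keep the row only if it was assigned to the requested split.
            if split_for_key(idx, test_percent) == split:
                yield idx, {"text": text}
```

Because the assignment depends only on the row index, both SplitGenerator calls read the same file but end up with disjoint rows, and the result is the same on every run. Hashing the index instead of taking the first or last N rows also avoids a skewed split when the file happens to be sorted.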