Splitting Dataset in the dataset loading script

mariosasko · September 16, 2022, 3:11pm

Hi! You can define two SplitGenerator objects, one for train and one for test, pass the same data file to each of them, and implement the splitting in _generate_examples.

Here is a code skeleton you can use:

def _split_generators(self, dl_manager):
    ...
    return [
        datasets.SplitGenerator(name="train", gen_kwargs={"data_file": data_file, "split": "train"}),
        datasets.SplitGenerator(name="test", gen_kwargs={"data_file": data_file, "split": "test"}),
    ]

def _generate_examples(self, data_file, split):
    # split data based on the `split` value
    ...
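
One possible way to fill in the splitting step is sketched below. It is a standalone function standing in for the `_generate_examples` method, so it runs without the `datasets` library. It assumes a plain-text file with one example per line, a single `text` field, and a hypothetical `test_every` parameter that sends every N-th line to the test split. None of these come from the original question, so adapt them to your data.

```python
def generate_examples(data_file, split, test_every=5):
    """Yield (key, example) pairs belonging to the requested split.

    Assumptions (illustrative only): one example per line, and every
    `test_every`-th line (indices 0, test_every, ...) goes to "test".
    """
    with open(data_file, encoding="utf-8") as f:
        for idx, line in enumerate(f):
            # The assignment depends only on the line index, so both
            # splits see the same partition on every run.
            line_split = "test" if idx % test_every == 0 else "train"
            if line_split == split:
                yield idx, {"text": line.rstrip("\n")}
```

Inside a loading script, the same body would go into `_generate_examples(self, data_file, split)`. Because the rule is deterministic, the train and test splits never overlap. If the file is ordered in a way that correlates with the labels, an index-based rule like this may give unrepresentative splits, and a hash of a stable example ID could be a better choice.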
