Splitting Dataset in the dataset loading script

astariul · September 16, 2022, 7:18am

I’m defining my own dataset. To do this, I’m following the tutorial in the docs and creating a dataset loading script (see the docs).

But I’m facing an issue: my data is located in a single file, and I would like to split it into train and test subsets.

As far as I understand, it’s not possible.
In the _split_generators() method, since I have a single file, I can assign it to only one SplitGenerator…

As an alternative, I defined a single split in my dataset loading script and tried to call train_test_split() deterministically, but even with a fixed random seed, it gives different results every time…

PS: I know I could just split my single file, but unfortunately I don’t have control over that file…

mariosasko · September 16, 2022, 3:11pm

Hi! You can define two SplitGenerator objects, one for train and one for test, and pass that file to each of them, and implement the splitting in _generate_examples.

The code skeleton you can use:

def _split_generators(self, dl_manager):
    ...
    return [
        datasets.SplitGenerator(name="train", gen_kwargs={"data_file": data_file, "split": "train"}),
        datasets.SplitGenerator(name="test", gen_kwargs={"data_file": data_file, "split": "test"}),
    ]

def _generate_examples(self, data_file, split):
    # split data based on the `split` value, yielding (key, example) pairs
    ...
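
To make the skeleton more concrete, here is a minimal sketch of one way the splitting step could look. It is not the only option: it assumes the file has one example per line, and the helper name `iter_split`, the `test_every` parameter, and the `"text"` field are all made up for illustration. Every fifth row goes to test and the rest to train, so the result depends only on row order and is identical on every run, with no random seed involved.

```python
def iter_split(rows, split, test_every=5):
    """Yield (key, example) pairs that belong to `split` ("train" or "test").

    Hypothetical helper: every `test_every`-th row (0, 5, 10, ...) is test,
    all other rows are train.
    """
    if split not in ("train", "test"):
        raise ValueError(f"unknown split: {split!r}")
    for idx, row in enumerate(rows):
        is_test = idx % test_every == 0
        if is_test == (split == "test"):
            # The row index doubles as a unique example key.
            yield idx, {"text": row}


def generate_examples(data_file, split):
    # Stand-in for the builder method _generate_examples(self, data_file, split).
    # Assumes one example per line in a UTF-8 text file.
    with open(data_file, encoding="utf-8") as f:
        yield from iter_split((line.rstrip("\n") for line in f), split)
```

Inside the loading script, the body of `generate_examples` would go into `_generate_examples`. Since both SplitGenerator objects receive the same file, each split reads the file once and keeps only its own rows. If the file is sorted in some meaningful way, hashing a stable field of each row instead of using the index may give a less biased split.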
