Not declaring splits inside of dataset loading script

aclifton314 · July 27, 2022, 10:09pm

Python: 3.9.7
Datasets: 2.1.0

I have a large dataset that I have opted to create by writing a dataset loading script following these instructions. Unfortunately, my workflow is constrained to loading the dataset and then splitting it into train and test sets. Initially when I load the dataset, I’d like it to not have any splits:

>>> my_ds
Dataset({
    features: ['label', 'data'],
    num_rows: 1000000
})

and then I would like to call train_test_split() on it:

final_ds = my_ds.train_test_split(train_size=0.9, shuffle=True)

I noticed that in the instructions listed above, there is a _split_generators method that segments the data into individual splits. Something like:

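from typing import List

import datasets

# Method of the dataset builder class defined in the loading script.
def _split_generators(self, dl_manager: datasets.DownloadManager) -> List[datasets.SplitGenerator]:
    urls_to_use = _URLS
    downloaded_files = dl_manager.download_and_extract(urls_to_use)

    # A single split named "train" that covers the whole file.
    return [
        datasets.SplitGenerator(name=datasets.Split.TRAIN, gen_kwargs={'filepath': downloaded_files['my_file']})
    ]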

Using the above _split_generators(), I can successfully run the loading script and create the dataset, but the result has a train split associated with it. Is there a way to write the loading script so that it doesn’t create any split, giving me the split-free dataset shown above?

Thanks in advance for your help and guidance!!!

lhoestq · July 28, 2022, 10:03am

Hi! Having at least one split is a constraint we have in the datasets lib. We may allow datasets that don’t have the notion of splits, but this is not supported yet. For now you can just keep it as “train”, or define a “full” split:

        return [
            datasets.SplitGenerator(name="full", gen_kwargs={'filepath': downloaded_files['my_file']})
        ]

aclifton314 · July 28, 2022, 3:44pm

@lhoestq Perfect! I’ll incorporate this into my workflow and make the necessary changes!
