Not declaring splits inside of dataset loading script

Python: 3.9.7
Datasets: 2.1.0

I have a large dataset that I have opted to create by writing a dataset loading script following these instructions. Unfortunately, my workflow is constrained to loading the dataset and then splitting it into train and test sets. Initially when I load the dataset, I’d like it to not have any splits:

>>> my_ds
Dataset({
    features: ['label', 'data'],
    num_rows: 1000000
})

and then I would like to call train_test_split() on it:

final_ds = my_ds.train_test_split(train_size=0.9, shuffle=True)

I noticed that in the instructions listed above, there is a _split_generators method that segments the data into individual splits. Something like:

def _split_generators(self, dl_manager: DownloadManager) -> List[datasets.SplitGenerator]:
        urls_to_use = _URLS
        downloaded_files = dl_manager.download_and_extract(urls_to_use)
        # One SplitGenerator per split; here only "train" is declared.
        return [
            datasets.SplitGenerator(name=datasets.Split.TRAIN, gen_kwargs={'filepath': downloaded_files['my_file']})
        ]

Using the above _split_generators(), I can successfully run the loading script and create the dataset; however, it has a train split associated with it. Is there a way to write the loading script so that it doesn’t create any split, giving the split-free dataset shown above?

Thanks in advance for your help and guidance!!!

Hi! Having at least one split is a constraint of the datasets library. We may allow datasets that don’t have the notion of splits, but this is not supported yet. For now you can just keep it as “train”, or define a “full” split:

        return [
            datasets.SplitGenerator(name="full", gen_kwargs={'filepath': downloaded_files['my_file']})
        ]

@lhoestq Perfect! I’ll incorporate this into my workflow and make the necessary changes!