Python: 3.9.7
Datasets: 2.1.0
I have a large dataset that I create with a custom dataset loading script, written by following these instructions. My workflow requires loading the whole dataset first and only then splitting it into train and test sets. When I first load the dataset, I’d like it to have no splits:
>>> my_ds
Dataset({
    features: ['label', 'data'],
    num_rows: 1000000
})
and then I would like to call train_test_split() on it:
final_ds = my_ds.train_test_split(train_size=0.9, shuffle=True)
I noticed that in the instructions listed above, there is a _split_generators method that segments the data into individual splits. Something like:
def _split_generators(self, dl_manager: DownloadManager) -> List[datasets.SplitGenerator]:
    urls_to_use = _URLS
    downloaded_files = dl_manager.download_and_extract(urls_to_use)
    # Only one split is declared, and it is named "train".
    return [
        datasets.SplitGenerator(
            name=datasets.Split.TRAIN,
            gen_kwargs={'filepath': downloaded_files['my_file']},
        )
    ]
Using the above _split_generators(), I can successfully run the loading script and create the dataset. However, the result has a train split associated with it. Is there a way to write the loading script so that it doesn’t create any split, giving me a dataset like the split-free one shown above?
Thanks in advance for your help and guidance!!!