I recently created a dataset consisting of information from over 20k repositories. I was able to upload it to the Hub, but I now see that my train/test split came out very wrong, with 8k rows in train, 8k in test, and 36k in validation. Is there a way to simply adjust the split, or do I need to create the dataset all over again?
The code I used to load the dataset originally was

from datasets import load_dataset

the_pile_parsed = load_dataset("json", data_files="parsed/*.jsonl", split="train")
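For context on what I am hoping is possible: below is a minimal sketch of the kind of fix I have in mind, i.e. reloading what is already on the Hub, merging the mis-sized splits, and re-splitting to roughly 80/10/10 before pushing again. The repo id "my-username/the-pile-parsed", the seed, and the split ratios are placeholders, not my actual setup.

from datasets import load_dataset, concatenate_datasets, DatasetDict

# Reload the dataset as it currently exists on the Hub (repo id is a placeholder).
existing = load_dataset("my-username/the-pile-parsed")

# Merge the mis-sized splits back into a single dataset.
full = concatenate_datasets([existing[name] for name in existing])

# Carve off ~10% for test, then ~10% of the remainder for validation.
split_1 = full.train_test_split(test_size=0.1, seed=42)
split_2 = split_1["train"].train_test_split(test_size=0.1, seed=42)

fixed = DatasetDict(
    {
        "train": split_2["train"],
        "validation": split_2["test"],
        "test": split_1["test"],
    }
)

# Push the corrected splits back to the same Hub repo.
fixed.push_to_hub("my-username/the-pile-parsed")

Is something along these lines the recommended approach, or is there a simpler way to adjust the splits directly on the Hub?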