How to change my train, test, validation split?

I recently created a dataset consisting of information from over 20k repositories. I was able to upload it to the Hub, but I now see that my train/test/validation split is badly off: 8k rows in train, 8k in test, and 36k in validation. Is there a way to simply adjust the split, or do I need to create the dataset all over again?

The code I used to load the dataset originally was

from datasets import load_dataset
the_pile_parsed = load_dataset("json", data_files="parsed/*.jsonl", split="train")

Hi! You can either re-upload the corrected files or use Dataset.select together with concatenate_datasets to adjust the splits after loading. If you choose the latter option, you can put the code needed to adjust the splits in the dataset's README on the Hub so that users can follow it.