I recently created a dataset consisting of information from over 20k repositories. I was able to upload it to the Hub, but I now see that my train/test split came out very wrong, with 8k rows in train, 8k in test, and 36k in validation. Is there a way to simply adjust the split, or do I need to create the dataset all over again?
The code I used to load the dataset originally was

from datasets import load_dataset

the_pile_parsed = load_dataset("json", data_files="parsed/*.jsonl", split="train")
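For context on what I am hoping is possible: below is a minimal sketch of the kind of fix I have in mind, i.e. reloading what is already on the Hub, merging the mis-sized splits, and re-splitting to roughly 80/10/10 before pushing again. The repo id "my-username/the-pile-parsed", the seed, and the split ratios are placeholders, not my actual setup.

from datasets import load_dataset, concatenate_datasets, DatasetDict

# Reload the dataset as it currently exists on the Hub (repo id is a placeholder).
existing = load_dataset("my-username/the-pile-parsed")

# Merge the mis-sized splits back into a single dataset.
full = concatenate_datasets([existing[name] for name in existing])

# Carve off ~10% for test, then ~10% of the remainder for validation.
split_1 = full.train_test_split(test_size=0.1, seed=42)
split_2 = split_1["train"].train_test_split(test_size=0.1, seed=42)

fixed = DatasetDict(
    {
        "train": split_2["train"],
        "validation": split_2["test"],
        "test": split_1["test"],
    }
)

# Push the corrected splits back to the same Hub repo.
fixed.push_to_hub("my-username/the-pile-parsed")

Is something along these lines the recommended approach, or is there a simpler way to adjust the splits directly on the Hub?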