How can you overwrite a split? + Possible Bug

Goal
I want to be able to overwrite a split in my dataset. Is there a way to do so?

Current Behavior
When I push to an existing split I get this error:

ValueError: Split complexRoofLocation_01Apr2023_to_31May2023test already present

Is there a way to remove a split, without manually going into the dataset?
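One possible workaround, sketched here and not confirmed as the recommended approach: delete the split's shard files from the repo through huggingface_hub before pushing again. The sketch assumes shards are stored as `data/<split>-<index>-of-<total>...parquet`, which is how push_to_hub has laid out files in the versions I have seen; `split_shard_paths` and `delete_split_files` are hypothetical helper names, not library functions.

```python
from typing import List


def split_shard_paths(repo_files: List[str], split: str) -> List[str]:
    """Return the files in `repo_files` that look like shards of `split`.

    Assumption: shards follow the `data/<split>-<index>-of-<total>...parquet` layout.
    """
    prefix = f"data/{split}-"
    return [f for f in repo_files if f.startswith(prefix) and f.endswith(".parquet")]


def delete_split_files(repo_id: str, split: str) -> List[str]:
    """Delete the shard files of `split` from a dataset repo and return their paths."""
    # Imported lazily so split_shard_paths can be used without huggingface_hub installed.
    from huggingface_hub import HfApi

    api = HfApi()
    files = api.list_repo_files(repo_id, repo_type="dataset")
    to_delete = split_shard_paths(files, split)
    for path in to_delete:
        # Each call creates one commit that removes the file from the dataset repo.
        api.delete_file(path_in_repo=path, repo_id=repo_id, repo_type="dataset")
    return to_delete
```

Note that this only removes the data files. Any split metadata stored in the repo (for example in the dataset card) is not touched, so it may still need to be fixed separately.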

Potential Bug
What’s strange is that datasets still overwrites the split, even though the operation errors out with the ValueError above:

Pushing dataset shards to the dataset hub: 100% [.....................] 1/1 [00:00<00:00, 55.04it/s]

The error makes you feel like the whole operation failed, but in fact your dataset is now changed. That feels like a bug.

Additional Strange Behavior
While it updates the split, it doesn’t update the split’s information. Because of this, when you pull down the dataset you may end up getting a NonMatchingSplitsSizesError. I do, because my original split had 5 rows, but the version I tried to overwrite it with had only 4. So the dataset metadata states there are 5 rows, but only 4 exist in the split.
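To make the mismatch concrete, here is a simplified illustration of the kind of comparison that fails. It is not the library's actual verification code; `find_size_mismatches` is a hypothetical helper, and the split name `test` in the usage lines is a stand-in for the real one.

```python
from typing import Dict, List, Tuple


def find_size_mismatches(
    recorded: Dict[str, int], actual: Dict[str, int]
) -> List[Tuple[str, int, int]]:
    """Return (split, recorded_rows, actual_rows) for every split whose counts differ."""
    mismatches = []
    for split, expected in recorded.items():
        found = actual.get(split)
        # Only report splits that exist in both mappings but disagree on row count.
        if found is not None and found != expected:
            mismatches.append((split, expected, found))
    return mismatches


# The metadata still says 5 rows, but the overwritten split only holds 4.
recorded_sizes = {"test": 5}
actual_sizes = {"test": 4}
print(find_size_mismatches(recorded_sizes, actual_sizes))
```

As far as I know, recent versions of datasets accept `verification_mode="no_checks"` in `load_dataset` (older ones used `ignore_verifications=True`), which should skip this check while the metadata is stale. That only hides the symptom; the recorded split size is still wrong.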

Expected Behavior
This basically corrupts the data. Either it should let the overwrite happen or it shouldn’t do anything.

Appreciate you taking the time to read this!

Hi @govindrai! Could you please provide a short code snippet that reproduces this behavior, and tell us which version of datasets you use?
Also feel free to open an issue on GitHub describing these bugs and how to reproduce them :slight_smile:

Thanks! Opened bug report. Colab link in bug report!