Pushing multiple splits of a dataset to a single repo on the Hub

I have an audio dataset dict of 450,000+ records. I split the dataset into 4 splits and, after processing, tried to push it to the Hub, pushing each split separately. The 1st split pushed fine. But when I push the 2nd split, its data replaces the 1st split's data in the repository. How can I add data to what already exists in the repository?

Hi! Instead of pushing each split separately, it’s better to create a DatasetDict and push everything in a single call to push_to_hub. You can do this as follows:

from datasets import DatasetDict
ddict = DatasetDict({
    "split1": split1_ds,   # split1_ds is an instance of `datasets.Dataset`
    "split2": split2_ds,
    "split3": split3_ds,
    "split4": split4_ds,
})
ddict.push_to_hub("repo_id")

If you still want to push the sub-datasets separately, make sure that each split has a unique name (you can set it with the split parameter of push_to_hub), and pass ignore_verifications=True when loading the dataset from the Hub (this is required due to a known bug, which will be fixed soon).
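As a rough sketch of the separate-push approach, something like the following should work. The helper name `push_splits_separately` is hypothetical (it is not part of `datasets`); it only assumes each value behaves like a `datasets.Dataset`, i.e. has a `push_to_hub(repo_id, split=...)` method.

```python
from typing import Any, List, Mapping


def push_splits_separately(splits: Mapping[str, Any], repo_id: str) -> List[str]:
    """Push each dataset to the same repo under its own split name.

    `splits` maps a split name to a `datasets.Dataset`-like object.
    (Hypothetical helper for illustration, not a `datasets` API.)
    """
    pushed = []
    for name, ds in splits.items():
        # `split` sets the split name on the Hub. Pushing twice under the
        # same name replaces the earlier data, so each key must be distinct
        # (a mapping guarantees that).
        ds.push_to_hub(repo_id, split=name)
        pushed.append(name)
    return pushed
```

You would call it as `push_splits_separately({"split1": split1_ds, "split2": split2_ds}, "repo_id")`, with the same placeholder names as in the snippet above.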
