I have an audio dataset dict with 450,000+ records. I split the dataset into 4 splits and, after processing, tried to push it to the Hub, pushing each split separately. The first split went through fine, but when I pushed the second split, it replaced the first split's data in the repository. How can I add data to what already exists in the repository?
Hi! Instead of pushing each split separately, it's better to create a DatasetDict and push everything in a single call to push_to_hub. You can do this as follows:
from datasets import DatasetDict

ddict = DatasetDict({
    "split1": split1_ds,  # split1_ds is an instance of `datasets.Dataset`
    "split2": split2_ds,
    "split3": split3_ds,
    "split4": split4_ds,
})
ddict.push_to_hub("repo_id")
If you still want to push the sub-datasets separately, make sure that each split has a unique name (you can control this with the split parameter of push_to_hub) and that you pass ignore_verifications=True when loading the dataset from the Hub (required due to a known bug that will be fixed soon).
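For the separate-push route, a minimal sketch could look like the following. The helper name push_splits_separately and the repo id "user/repo" are hypothetical and only for illustration; the only library call assumed is Dataset.push_to_hub with its split keyword, as described above.

```python
from typing import Any, Dict, List


def push_splits_separately(splits: Dict[str, Any], repo_id: str) -> List[str]:
    """Push each dataset to `repo_id` under its own split name.

    `splits` maps a split name to an object exposing
    `push_to_hub(repo_id, split=...)`, e.g. a `datasets.Dataset`.
    Hypothetical helper for illustration, not part of the `datasets` library.
    """
    pushed = []
    for split_name, ds in splits.items():
        # Dict keys are unique, so each push targets a distinct split name
        # instead of reusing the same one and replacing the previous data.
        ds.push_to_hub(repo_id, split=split_name)
        pushed.append(split_name)
    return pushed


# Usage (requires being logged in to the Hub; names are placeholders):
# push_splits_separately(
#     {"split1": split1_ds, "split2": split2_ds,
#      "split3": split3_ds, "split4": split4_ds},
#     "user/repo",
# )
```

After that, the dataset would be loaded with load_dataset("user/repo", ignore_verifications=True), as mentioned above. On newer versions of datasets this flag may have been superseded by a different argument, so check the documentation for your installed version.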