Pushing multiple splits of dataset to a single repo of Hub

mohamed-illiyas · March 9, 2022, 4:57am

I have an audio dataset dict of 450,000+ records. I split the dataset into 4 splits and, after processing, tried to push it to the Hub. I am pushing each split separately. The 1st split pushed fine, but when I push the 2nd split, the data of the 1st split in the repository is replaced by the 2nd split's data. How can I add/push data to the data that already exists in the repository?

mariosasko · April 7, 2022, 12:03pm

Hi! Instead of pushing each split separately, it’s better to create a DatasetDict and push everything in a single call to push_to_hub. You can do this as follows:

from datasets import DatasetDict
ddict = DatasetDict({
    "split1": split1_ds,   # split1_ds is an instance of `datasets.Dataset`
    "split2": split2_ds,
    "split3": split3_ds,
    "split4": split4_ds,
})
ddict.push_to_hub("repo_id")  # replace "repo_id" with your own repo, e.g. "username/dataset_name"

If you still want to push the sub-datasets separately, make sure that each split has a unique name (you can set it with the split parameter of push_to_hub) and that you pass ignore_verifications=True when loading the dataset from the Hub (required due to a known bug that will be fixed soon).
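For the separate-push route, a minimal sketch could look like the following. The helper name push_splits_separately and the repo ID in the usage note are hypothetical, introduced only for illustration; the one library call it relies on is push_to_hub with its split parameter, as described above.

```python
def push_splits_separately(splits, repo_id):
    """Push each split in its own call, under a unique split name.

    `splits` maps a split name to a `datasets.Dataset` (or any object
    exposing `push_to_hub(repo_id, split=...)`). Because the names are
    dict keys, they are unique by construction.
    This helper is a hypothetical illustration, not part of `datasets`.
    """
    pushed = []
    for name, ds in splits.items():
        # Passing `split=name` labels this upload with its own split name,
        # so it does not reuse the default name of an earlier push.
        ds.push_to_hub(repo_id, split=name)
        pushed.append(name)
    return pushed
```

It would be called as push_splits_separately({"split1": split1_ds, "split2": split2_ds}, "username/dataset_name"), with the repo ID replaced by your own. Pushing the whole DatasetDict in one call, as in the snippet above, remains the simpler option.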
