Pushing multiple splits of a dataset to a single repo on the Hub

mariosasko · April 7, 2022, 12:03pm

Hi! Instead of pushing each split separately, it’s better to create a DatasetDict and push everything in a single call to push_to_hub. You can do this as follows:

from datasets import DatasetDict
ddict = DatasetDict({
    "split1": split1_ds,   # split1_ds is an instance of `datasets.Dataset`
    "split2": split2_ds,
    "split3": split3_ds,
    "split4": split4_ds,
})
# Push all splits to one Hub repo in a single call ("repo_id" is a placeholder).
ddict.push_to_hub("repo_id")

If you still want to push the sub-datasets separately, make sure each split has a unique name (you can set it with the split parameter of push_to_hub), and pass ignore_verifications=True when loading the dataset from the Hub (this is required due to a known bug that will be fixed soon).
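For the separate-push route, here is a rough sketch of what that could look like. The helper name push_splits_separately is made up for illustration and is not part of datasets; it only assumes each value has a push_to_hub(repo_id, split=...) method, as datasets.Dataset does. Keying the datasets by split name in a dict is one simple way to keep the names unique.

```python
from typing import Any, List, Mapping


def push_splits_separately(splits: Mapping[str, Any], repo_id: str) -> List[str]:
    """Hypothetical helper: push each dataset to `repo_id` under its own split name.

    `splits` maps split name -> object with a `push_to_hub(repo_id, split=...)`
    method (for example a `datasets.Dataset`). Dict keys cannot repeat, so
    every split name is unique by construction.
    """
    pushed = []
    for split_name, ds in splits.items():
        # One push_to_hub call per split, all targeting the same repo.
        ds.push_to_hub(repo_id, split=split_name)
        pushed.append(split_name)
    return pushed


# Usage (needs network access and a Hub login, so it is left commented out):
# from datasets import load_dataset
# push_splits_separately({"split1": split1_ds, "split2": split2_ds}, "repo_id")
# ds = load_dataset("repo_id", ignore_verifications=True)
```

If you don't need to push the splits at different times, the single DatasetDict push above is still the simpler option.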
