`push_to_hub` a dataset dict with subsets and splits (e.g., GLUE)

pietrolesci · April 22, 2022, 10:00am

Hi there,

I am trying to use push_to_hub to create a dataset composed of multiple subsets (e.g., “dataset_1”, “dataset_2”, etc.) and, within each subset, different splits (e.g., “train”, “test”, “dev”), like the GLUE dataset already available on the Hub.

Is there a way to do it?

Thanks a lot in advance for your help!

merve · April 25, 2022, 12:45pm

Hello, and welcome to the forum!

If you want your splits to be loaded programmatically, you can implement a dataset loading script, as is done for GLUE.

Let me know if it helps

lhoestq · April 26, 2022, 3:37pm

Hi! We are working on this.

Ultimately, with push_to_hub you will be able to have several subsets, one per directory, as defined in our documentation on how to structure your dataset repository (but with Parquet files instead of CSV).

yeshwanthv5 · September 6, 2023, 7:12pm

Hi, I have the same question. Is there a way to do this?

lhoestq · September 8, 2023, 8:54am

You can now use push_to_hub to push multiple subsets of your dataset! For example:

dataset_subset1.push_to_hub("username/dataset_name", "subset1")
dataset_subset2.push_to_hub("username/dataset_name", "subset2")

# later

from datasets import load_dataset

dataset_subset1 = load_dataset("username/dataset_name", "subset1")
dataset_subset2 = load_dataset("username/dataset_name", "subset2")

Each subset can be a DatasetDict made of multiple splits, or you can upload one split at a time:

dataset_subset1_train.push_to_hub("username/dataset_name", "subset1", split="train")
dataset_subset1_test.push_to_hub("username/dataset_name", "subset1", split="test")

# later

dataset_subset1_train = load_dataset("username/dataset_name", "subset1", split="train")
dataset_subset1_test = load_dataset("username/dataset_name", "subset1", split="test")
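For the original question (several subsets, each with its own splits), the two ideas above can be combined. The following is only a minimal sketch: `push_subsets` and `example` are hypothetical helper names, the toy columns are made up for illustration, and "username/dataset_name" is a placeholder. It assumes a recent `datasets` release in which `push_to_hub` accepts the subset name as its second argument, as in the snippets above, and that you are logged in to the Hub.

```python
from typing import Any, Mapping


def push_subsets(subsets: Mapping[str, Any], repo_id: str) -> list[str]:
    """Push each subset to `repo_id` under its own subset (config) name.

    `subsets` maps a subset name to a DatasetDict whose keys are split names.
    Returns the subset names in the order they were pushed.
    """
    pushed = []
    for config_name, dataset_dict in subsets.items():
        # The second argument is the subset name, as in the snippets above.
        dataset_dict.push_to_hub(repo_id, config_name)
        pushed.append(config_name)
    return pushed


def example() -> None:
    """Illustrative only: needs `datasets` installed and a Hub login."""
    from datasets import Dataset, DatasetDict

    subsets = {
        "subset1": DatasetDict({
            "train": Dataset.from_dict({"text": ["a", "b"], "label": [0, 1]}),
            "test": Dataset.from_dict({"text": ["c"], "label": [1]}),
        }),
        "subset2": DatasetDict({
            "train": Dataset.from_dict({"text": ["d"], "label": [0]}),
            "dev": Dataset.from_dict({"text": ["e"], "label": [1]}),
        }),
    }
    # Placeholder repository id; replace with your own before running.
    push_subsets(subsets, "username/dataset_name")
```

Calling `example()` would upload both subsets; afterwards each one should be loadable with `load_dataset("username/dataset_name", "subset1")` and so on, as shown above.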

stepkurniawan · October 20, 2023, 9:19am

Hi @lhoestq! I think I found a bug that occurs when you try to overwrite what you have pushed before.
Could you check out my post, please? “Load_dataset() doesn’t load ONE of the Subset” (Beginners, Hugging Face Forums)

Siki-77 · March 16, 2024, 10:31am

According to the description in Manual Configuration (huggingface.co), you can just add a README.md through the Hugging Face UI.
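As a rough illustration of the manual route, the README.md starts with a YAML block whose `configs` list declares each subset and the files behind each split. The sketch below just renders such a block as a string; `render_configs_block` is a hypothetical helper, and the subset names and file patterns are assumptions about how the repository is laid out, so adjust them to match your actual files.

```python
def render_configs_block(layout: dict[str, dict[str, str]]) -> str:
    """Render the YAML header for README.md from {subset: {split: path}}."""
    lines = ["---", "configs:"]
    for config_name, splits in layout.items():
        lines.append(f"- config_name: {config_name}")
        lines.append("  data_files:")
        for split, path in splits.items():
            lines.append(f"  - split: {split}")
            # Quote the path so glob characters are read as plain text.
            lines.append(f'    path: "{path}"')
    lines.append("---")
    return "\n".join(lines) + "\n"


# Assumed layout: one directory per subset, one file pattern per split.
layout = {
    "subset1": {"train": "subset1/train-*", "test": "subset1/test-*"},
    "subset2": {"train": "subset2/train-*", "dev": "subset2/dev-*"},
}
print(render_configs_block(layout))
```

The printed text can be pasted at the very top of README.md in the dataset repository.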
