Hi there,
I am trying to use push_to_hub to create a dataset composed of multiple subsets (e.g., “dataset_1”, “dataset_2”, etc.) and, within each subset, different splits (e.g., “train”, “test”, “dev”), like the GLUE dataset already available on the Hub.
Is there a way to do it?
Thanks a lot in advance for your help!
merve
April 25, 2022, 12:45pm
2
Hello, and welcome to the forum!
If you want your splits to be loaded programmatically, you can implement a dataset loading script, as is done for GLUE.
Let me know if that helps.
Hi! We are working on this.
Ultimately, with push_to_hub you will be able to have several subsets, one per directory, as defined in our documentation on how to structure your dataset repository (but with Parquet files instead of CSV).
Hi, I have the same question. Is there a way to do this?
lhoestq
September 8, 2023, 8:54am
5
You can now use push_to_hub to push multiple subsets of your dataset! For example:
dataset_subset1.push_to_hub("username/dataset_name", "subset1")
dataset_subset2.push_to_hub("username/dataset_name", "subset2")
# later
dataset_subset1 = load_dataset("username/dataset_name", "subset1")
dataset_subset2 = load_dataset("username/dataset_name", "subset2")
Each subset can be a DatasetDict made of multiple splits, or you can upload one split at a time:
dataset_subset1_train.push_to_hub("username/dataset_name", "subset1", split="train")
dataset_subset1_test.push_to_hub("username/dataset_name", "subset1", split="test")
# later
dataset_subset1_train = load_dataset("username/dataset_name", "subset1", split="train")
dataset_subset1_test = load_dataset("username/dataset_name", "subset1", split="test")
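To tie the two variants above together, here is a minimal sketch of pushing several subsets in one go, where each subset is a DatasetDict holding its own splits. The helper names `push_subsets` and `build_example_subsets`, the column names, and the toy rows are all made up for illustration; only `push_to_hub`, `Dataset.from_dict` and `DatasetDict` come from the datasets library. It assumes a recent version of datasets where `push_to_hub` accepts `config_name`, and that you are already logged in to the Hub.

```python
def push_subsets(repo_id, subsets):
    """Push every subset to the same repo, one configuration per subset.

    `subsets` maps a subset (configuration) name to an object exposing
    `push_to_hub`, typically a DatasetDict with several splits.
    Returns the list of configuration names that were pushed.
    """
    pushed = []
    for config_name, dataset in subsets.items():
        # Same call as in the post above, with the subset name passed by keyword.
        dataset.push_to_hub(repo_id, config_name=config_name)
        pushed.append(config_name)
    return pushed


def build_example_subsets():
    """Build two tiny in-memory subsets (hypothetical toy data)."""
    # Imported lazily so the helper above can be used without importing datasets here.
    from datasets import Dataset, DatasetDict

    def make_splits(prefix):
        return DatasetDict({
            "train": Dataset.from_dict({"text": [f"{prefix} a", f"{prefix} b"], "label": [0, 1]}),
            "test": Dataset.from_dict({"text": [f"{prefix} c"], "label": [1]}),
        })

    return {"subset1": make_splits("first"), "subset2": make_splits("second")}
```

Calling `push_subsets("username/dataset_name", build_example_subsets())` should then leave you with a repo where `load_dataset("username/dataset_name", "subset1", split="train")` works as shown earlier in this post.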
Hi @lhoestq! I think I found a bug that occurs when you try to overwrite what you have pushed before.
Could you check out my post, please? Load_dataset() doesn’t load ONE of the Subset - Beginners - Hugging Face Forums
Going by the description in Manual Configuration (huggingface.co), you can just add a README.md through the Hugging Face UI.
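For anyone wondering what that README.md would contain, here is a rough sketch that prints a `configs` block in the YAML front-matter format described on the Manual Configuration page. The function name `render_configs_yaml` is made up, and the `subset/split-*` paths are an assumption about how the files are laid out in the repo, so adjust them to match what is actually there.

```python
def render_configs_yaml(subsets):
    """Render a `configs` YAML front-matter block for a dataset card.

    `subsets` maps a subset name to the list of its split names.
    """
    lines = ["---", "configs:"]
    for config_name, splits in subsets.items():
        lines.append(f"- config_name: {config_name}")
        lines.append("  data_files:")
        for split in splits:
            lines.append(f"  - split: {split}")
            # Assumed layout: one directory per subset, files named <split>-*
            lines.append(f'    path: "{config_name}/{split}-*"')
    lines.append("---")
    return "\n".join(lines)


# Example with two hypothetical subsets; paste the printed block at the top of README.md.
print(render_configs_yaml({"subset1": ["train", "test"], "subset2": ["train"]}))
```

If one subset stops loading after an overwrite, comparing the `configs` entries in README.md against the files actually present in the repo is probably a good first check.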