I want to upload JSON, JSONL, and ZIP files as subsets, but so far I only know how to upload Parquet files as subsets. Is there any way, using the API or CLI, to upload any file type as a subset in a dataset repo?
It seems that the only way to create a subset of a dataset from ordinary files is to write a script. However, this forum post is two years old, so the situation may have changed.
Hi! There is no direct API for that yet, but you can set the subsets programmatically by editing the Dataset Card data (i.e. the YAML at the top of the README.md).
See what the YAML should look like if you want multiple configs here: Manual Configuration
See how to manipulate Dataset Cards here: Repository Cards
For example:
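from huggingface_hub import DatasetCard

# Load the existing dataset card (README.md) from the Hub
dataset_card = DatasetCard.load("username/dataset_name")

# Declare one config (subset) per data directory
dataset_card.data["configs"] = [
    {"config_name": "subset0", "data_dir": "subset0"},
    {"config_name": "subset1", "data_dir": "subset1"},
]

# Push the updated card back to the dataset repo
dataset_card.push_to_hub("username/dataset_name")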
This creates two subsets, "subset0" and "subset1", for a dataset structured like this:
my_dataset/
├── README.md
├── subset0/
│   ├── abc.jsonl
│   └── def.jsonl
└── subset1/
    └── ghi.jsonl
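If you have many subset folders, you could generate the `configs` list from the local directory instead of typing it by hand. The following is only a rough sketch, not an official workflow: `build_configs` and `publish_subsets` are hypothetical helper names, and it assumes that every immediate subfolder is one subset, that the dataset repo already exists, that you are logged in, and that the folder contains a README.md so the card can be loaded after the upload.

```python
from pathlib import Path


def build_configs(root):
    """Return one config entry per immediate subdirectory of `root`.

    Hidden directories (names starting with ".") are skipped.
    """
    subdirs = sorted(
        p.name
        for p in Path(root).iterdir()
        if p.is_dir() and not p.name.startswith(".")
    )
    return [{"config_name": name, "data_dir": name} for name in subdirs]


def publish_subsets(root, repo_id):
    """Upload `root` to the Hub, then declare its subfolders as subsets.

    Assumption: huggingface_hub is installed and the repo already exists.
    """
    # Imported lazily so build_configs works without huggingface_hub
    from huggingface_hub import DatasetCard, HfApi

    # Upload the data files (JSONL, JSON, ...) together with the README.md
    HfApi().upload_folder(
        folder_path=str(root), repo_id=repo_id, repo_type="dataset"
    )

    # Edit the YAML header of the uploaded card and push it back
    card = DatasetCard.load(repo_id)
    card.data["configs"] = build_configs(root)
    card.push_to_hub(repo_id)
```

For the layout above, `build_configs("my_dataset")` should give the same two entries as the hand-written list, and `publish_subsets("my_dataset", "username/dataset_name")` would upload the files and update the card in one go.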