Add a subset to a dataset from CLI?

geoffnn · February 5, 2025, 3:19pm

Hi!
I’m trying to modify a dataset on HF that contains subsets, essentially processing it and re-uploading it, e.g. Salesforce/lotsa_data.
What I’m currently doing is downloading the parquet files for each subset, processing each subset independently, storing the results in an S3 folder, and then using the CLI to upload the folder. This does not result in a dataset with subsets…
What’s the fastest way to reproduce the structure of the initial dataset after processing?

I think this is related to https://discuss.huggingface.co/t/create-multiple-dataset-subsets-at-the-same-time/12997
Thanks,

John6666 · February 5, 2025, 3:26pm

The latter is in Japanese, so it may be a little difficult to read. Try running it through Google Translate…
I think that in the past, writing a loading script was the best thing to do. Now, it may be smarter to use a custom DatasetBuilder.
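
As an alternative to a loading script or custom builder, here is a minimal sketch that relies on the Hub's YAML `configs` metadata in README.md to declare subsets. It assumes the processed data is available locally with one subfolder per subset, each holding parquet files, and that everything belongs to a single `train` split. The function name `build_configs_readme` and the folder names are hypothetical, and the layout may not match how the original dataset is organized.

```python
from pathlib import Path


def build_configs_readme(root: Path, pattern: str = "*.parquet") -> str:
    """Return README.md text whose YAML header declares one config per subfolder.

    Assumption: every subfolder of `root` that contains files matching
    `pattern` is one subset, and all of its files form the `train` split.
    """
    lines = ["---", "configs:"]
    for subset_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        # Skip folders that hold no data files for this pattern.
        if not any(subset_dir.glob(pattern)):
            continue
        lines += [
            f"- config_name: {subset_dir.name}",
            "  data_files:",
            "  - split: train",
            f'    path: "{subset_dir.name}/{pattern}"',
        ]
    lines += ["---", ""]
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical local folder holding the processed subsets.
    root = Path("processed_dataset")
    (root / "README.md").write_text(build_configs_readme(root), encoding="utf-8")
```

With the README.md written into the folder, the existing CLI upload should be enough, for example `huggingface-cli upload your-username/your-dataset processed_dataset . --repo-type dataset` (the repo name is a placeholder). If the data lives only in S3, it would need to be synced to a local folder first.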
