As far as I know, Datasets doesn’t currently support push_to_hub for streaming (iterable) datasets, and running push_to_hub again will overwrite your dataset.
However, it might be possible for you to implement something similar to how push_to_hub itself is implemented. There it uses HfApi.upload_file
(datasets/arrow_dataset.py at master · huggingface/datasets · GitHub) to upload each shard, but you could also use the new create_commit
function: Upload files to the Hub. You’d also need to keep track of the dataset info (dataset size, number of examples for each split, number of bytes, etc.; full list here: Main classes) so that you can upload it as well, like it’s done here: datasets/arrow_dataset.py at master · huggingface/datasets · GitHub
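To make the idea concrete, here’s a rough sketch of the shard-writing and bookkeeping part. This is not how push_to_hub actually does it: the JSON Lines format, shard naming, shard size, and the `write_shards` helper are all my own assumptions for illustration; the actual upload calls (HfApi.upload_file or create_commit) are only shown in comments since they need a repo and a token.

```python
import itertools
import json
from pathlib import Path

def write_shards(examples, out_dir, shard_size=1000):
    """Consume an iterator of example dicts (e.g. from a streaming
    dataset), write shards of `shard_size` examples each as JSON Lines,
    and return (shard_paths, info) where `info` holds the per-split
    counts you'd want to upload alongside the data."""
    out_dir = Path(out_dir)
    shard_paths = []
    num_examples = 0
    num_bytes = 0
    it = iter(examples)
    for shard_idx in itertools.count():
        batch = list(itertools.islice(it, shard_size))
        if not batch:
            break
        # Hypothetical naming scheme, not the one Datasets uses.
        path = out_dir / f"train-{shard_idx:05d}.jsonl"
        with open(path, "w") as f:
            for ex in batch:
                line = json.dumps(ex)
                f.write(line + "\n")
                num_bytes += len(line) + 1
        num_examples += len(batch)
        shard_paths.append(path)
    info = {
        "splits": {
            "train": {"num_examples": num_examples, "num_bytes": num_bytes}
        }
    }
    return shard_paths, info

# Each shard could then be uploaded with HfApi.upload_file, roughly:
#   from huggingface_hub import HfApi
#   api = HfApi()
#   for path in shard_paths:
#       api.upload_file(
#           path_or_fileobj=str(path),
#           path_in_repo=path.name,
#           repo_id="user/my-dataset",   # placeholder repo id
#           repo_type="dataset",
#       )
# and `info` serialized to JSON and uploaded the same way (or batched
# into a single create_commit call).
```

With create_commit you could batch all the shard uploads plus the info file into one commit instead of one commit per file, which keeps the repo history cleaner.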
Maybe there’s an easier way though, so I’ll pass this along to the Datasets team to see if they have any other thoughts!