Incrementally adding processed examples to a dataset

As far as I know, :hugs: Datasets doesn’t currently support push_to_hub for streaming datasets, and running push_to_hub again will overwrite your existing dataset.

However, it might be possible for you to implement something similar to how push_to_hub itself is implemented. There, HfApi.upload_file is used to upload each shard (datasets/arrow_dataset.py at master · huggingface/datasets · GitHub), but you could also use the newer create_commit function: Upload files to the Hub. You’d also need to keep track of the dataset info (dataset size, number of examples for each split, number of bytes, etc.; full list here: Main classes) so that you can upload it as well, like it’s done here: datasets/arrow_dataset.py at master · huggingface/datasets · GitHub.
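As a rough sketch of that idea (not an official Datasets feature): you could write each processed batch out as its own Parquet shard and commit it to the dataset repo with create_commit, so earlier shards are never overwritten. The repo name, shard naming scheme, and `push_shard` helper below are hypothetical placeholders, and you’d still be responsible for updating the dataset info / dataset card with the running totals yourself.

```python
from datasets import Dataset
from huggingface_hub import HfApi, CommitOperationAdd

api = HfApi()
repo_id = "my-username/my-incremental-dataset"  # hypothetical dataset repo

def push_shard(examples: dict, shard_index: int):
    """Write a batch of processed examples to a Parquet file and commit it as a new shard."""
    shard = Dataset.from_dict(examples)
    local_path = f"shard-{shard_index:05d}.parquet"
    shard.to_parquet(local_path)

    api.create_commit(
        repo_id=repo_id,
        repo_type="dataset",
        operations=[
            CommitOperationAdd(
                path_in_repo=f"data/train-{shard_index:05d}.parquet",
                path_or_fileobj=local_path,
            )
        ],
        commit_message=f"Add shard {shard_index}",
    )

# e.g. inside your processing loop:
# for i, batch in enumerate(stream_of_processed_batches):
#     push_shard(batch, i)
```

Since each call adds a new file rather than replacing the old ones, you can keep appending shards as your processing job progresses, and the Hub viewer should still be able to load the split from the Parquet files.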

Maybe there’s an easier way though, so I’ll pass this along to the Datasets team to see if they have any other thoughts!