As far as I know, Datasets doesn’t currently support push_to_hub for streaming (iterable) datasets, and running push_to_hub again will overwrite your dataset.
However, it might be possible for you to implement something similar to how push_to_hub itself is implemented. There it uses HfApi.upload_file
(datasets/arrow_dataset.py at master · huggingface/datasets · GitHub) to upload each shard, but you could also use the new create_commit
function: Upload files to the Hub. You’d also need to keep track of the dataset info (dataset size, number of examples for each split, number of bytes, etc.; full list here: Main classes) so that you can upload it as well, like it’s done here: datasets/arrow_dataset.py at master · huggingface/datasets · GitHub
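To make the idea concrete, here’s a rough sketch of the shard-writing and bookkeeping part. This is not how push_to_hub actually does it: the JSON Lines format, shard naming, shard size, and the `write_shards` helper are all my own assumptions for illustration; the actual upload calls (HfApi.upload_file or create_commit) are only shown in comments since they need a repo and a token.

```python
import itertools
import json
from pathlib import Path

def write_shards(examples, out_dir, shard_size=1000):
    """Consume an iterator of example dicts (e.g. from a streaming
    dataset), write shards of `shard_size` examples each as JSON Lines,
    and return (shard_paths, info) where `info` holds the per-split
    counts you'd want to upload alongside the data."""
    out_dir = Path(out_dir)
    shard_paths = []
    num_examples = 0
    num_bytes = 0
    it = iter(examples)
    for shard_idx in itertools.count():
        batch = list(itertools.islice(it, shard_size))
        if not batch:
            break
        # Hypothetical naming scheme, not the one Datasets uses.
        path = out_dir / f"train-{shard_idx:05d}.jsonl"
        with open(path, "w") as f:
            for ex in batch:
                line = json.dumps(ex)
                f.write(line + "\n")
                num_bytes += len(line) + 1
        num_examples += len(batch)
        shard_paths.append(path)
    info = {
        "splits": {
            "train": {"num_examples": num_examples, "num_bytes": num_bytes}
        }
    }
    return shard_paths, info

# Each shard could then be uploaded with HfApi.upload_file, roughly:
#   from huggingface_hub import HfApi
#   api = HfApi()
#   for path in shard_paths:
#       api.upload_file(
#           path_or_fileobj=str(path),
#           path_in_repo=path.name,
#           repo_id="user/my-dataset",   # placeholder repo id
#           repo_type="dataset",
#       )
# and `info` serialized to JSON and uploaded the same way (or batched
# into a single create_commit call).
```

With create_commit you could batch all the shard uploads plus the info file into one commit instead of one commit per file, which keeps the repo history cleaner.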
Maybe there’s an easier way though, so I’ll pass this along to the Datasets team to see if they have any other thoughts!