Streaming for Saving

Hi,
I am looking for a way to download a large dataset, transform it, and then upload it to another location. Note that the transformation of each instance is independent of the others.

I can load the dataset in streaming mode and start the transformation, but I cannot find a way to write to the Hugging Face Hub (in batches) while the download and transformation are ongoing. Wondering if such a pattern exists.
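
Roughly, the part that already works for me looks like this (the dataset name and the transform are just placeholders):

```python
from datasets import load_dataset

# Stream the source dataset so it is never fully downloaded up front
ds = load_dataset("user/large-dataset", split="train", streaming=True)

# Each example is transformed independently, so a plain map works
def transform(example):
    example["text"] = example["text"].lower()  # placeholder transformation
    return example

ds = ds.map(transform)
```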


With the datasets library’s push_to_hub, I don’t think you can upload the data unless all of it is available…
If files are produced frequently, in the worst case, there is a way to upload them manually, one after another, using HfApi…
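
Not sure if it fits your case, but a rough sketch of that manual approach could look like this: collect a batch of transformed examples, write it to an in-memory Parquet file, and push each shard with HfApi.upload_file as it becomes ready. The repo names, shard size, and transform below are placeholders, not a recommended setup…

```python
import io
import itertools

import pyarrow as pa
import pyarrow.parquet as pq
from datasets import load_dataset
from huggingface_hub import HfApi

api = HfApi()
repo_id = "user/transformed-dataset"  # hypothetical target dataset repo
api.create_repo(repo_id, repo_type="dataset", exist_ok=True)

# Stream the source and apply the per-example transformation
ds = load_dataset("user/large-dataset", split="train", streaming=True)
ds = ds.map(lambda ex: {**ex, "text": ex["text"].lower()})  # placeholder transform

SHARD_SIZE = 10_000  # examples per uploaded file; tune to your data
it = iter(ds)
for shard_idx in itertools.count():
    batch = list(itertools.islice(it, SHARD_SIZE))
    if not batch:
        break
    # Convert the batch to a Parquet file held in memory
    table = pa.Table.from_pylist(batch)
    buf = io.BytesIO()
    pq.write_table(table, buf)
    buf.seek(0)
    # Upload this shard immediately, while streaming continues
    api.upload_file(
        path_or_fileobj=buf,
        path_in_repo=f"data/shard-{shard_idx:05d}.parquet",
        repo_id=repo_id,
        repo_type="dataset",
    )
```

Each shard lands in the repo as soon as it is written, so the upload keeps pace with the download and transformation instead of waiting for the whole dataset…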