Streaming for Saving

Hi,
I am looking for a way to download a large dataset, transform it, and then upload it to another location. Note that the transformation for each instance is independent of the others.

I can load the dataset in streaming mode and start the transformation, but I cannot find a way to write to the Hugging Face Hub (in batches) while the download and transformation are ongoing. Wondering if such a pattern exists.

2 Likes

In the case of the datasets library’s push_to_hub, I think you can’t upload the data unless all of it is available…
If the files are written out frequently, in the worst case there is a way to manually upload them one after another using HfApi…
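
A minimal sketch of that manual approach, assuming the transformed shards have already been written locally as Parquet files (the repo id and file names below are placeholders):

```python
from huggingface_hub import HfApi

api = HfApi()
# Placeholder repo id; create the dataset repo once up front.
repo_id = "username/my-transformed-dataset"
api.create_repo(repo_id, repo_type="dataset", exist_ok=True)

# Upload each finished shard as soon as it exists on disk.
for i, local_path in enumerate(["shard-00000.parquet", "shard-00001.parquet"]):
    api.upload_file(
        path_or_fileobj=local_path,
        path_in_repo=f"data/shard-{i:05d}.parquet",
        repo_id=repo_id,
        repo_type="dataset",
    )
```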

What if I just want to transform the dataset and then save it in a streaming way? If the dataset is large, CPU memory usage keeps growing while transforming the dataset. Or should I transform and save the data in parts instead of waiting until the entire dataset has been transformed?

1 Like

Yeah. It is now possible to save parquet files per shard or upload them incrementally.
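
For example, here is a sketch that combines streaming, a per-example transform, per-shard Parquet writes, and incremental uploads with HfApi. The dataset names, shard size, column name, and the transform itself are placeholders:

```python
from itertools import islice

import pyarrow as pa
import pyarrow.parquet as pq
from datasets import load_dataset
from huggingface_hub import HfApi


def transform(example):
    # Placeholder per-example transformation; assumes a "text" column.
    example["text"] = example["text"].lower()
    return example


api = HfApi()
repo_id = "username/my-transformed-dataset"  # placeholder
api.create_repo(repo_id, repo_type="dataset", exist_ok=True)

# Stream the source dataset so the full data is never materialized in memory.
ds = load_dataset("source/dataset", split="train", streaming=True).map(transform)

shard_size = 10_000  # examples per Parquet shard (placeholder)
it = iter(ds)
shard_idx = 0
while True:
    batch = list(islice(it, shard_size))
    if not batch:
        break
    local_path = f"shard-{shard_idx:05d}.parquet"
    pq.write_table(pa.Table.from_pylist(batch), local_path)
    # Push this shard immediately instead of waiting for the whole dataset.
    api.upload_file(
        path_or_fileobj=local_path,
        path_in_repo=f"data/{local_path}",
        repo_id=repo_id,
        repo_type="dataset",
    )
    shard_idx += 1
```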

1 Like