Hi,
I am looking for a way to download a large dataset, transform it, and then upload it to another location. Note that the transformation for each instance is independent of the others.
I can load the dataset in streaming mode and start the transformation, but I cannot find a way to write to the Hugging Face Hub (in batches) while the download and transformation are still ongoing. Wondering if such a pattern exists.
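Here is a rough sketch of what I have so far; the repo id, the "text" column, and the lowercasing are just placeholders for my actual data and transformation:

```python
from datasets import load_dataset

# Stream the source dataset so nothing is fully downloaded up front
# ("username/source-dataset" is a placeholder).
streamed = load_dataset("username/source-dataset", split="train", streaming=True)

# Per-instance transformation; each example is independent of the others.
def transform(example):
    example["text"] = example["text"].lower()
    return example

transformed = streamed.map(transform)

# Iterating downloads and transforms on the fly, but I don't see where
# the batched upload back to the Hub would fit in.
for example in transformed:
    pass
```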
In the case of the datasets library’s push_to_hub, I think you can’t upload the data unless all of it is available…
If the files are written out frequently, in the worst case there is a way to manually upload them one after another using HfApi…
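For example, a minimal sketch of that manual approach, assuming the shard files have already been written locally (the repo id and file names are made up):

```python
from huggingface_hub import HfApi

api = HfApi()
repo_id = "username/target-dataset"  # placeholder target repo

# Make sure the target dataset repo exists (no-op if it already does).
api.create_repo(repo_id, repo_type="dataset", exist_ok=True)

# Hypothetical shard files produced while the stream is being processed.
for path in ["shard-00000.parquet", "shard-00001.parquet"]:
    api.upload_file(
        path_or_fileobj=path,
        path_in_repo=f"data/{path}",
        repo_id=repo_id,
        repo_type="dataset",
    )
```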
What if I just want to transform the dataset and then save it in a streaming way? If the dataset is large, the memory used by the process keeps growing while the dataset is being processed. Or should I process and save the data in parts instead of waiting until the entire dataset has been processed?
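Something like this is what I have in mind, i.e. writing out one shard at a time instead of keeping everything in memory (the shard size, file names, and the no-op transform are placeholders):

```python
import itertools
from datasets import Dataset, load_dataset

streamed = load_dataset("username/source-dataset", split="train", streaming=True)
transformed = streamed.map(lambda example: example)  # placeholder transform

shard_size = 10_000  # made-up number of examples per shard
it = iter(transformed)
shard_index = 0
while True:
    # Materialize only one shard in memory at a time, write it out, discard it.
    chunk = list(itertools.islice(it, shard_size))
    if not chunk:
        break
    Dataset.from_list(chunk).to_parquet(f"shard-{shard_index:05d}.parquet")
    shard_index += 1
```

The resulting shard files could then be uploaded one by one with HfApi as in the example above.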