Uploading a dataset that doesn't fit in memory to the HF hub

What is the recommended way to build and upload a large dataset that doesn’t fit in memory to the HF hub?

Is there any way of stringing together existing dataset builder / constructor methods and push_to_hub to do this straightforwardly?

Feel free to use load_dataset or Dataset.from_generator to get a Dataset object from your large data source. It writes the data to disk, so it can load datasets bigger than memory.
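
For example, a minimal sketch with a toy generator (the generator body is just a placeholder for your real data source):

```python
from datasets import Dataset

def gen():
    # Replace this with reads from your actual data source
    # (files, a database, an API, ...); each yield is one example.
    for i in range(1_000_000):
        yield {"id": i, "text": f"example {i}"}

# from_generator writes the examples to an Arrow file on disk,
# so the whole dataset never has to fit in memory at once.
ds = Dataset.from_generator(gen)
```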

Then push_to_hub() uploads the dataset in shards of about 500MB each by default, so you can upload a dataset that doesn’t fit in memory as well.
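
Continuing the sketch above (the repo id is a placeholder, and max_shard_size just makes the default shard size explicit):

```python
# Uploads the on-disk dataset to the Hub in shards (~500MB each by default),
# so the upload also works for datasets larger than memory.
ds.push_to_hub("username/my-large-dataset", max_shard_size="500MB")
```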

Thanks for the response!

What if the dataset is also larger than what memory mapping can handle? (Assuming this is what you mean by writing to disk in the from_generator case.)

Memory mapping can handle datasets as long as they fit on your disk :wink:

oh cool! thanks
