What is the recommended way to build and upload a large dataset that doesn’t fit in memory to the HF hub?
Is there any way of stringing together existing dataset builder / constructor methods and push_to_hub to do this straightforwardly?
Feel free to use load_dataset or Dataset.from_generator to get a Dataset object from your large data source. It writes the data to disk as it goes, so it can build datasets bigger than memory.
Then push_to_hub() uploads the dataset in shards of about 500MB each by default, so you can upload a dataset that doesn't fit in memory as well.
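A minimal sketch of that flow, assuming a toy generator and a placeholder repo name (`username/my-large-dataset`):

```python
from datasets import Dataset

# Hypothetical generator: yields one example at a time,
# so the full dataset never has to live in memory at once.
def gen():
    for i in range(1_000_000):
        yield {"id": i, "text": f"example {i}"}

# from_generator writes the examples to Arrow files on disk as it builds
# the dataset, so the resulting Dataset can be larger than available RAM.
ds = Dataset.from_generator(gen)

# push_to_hub uploads the dataset in shards (~500MB each by default);
# max_shard_size lets you tune the shard size if needed.
ds.push_to_hub("username/my-large-dataset", max_shard_size="500MB")
```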
Thanks for the response!
What if the dataset is also larger than what memory-mapping can handle? (Assuming that's what you mean by writing to disk in the from_generator case.)
Memory mapping can handle datasets as long as they fit on your disk
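For illustration, a small sketch of what that looks like in practice, assuming a dataset previously saved to a local path with save_to_disk:

```python
from datasets import load_from_disk

# Hypothetical path to a dataset saved earlier with save_to_disk.
ds = load_from_disk("path/to/large_dataset")

# Rows are read on demand from the memory-mapped Arrow files,
# so indexing works even if the dataset is much larger than RAM.
print(len(ds))
print(ds[0])
```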
oh cool! thanks