What is the recommended way to build and upload a large dataset that doesn’t fit in memory to the HF hub?
Is there any way of stringing together existing dataset builder / constructor methods and push_to_hub to do this straightforwardly?
You can use load_dataset or Dataset.from_generator to get a Dataset object from your large data source. Both write the data to disk (as memory-mapped Arrow files), so they can load datasets bigger than memory.
Then push_to_hub() uploads the data in shards of 500MB each by default, so you can also upload a dataset that doesn't fit in memory.
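Here is a minimal sketch of that workflow. The generator, source file, and repo name are hypothetical placeholders; `Dataset.from_generator` and `push_to_hub` (with `max_shard_size`) are the actual `datasets` APIs being described above.

```python
from datasets import Dataset

def gen():
    # Yield one example at a time so the full dataset never has to fit in memory.
    with open("large_corpus.txt", encoding="utf-8") as f:  # hypothetical source file
        for line in f:
            yield {"text": line.strip()}

# from_generator writes Arrow files to the local cache and memory-maps them,
# so the resulting Dataset can be larger than RAM.
ds = Dataset.from_generator(gen)

# push_to_hub uploads the data in shards (500MB each by default),
# so the upload step also avoids loading everything into memory.
ds.push_to_hub("username/my-large-dataset", max_shard_size="500MB")
```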
Thanks for the response!
What if the dataset is also larger than what memory-mapping can handle? (Assuming that's what you mean by writing to disk in the from_generator case.)
Memory mapping can handle datasets as long as they fit on your disk.
oh cool! thanks