Recommended max size of dataset?

lhoestq · March 11, 2025, 3:22pm

save_to_disk / load_from_disk can handle big datasets, and you can pass num_proc= to save_to_disk to speed it up with multiprocessing.

Performance depends on your environment, though, so I'd still advise trying smaller datasets first to see how it scales.
