Recommended max size of dataset?

save_to_disk / load_from_disk can handle big datasets. You can even use multiprocessing by passing num_proc= to save_to_disk to speed up the write.

That said, performance depends on your environment, so I'd still advise trying smaller datasets first to see how it scales.
