`save_to_disk` / `load_from_disk` can handle big datasets, and you can even pass `num_proc=` to `save_to_disk` to speed it up with multiprocessing. Performance can depend on your environment though, so I'd still advise you to try on smaller datasets first and see how it scales.