Hi! Sure, the `datasets` library is designed to support the processing of large-scale datasets. Datasets are loaded from your disk using memory mapping, so they don't fill up your RAM. You can parallelize your data processing with `map`, since it supports multiprocessing. Then you can save your processed dataset with `save_to_disk` and reload it later with `load_from_disk`:
```python
from datasets import load_dataset, load_from_disk

dataset = load_dataset(...)
dataset = dataset.map(..., num_proc=num_processes)
dataset.save_to_disk("path/to/save/directory")

# later, e.g. in another script or job
dataset = load_from_disk("path/to/save/directory")
```
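To make the `map` call a bit more concrete, here is a minimal sketch (the dataset name, column name, and processing function are just an example, not from your setup):

```python
from datasets import load_dataset

# "imdb" is just an example dataset with a "text" column
dataset = load_dataset("imdb", split="train")

# with batched=True the function receives a batch (a dict of lists)
# and returns the updated batch
def lowercase_text(batch):
    batch["text"] = [t.lower() for t in batch["text"]]
    return batch

# num_proc=4 runs the function in 4 worker processes,
# each working on its own shard of the memory-mapped data
dataset = dataset.map(lowercase_text, batched=True, num_proc=4)
```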
A few things worth noting:
- You can specify a `cache_dir` parameter in `load_dataset` so that you can store the raw + prepared data wherever you want, and delete them later to save space if needed (see the first sketch after this list).
- If you are working on a cluster with a virtual filesystem (which is often the case for distributed training), you may want to make sure that memory mapping works efficiently there. There is a discussion about this here; we are still investigating why some virtual filesystems make memory-mapped reads slow. A quick way to sanity-check read speed is sketched after this list.
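For the `cache_dir` point, a minimal sketch (the dataset name and directory are placeholders):

```python
from datasets import load_dataset

# both the raw downloaded files and the prepared Arrow files are stored
# under this directory, so you can delete it later to reclaim space
dataset = load_dataset("imdb", cache_dir="/path/to/my/cache")
```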
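And for the last point, one rough way to check that memory mapping is fast enough on your filesystem is to time a full pass over the saved dataset (the path is a placeholder):

```python
import time

from datasets import load_from_disk

dataset = load_from_disk("path/to/save/directory")

start = time.time()
for _ in dataset:  # each row is read from the memory-mapped Arrow file on disk
    pass
print(f"iterated over {len(dataset)} rows in {time.time() - start:.1f}s")
```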