Re: lazy data loading. After a talk with @valhalla, there is actually no need for special code to run "high-resource" language models. The `datasets` library never loads the whole dataset into RAM when applying the `.map()` function. It only loads `writer_batch_size` samples into RAM at a time when using `.map()` (see docs here) and then saves the mapped batch to disk. You can increase or decrease the `writer_batch_size` argument of `.map(...)` to best fit your needs.
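A minimal sketch of what this could look like (the dataset name, column name, and preprocessing function below are placeholders, not from the original post):

```python
from datasets import load_dataset

# Placeholder dataset; substitute your own.
dataset = load_dataset("common_voice", "tr", split="train")

def preprocess(batch):
    # Placeholder preprocessing; replace with your own feature extraction.
    batch["input_length"] = [len(s) for s in batch["sentence"]]
    return batch

# Only `writer_batch_size` processed samples are buffered in RAM before the
# batch is flushed to the on-disk cache. Lower it to reduce RAM usage,
# raise it to reduce the number of disk writes.
dataset = dataset.map(
    preprocess,
    batched=True,
    batch_size=100,          # how many samples the function sees per call
    writer_batch_size=100,   # how many processed samples are kept in RAM
)
```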
This means that every `.map(...)` call saves a significant amount of data to disk: if you call `.map(...)` three times on a dataset of size 100GB, it will cache 300GB of data.
Therefore you can do two things to reduce the required amount of hard drive storage (a short sketch of both follows the list):

- Remove the cache regularly. This can be as easy as running `rm -r ~/.cache/huggingface/datasets` to remove all cached datasets, or making use of this convenient function: https://huggingface.co/docs/datasets/package_reference/main_classes.html?highlight=cache#datasets.Dataset.cleanup_cache_files which only removes the cache files of a specific dataset.
- Try to use as few `.map(...)` operations as possible.
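A rough sketch of both ideas together (again, the dataset, columns, and helper functions are only illustrative assumptions):

```python
from datasets import load_dataset

dataset = load_dataset("common_voice", "tr", split="train")  # placeholder dataset

def lowercase(batch):
    batch["text"] = [s.lower() for s in batch["sentence"]]
    return batch

def count_chars(batch):
    batch["num_chars"] = [len(t) for t in batch["text"]]
    return batch

def combined(batch):
    # Doing both steps inside one function means only one cached copy of the
    # dataset is written, instead of one copy per .map(...) call.
    batch = lowercase(batch)
    batch = count_chars(batch)
    return batch

dataset = dataset.map(combined, batched=True)

# Remove stale cache files for this specific dataset only
# (the currently used cache file is kept).
removed = dataset.cleanup_cache_files()
print(f"Removed {removed} cache file(s)")
```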