Pipeline with custom dataset tokenizer: when to save/load manually

The caching should work across sessions; normally you don't have to use save_to_disk. The cache is indexed by a hash of the operations performed on the dataset: if a new, independent session performs the same operations, it will use the cache instead of recomputing them. If you change anything in the operations performed on the dataset, they will be recomputed instead of loaded from the cache.
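For example (a minimal sketch — the `imdb` dataset and `bert-base-uncased` tokenizer are just illustrative choices, not from this thread):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

dataset = load_dataset("imdb", split="train")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True)

# The first run computes the tokenization and writes a cache file.
# Re-running this exact call later -- even from a fresh Python
# session -- loads the cached result instead of recomputing it.
tokenized = dataset.map(tokenize, batched=True)
```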

I will add a section on the hashing mechanism to the docs when I have some time (no ETA), but basically the hash used to store the dataset is a complete pickle dump of all the arguments you provide to the processing function at each step (including the function passed to map), so any change is detected and the operation is recomputed instead of using the cache. If all the arguments and inputs are identical, the hash is the same (whether or not it's the same session) and the cache file is used if it is found.
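You can see the effect directly (a sketch using `datasets.fingerprint.Hasher`, which is the internal helper behind this mechanism and may change between versions). Here the two functions differ only in `max_length`, which is enough to change the hash and trigger a recompute:

```python
from datasets.fingerprint import Hasher

def tokenize_128(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

def tokenize_256(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

# Different pickle dumps -> different hashes -> separate cache files.
print(Hasher.hash(tokenize_128) == Hasher.hash(tokenize_256))  # False

dataset.map(tokenize_128, batched=True)  # computed and cached
dataset.map(tokenize_128, batched=True)  # same hash: cache hit, no recompute
dataset.map(tokenize_256, batched=True)  # new hash: recomputed
```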

save_to_disk is provided as a special utility, mostly for people who preprocess a dataset on one machine that has internet access and then want to use it on a cluster without internet access (which therefore cannot download the dataset files).
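A sketch of that offline workflow (the directory name is hypothetical):

```python
# On the machine with internet access, after preprocessing:
tokenized.save_to_disk("tokenized_imdb")  # writes Arrow files + metadata

# On the offline cluster, after copying the directory over:
from datasets import load_from_disk
tokenized = load_from_disk("tokenized_imdb")
```

In every other case, relying on the automatic cache is simpler and avoids duplicating the data on disk.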
