Pipeline with custom dataset tokenizer: when to save/load manually

The caching should work across sessions; normally you don't have to use save_to_disk. The cache is indexed by a hash of the operations performed on the dataset: if a new, independent session performs the same operations, it will use the cache instead of recomputing them. If you change anything in the operations performed on the dataset, they will be recomputed instead of loaded from the cache.
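For example (a minimal sketch — the `imdb` dataset and `bert-base-uncased` tokenizer are just illustrative choices, not from this thread):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

dataset = load_dataset("imdb", split="train")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True)

# The first run computes the tokenization and writes a cache file.
# Re-running this exact call later -- even from a fresh Python
# session -- loads the cached result instead of recomputing it.
tokenized = dataset.map(tokenize, batched=True)
```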

I will add a section on the hashing mechanism to the docs when I have some time (no ETA), but basically the hash used to store the dataset is a complete pickle dump of all the arguments you provide to the processing function at each step (including the function passed to map), so any change is detected and the operation is recomputed instead of using the cache. If all the arguments and inputs are identical, the hash is the same (whether or not it's the same session) and the cache file is used if it is found.
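You can see the effect directly (a sketch using `datasets.fingerprint.Hasher`, which is the internal helper behind this mechanism and may change between versions). Here the two functions differ only in `max_length`, which is enough to change the hash and trigger a recompute:

```python
from datasets.fingerprint import Hasher

def tokenize_128(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

def tokenize_256(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

# Different pickle dumps -> different hashes -> separate cache files.
print(Hasher.hash(tokenize_128) == Hasher.hash(tokenize_256))  # False

dataset.map(tokenize_128, batched=True)  # computed and cached
dataset.map(tokenize_128, batched=True)  # same hash: cache hit, no recompute
dataset.map(tokenize_256, batched=True)  # new hash: recomputed
```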

save_to_disk is provided as a special utility, mostly for people who preprocess a dataset on one machine that has internet access and then want to use it on a cluster without internet access (which therefore cannot download the dataset files).
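A sketch of that offline workflow (the directory name is hypothetical):

```python
# On the machine with internet access, after preprocessing:
tokenized.save_to_disk("tokenized_imdb")  # writes Arrow files + metadata

# On the offline cluster, after copying the directory over:
from datasets import load_from_disk
tokenized = load_from_disk("tokenized_imdb")
```

In every other case, relying on the automatic cache is simpler and avoids duplicating the data on disk.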
