`load_from_cache_file` not working

Hi,

I’m using the datasets library to load the popular medical dataset MIMIC 3 (only the notes) and create a Hugging Face dataset for language modelling with BERT. My script creates a custom dataset, tokenizes it, and writes the result to the cache file. I set `load_from_cache_file` to True in the dataset's `map` call. The tokenization takes a while because the dataset is moderately large.

When I run the script again, I would expect the dataset to be loaded quickly from the cache file so the script can move on to the next step. However, even though I get a warning saying `Reusing dataset text`, the tokenization runs again. This is a problem because I intend to train multiple models and measure the training time as part of a project.

However, I did notice that the second run takes less time. For example, the first run, without any cache file, took 25 minutes. The second run, with the cache file available and `load_from_cache_file` set to True, took 13 minutes. Is this expected behavior?
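The timings above are for the whole step. A minimal way to time just that step in isolation might look like the sketch below; `timed` is a hypothetical helper and not part of the datasets library, and the `sum` call is only a placeholder for the actual `map` call.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    # Record the wall-clock duration of the enclosed block under `label`.
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}
with timed("tokenize", timings):
    # Placeholder workload; the dataset's map call would go here.
    sum(range(100_000))

print(f"tokenize took {timings['tokenize']:.3f} s")
```

Running this once with an empty cache and once with the cache present gives two numbers that can be compared directly.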

I’m using version 1.6.1 of the datasets library. I would like some help in resolving this problem. Please let me know if more info is needed.