I’m using the datasets library to load the popular medical dataset MIMIC 3 (only the notes) and create a Hugging Face dataset to get it ready for language modelling with BERT. I have a script that creates a custom dataset, tokenizes it, and writes it to a cache file. I set load_from_cache_file in the map function of the dataset to True. The tokenization process takes a while because of the moderately large dataset size.
When I run the script again, I expect the dataset to be loaded quickly from the cache file so the script can move on to the next step. However, even though I get a warning saying "Reusing dataset text", the tokenization process runs again. This is a problem because I intend to train multiple models and measure the training time as part of a project.
However, I did notice that the second run takes less time. For example, the first run, without any cache file, took 25 minutes. The second run, with the cache file available and load_from_cache_file set to True, took 13 minutes. Is this expected behavior?
I’m using version 1.6.1 of the datasets library. I would like some help resolving this problem. Please let me know if more info is needed.