Hi,
I’m using the `datasets` library to load the popular medical dataset MIMIC 3 (only the notes) and create a Hugging Face dataset ready for language modelling with BERT. I have a script that creates a custom dataset, tokenizes it, and writes the result to a cache file. I set `load_from_cache_file` to `True` in the dataset's `map` function. The tokenization takes a while because the dataset is moderately large.
When I run the script again, I would expect the dataset to be loaded quickly from the cache file so the script can move on to the next step. However, even though I get a warning saying `Reusing dataset text`, the tokenization runs again. This is a problem because I intend to train multiple models and measure the training time as part of a project.
I did notice, however, that the second run takes less time. The first run, without any cache file, took 25 minutes. The second run, with the cache file available and `load_from_cache_file` set to `True`, took 13 minutes. Is this expected behavior?
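For reference, here is a minimal sketch of how the two runs could be timed in isolation. The helper names `timed` and `fake_tokenize` are hypothetical, and `fake_tokenize` is only a stand-in: in the real script the timed call would presumably be the `map` call with `load_from_cache_file=True`.

```python
import time
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")


def timed(step: Callable[[], T]) -> Tuple[T, float]:
    """Run `step` once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = step()
    return result, time.perf_counter() - start


def fake_tokenize(notes: List[str]) -> List[List[str]]:
    # Hypothetical stand-in for the real tokenization step. In the actual
    # script this would be the dataset's map call, not a whitespace split.
    return [note.lower().split() for note in notes]


if __name__ == "__main__":
    notes = ["Patient admitted with chest pain", "No acute distress"]
    tokens, seconds = timed(lambda: fake_tokenize(notes))
    print(f"tokenization took {seconds:.4f} s for {len(tokens)} notes")
```

Timing only the tokenization step, rather than the whole script, should make it easier to tell whether the 13 minutes are spent re-tokenizing or somewhere else.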
I’m using version 1.6.1 of the `datasets` library. I would like some help resolving this problem. Please let me know if more info is needed.