`load_from_cache_file` not working


I’m using the datasets library to load the popular medical dataset MIMIC-III (only the notes) and create a Hugging Face dataset for language modelling with BERT. I have a script that creates a custom dataset, tokenizes it, and writes the result to the cache file. I set `load_from_cache_file=True` in the dataset’s `map` call. Tokenization takes a while because the dataset is moderately large.

When I run the script again, I would expect the dataset to be loaded quickly from the cache file so the script can move on to the next step. However, even though I get a warning saying "Reusing dataset text", the tokenization process runs again. This is a problem because I intend to train multiple models and measure the training time as part of a project.

I did notice that the second run takes less time. The first run, without any cache file, took 25 minutes. The second run, with the cache file available and `load_from_cache_file` set to True, took 13 minutes. Is this expected behavior?

I’m using version 1.6.1 of the datasets library. I would like some help in resolving this problem. Please let me know if more info is needed.

Hi !

Have you tried the latest release, 1.6.2? Did you change any of the parameters passed to `map` between the two runs?
If all the parameters (the tokenization function, the batch size, etc.) are the same, then it will reload the tokenized text from the cache.

Also, it would be helpful if you could share the code you’re using :)