`load_from_cache_file` not working

sudarshan85 · May 4, 2021, 4:34pm

Hi,

I’m using the datasets library to load the popular medical dataset MIMIC 3 (notes only) and create a Hugging Face dataset for language modelling with BERT. I have a script that creates a custom dataset, tokenizes it, and writes it to the cache file. I set `load_from_cache_file` to True in the dataset's `map` call. The tokenization takes a while because the dataset is moderately large.

When I run the script again, I would expect the dataset to be loaded quickly from the cache file so the script can move on to the next step. However, even though I get a warning saying "Reusing dataset text", the tokenization runs again. This is a problem because I intend to train multiple models and measure the training time as part of a project.

I did notice that the second run takes less time. For example, the first run, without any cache file, took 25 minutes. The second run, with the cache file available and `load_from_cache_file` set to True, took 13 minutes. Is this expected behavior?

I’m using version 1.6.1 of the datasets library. I would appreciate help resolving this problem. Please let me know if more info is needed.

lhoestq · May 10, 2021, 1:09pm

Hi!

Have you tried the latest release, 1.6.2? Have you changed the parameters passed to `map` between the two runs?
If the parameters (the tokenization function, the batch size, etc.) are the same, then it will reload the tokenized text from the cache.

Also, it would be helpful if you could share the code that you’re using.
