When is the cache written to file?

Hi everyone,

I am having a hard time trying to understand some underlying mechanisms regarding the cache.
For some reason, the same operations are re-executed when the script is run twice; I have checked that each Dataset in my DatasetDict has the right cache_files, but it looks like they are not saved when I terminate the script. A few questions that might help me understand the issue:

  • When are the files written in the cache?
  • Is the cache configuration stateful across different runs? e.g., if I disable the cache in a script, is it still disabled in other scripts or in another run of the same script with that line commented out?

I already checked the documentation, but I didn’t find much in this regard.

Thanks a lot,
D.

Hi!

Every method that has cache_file_name as a parameter in its signature writes a cache file to disk, and the cache configuration is not stateful.
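
For example, the caching switch only applies to the current process; it is not remembered across scripts or runs. A minimal sketch, using the disable_caching/enable_caching helpers from datasets:
    from datasets import disable_caching, enable_caching, is_caching_enabled
    
    # Disabling the cache here only affects this run/process;
    # another script (or a later run of this one) starts with caching enabled again.
    disable_caching()
    print(is_caching_enabled())  # False
    
    enable_caching()
    print(is_caching_enabled())  # True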

Regarding repeating a cached operation, this can happen if:

  • the operation cannot be cached (e.g., a map transform references a non-picklable object). You can use the Hasher object to check that this is not the case:
    from datasets.fingerprint import Hasher
    
    # dummy `map` transform
    def transform(ex):
        return {**ex, "a": 2}
    
    h = Hasher()
    # returns a hash string; raises an error if `transform` can’t be pickled
    # (and therefore can’t be cached)
    h.hash(transform)
    
  • the operation was executed on an in-memory dataset (the .cache_files attribute is empty for in-memory datasets), in which case the cache file is written to a temporary directory and can be reused in the same session but not across different sessions (temporary directories are deleted on exit).

In both of these scenarios, the solution is to specify a cache_file_name, either to make the cache file permanent or to avoid relying on the “non-deterministic” hashing.
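
For the second case, a minimal sketch of what I mean (Dataset.from_dict creates an in-memory dataset, and the path below is just an example):
    from datasets import Dataset
    
    ds = Dataset.from_dict({"img_id": [1, 2, 3]})
    print(ds.cache_files)  # [] -> in-memory dataset, map results go to a temp dir
    
    # Passing an explicit cache_file_name writes the Arrow file to that path,
    # so it survives the session and can be reused on the next run.
    ds = ds.map(
        lambda ex: {"doubled": ex["img_id"] * 2},
        cache_file_name="/tmp/my_mapped_dataset.arrow",
    )
    print(ds.cache_files)  # should now point at /tmp/my_mapped_dataset.arrow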

Hello,
thanks for your reply.

The operation can be cached: the Hasher returns a hash for the function, and the dataset has its cache_files attribute correctly set. Debugging some more, I found that the function returns different hashes between runs.
This is the snippet causing the problem:

  map_params = {
      "function": lambda x: {"x": self.transform_func(x["img"])},
      "writer_batch_size": 100,
      "num_proc": 1,
  }

  self.data[f"task_{self.task_ind}"] = self.data[f"task_{self.task_ind}"].map(
      **map_params
  )

and transform_func is instantiated from the following config:

transform_func:
  _target_: torchvision.transforms.Compose
  transforms:
    - _target_: torchvision.transforms.ToTensor
    - _target_: torchvision.transforms.Normalize
      mean: [0.485, 0.456, 0.406]
      std: [0.229, 0.224, 0.225]
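
For reference, writing it out by hand, the config above should resolve to roughly this object:
    from torchvision import transforms
    
    transform_func = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])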

What may be happening? The runs have the same fixed seed.

EDIT: I can’t reproduce the problem in a minimal setting: if I just load the same function twice and get its hash, it works fine. I’m not sure where the problem originates at this point.

Found the issue:
there was an attribute of type set in the class referred to by self in the snippet above. Since a set’s iteration order (and therefore its pickled representation) can change between interpreter runs, it caused the hash of the whole module to differ across runs, and therefore the hash of self.transform_func differed as well.
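
In case it helps someone else, here is a minimal sketch of the effect; my understanding is that the set’s iteration order (and therefore its pickled bytes) can change between interpreter runs because of string hash randomization:
    from datasets.fingerprint import Hasher
    
    class Wrapper:
        def __init__(self):
            # a set of strings: it is pickled in iteration order, which may
            # change between runs (PYTHONHASHSEED randomization)
            self.allowed = {"cat", "dog", "bird"}
    
        def transform(self, ex):
            return {**ex, "keep": ex["label"] in self.allowed}
    
    # Run the script twice: this hash may differ between runs, which
    # invalidates the cache of any map() whose function captures the object.
    print(Hasher.hash(Wrapper().transform))
    
    # One possible fix: store a sorted tuple instead of a set, so the
    # pickled representation is deterministic.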

Thanks for your help.