When is the cache written to file?

Hi everyone,

I am having a hard time trying to understand some underlying mechanisms regarding the cache.
For some reason, the same operations are re-executed when the script is run twice; I have checked that each Dataset in my DatasetDict has the right cache_files, but it looks like they are not saved when I terminate the script. A few questions that might help me understand the issue:

  • When are the files written in the cache?
  • Is the cache configuration stateful across different runs? e.g., if I disable the cache in a script, is it still disabled in other scripts or in another run of the same script with that line commented out?

I already checked the documentation, but I didn’t find much in this regard.

Thanks a lot,
D.

Hi!

Every method that has cache_file_name as a parameter in its signature writes a cache file to disk, and the cache configuration is not stateful.
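
For example, the caching switch only applies to the current process; it is not remembered across scripts or runs. A minimal sketch, using the disable_caching/enable_caching helpers from datasets:
    from datasets import disable_caching, enable_caching, is_caching_enabled
    
    # Disabling the cache here only affects this run/process;
    # another script (or a later run of this one) starts with caching enabled again.
    disable_caching()
    print(is_caching_enabled())  # False
    
    enable_caching()
    print(is_caching_enabled())  # True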

Regarding repeating a cached operation, this can happen if:

  • the operation cannot be cached (e.g., a map transform references a non-picklable object). You can use the Hasher object to check that this is not the case:
    from datasets.fingerprint import Hasher
    
    # dummy `map` transform
    def transform(ex):
        return {**ex, "a": 2}
    
    h = Hasher()
    # returns a hash string; raises an error if `transform` can’t be pickled
    # (and therefore can’t be cached)
    h.hash(transform)
    
  • the operation was executed on an in-memory dataset (the .cache_files attribute is empty for in-memory datasets), in which case the cache file is written to a temporary directory and can be reused in the same session but not across different sessions (temporary directories are deleted on exit).

In both of these scenarios, the solution is to specify a cache_file_name, either to make the cache file permanent or to avoid relying on the “non-deterministic” hashing.
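
For the second case, a minimal sketch of what I mean (Dataset.from_dict creates an in-memory dataset, and the path below is just an example):
    from datasets import Dataset
    
    ds = Dataset.from_dict({"img_id": [1, 2, 3]})
    print(ds.cache_files)  # [] -> in-memory dataset, map results go to a temp dir
    
    # Passing an explicit cache_file_name writes the Arrow file to that path,
    # so it survives the session and can be reused on the next run.
    ds = ds.map(
        lambda ex: {"doubled": ex["img_id"] * 2},
        cache_file_name="/tmp/my_mapped_dataset.arrow",
    )
    print(ds.cache_files)  # should now point at /tmp/my_mapped_dataset.arrow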

Hello,
thanks for your reply.

The operation can be cached: the Hasher returns a hash for the function, and the dataset has its cache_files attribute correctly set. Debugging some more, I found that the function returns different hashes between runs.
This is the snippet causing the problem:

  map_params = {
      "function": lambda x: {"x": self.transform_func(x["img"])},
      "writer_batch_size": 100,
      "num_proc": 1,
  }

  self.data[f"task_{self.task_ind}"] = self.data[f"task_{self.task_ind}"].map(
      **map_params
  )

and transform_func is instantiated from the following config:

transform_func:
  _target_: torchvision.transforms.Compose
  transforms:
    - _target_: torchvision.transforms.ToTensor
    - _target_: torchvision.transforms.Normalize
      mean: [0.485, 0.456, 0.406]
      std: [0.229, 0.224, 0.225]
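
For reference, writing it out by hand, the config above should resolve to roughly this object:
    from torchvision import transforms
    
    transform_func = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])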

What may be happening? The runs have the same fixed seed.

EDIT: I can’t reproduce the problem in a minimal setting: if I just load the same function twice and get its hash, it works fine. I’m not sure where the problem originates at this point.

Found the issue:
there was an attribute of type set in the class referred to by self in the snippet above. Since a set’s iteration order (and therefore its pickled representation) can change between interpreter runs, it caused the hash of the whole module to differ across runs, and therefore the hash of self.transform_func differed as well.
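
In case it helps someone else, here is a minimal sketch of the effect; my understanding is that the set’s iteration order (and therefore its pickled bytes) can change between interpreter runs because of string hash randomization:
    from datasets.fingerprint import Hasher
    
    class Wrapper:
        def __init__(self):
            # a set of strings: it is pickled in iteration order, which may
            # change between runs (PYTHONHASHSEED randomization)
            self.allowed = {"cat", "dog", "bird"}
    
        def transform(self, ex):
            return {**ex, "keep": ex["label"] in self.allowed}
    
    # Run the script twice: this hash may differ between runs, which
    # invalidates the cache of any map() whose function captures the object.
    print(Hasher.hash(Wrapper().transform))
    
    # One possible fix: store a sorted tuple instead of a set, so the
    # pickled representation is deterministic.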

Thanks for your help.