Sharing the cache folder

Hi All,

Is it safe to share the cache dir across multiple processes and/or users?
We have a number of users running training jobs across a cluster with several dozen GPUs. We often use the same datasets, or datasets derived by transforming the same base dataset.

In this setting, is it OK to configure the cache dirs to point to the same location for all of the users?
What happens if one person starts a preprocessing step that someone else is already running (and that would produce the same set of files in the cache)?
Another question (though I think I’d already have seen problems if this wasn’t safe): what happens if several of my own processes try the same transforms (i.e. want to write to the same cache)?

Hi! It should be safe as long as your system supports file locking. We use the filelock library, a system-level locking mechanism, when writing cache files. Regarding your additional questions, a cache file in Datasets is:

  • Named uniquely. Each transform modifies a dataset’s fingerprint (the new fingerprint is computed from the previous fingerprint, the transform, and its params), and we use this fingerprint to give unique names to the generated cache files.
  • Generated only once and re-used otherwise (thanks to “fingerprint” and file locking).
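
To make the two points above concrete, here is a rough sketch of the general pattern, not the actual Datasets implementation. It uses only the Python standard library and assumes a POSIX system where fcntl.flock works on the cache filesystem (some network filesystems do not honor it). The names update_fingerprint, cached_transform, and scale, the JSON file format, and the "cache-<fingerprint>" naming are all hypothetical, chosen for illustration.

```python
import fcntl
import hashlib
import json
import os
import tempfile


def update_fingerprint(previous_fingerprint, transform_name, params):
    """Hypothetical helper: hash the previous fingerprint, transform name, and params."""
    payload = json.dumps([previous_fingerprint, transform_name, params], sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def cached_transform(cache_dir, previous_fingerprint, transform, params, rows):
    """Return (cache_path, was_computed); the cache file is written at most once."""
    fingerprint = update_fingerprint(previous_fingerprint, transform.__name__, params)
    cache_path = os.path.join(cache_dir, f"cache-{fingerprint}.json")
    lock_path = cache_path + ".lock"

    with open(lock_path, "w") as lock_file:
        # Blocks until any other process holding the lock releases it.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            if os.path.exists(cache_path):
                # Someone already produced this exact result: reuse it.
                return cache_path, False
            # Write to a temporary name, then rename, so readers never
            # see a half-written cache file.
            tmp_path = cache_path + ".tmp"
            with open(tmp_path, "w") as f:
                json.dump([transform(row, **params) for row in rows], f)
            os.replace(tmp_path, cache_path)
            return cache_path, True
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


def scale(row, factor):
    """Toy transform used only for this example."""
    return row * factor


if __name__ == "__main__":
    shared_cache = tempfile.mkdtemp()  # stand-in for a shared cache dir
    first = cached_transform(shared_cache, "base", scale, {"factor": 2}, [1, 2, 3])
    second = cached_transform(shared_cache, "base", scale, {"factor": 2}, [1, 2, 3])
    print(first, second)
```

In this sketch, a second caller with the same previous fingerprint, transform, and params waits on the lock, then finds the file already present and reuses it. A caller with different params gets a different fingerprint and therefore a different file name, so the two never collide.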