Accelerate deepspeed cache mount

I am working on a SLURM cluster where my native “/home/user-id” directory is not writable. When I launch my training script via accelerate, the following error occurs:

File "/work/user-id/justizscrap-bert/.env/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 55, in __init__
    cache_manager = AutotuneCacheManager(cache_key)
  File "/work/user-id/justizscrap-bert/.env/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 55, in __init__
    os.makedirs(self.cache_dir, exist_ok=True)
  File "/sw/env/gcc-10.3.0_openmpi-4.1.1/pkgsrc/2022Q1/lib/python3.9/os.py", line 215, in makedirs
    os.makedirs(self.cache_dir, exist_ok=True)
      File "/sw/env/gcc-10.3.0_openmpi-4.1.1/pkgsrc/2022Q1/lib/python3.9/os.py", line 215, in makedirs
makedirs(head, exist_ok=exist_ok)
      File "/sw/env/gcc-10.3.0_openmpi-4.1.1/pkgsrc/2022Q1/lib/python3.9/os.py", line 225, in makedirs
makedirs(head, exist_ok=exist_ok)
      File "/sw/env/gcc-10.3.0_openmpi-4.1.1/pkgsrc/2022Q1/lib/python3.9/os.py", line 225, in makedirs
mkdir(name, mode)
    OSErrormkdir(name, mode): 
[Errno 30] Read-only file system: '/home/user-id/.triton'OSError

I know that I’ll have to specify some environment variable so that the DeepSpeed cache is placed on the writable “/work/user-id” directory. The working directory as well as the environment are already located on the writable “/work/user-id” directory.

I have already exported these variables, to no avail (some of them are redundant, but they are there as a fail-safe):

export HF_DATASETS_CACHE=/work/user-id/justizscrap-bert/
export TORCH_HOME=/work/user-id/justizscrap-bert/
export HF_HOME=/work/user-id/justizscrap-bert/
export TRANSFORMERS_CACHE=/work/user-id/justizscrap-bert/
export HUGGINGFACE_HUB_CACHE=/work/user-id/justizscrap-bert/
export XDG_CACHE_HOME=/work/user-id/justizscrap-bert/huggingface
export TORCH_EXTENSIONS_DIR=/work/user-id/justizscrap-bert/

Which environment variable sets the cache directory for the above

deepspeed/ops/transformer/inference/triton/matmul_ext.py

call?

Q&A Style

The matmul_ext.py script uses the

TRITON_CACHE_DIR=/xyz/

environment variable as its cache directory.

(see https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/ops/transformer/inference/triton/matmul_ext.py )
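Judging from the linked source, the autotune cache manager only falls back to a directory under the home directory (~/.triton) when TRITON_CACHE_DIR is unset, which is why the traceback ends up trying to create '/home/user-id/.triton'. A minimal fix, assuming /work/user-id/justizscrap-bert/triton_cache is a writable path (the subdirectory name is just an example), is to export the variable before launching accelerate, alongside the other exports in the SLURM batch script:

# set before `accelerate launch ...`; the path is an example writable location
export TRITON_CACHE_DIR=/work/user-id/justizscrap-bert/triton_cache
mkdir -p "$TRITON_CACHE_DIR"   # optional; the traceback shows the cache manager also calls os.makedirs(..., exist_ok=True)

As far as I can tell, Triton's own compilation cache honours the same variable, so setting it once should keep both DeepSpeed's autotune cache and Triton's kernel cache off the read-only home directory.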