Question on language modeling preprocessing

I am trying to run the language modeling script run_mlm.py, but I am running into storage issues during the preprocessing of the input text data. The main issue is that the preprocessed data gets saved in the .cache/huggingface/datasets folder by default, and my .cache folder is pretty small. Is it possible to redirect the preprocessed data to a different folder?

Thanks a lot for your help.

You can change that default by setting an environment variable that controls where the cache goes. For all HF libraries, the variable is "HF_HOME". Set it before launching the script so the libraries pick it up when they are imported.
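In case it helps, here is a minimal sketch of doing this from Python instead of the shell. The helper name `redirect_hf_cache` and the target directory are hypothetical, made up for illustration. It assumes the default layout, where the datasets cache lives in a "datasets" subfolder of HF_HOME unless a more specific variable overrides it, and it has to run before any HF library is imported.

```python
import os
import tempfile
from pathlib import Path


def redirect_hf_cache(new_home: str) -> Path:
    """Point HF_HOME at new_home (hypothetical helper, not part of any HF library).

    Returns the path where the datasets cache is expected to land,
    assuming the default layout of <HF_HOME>/datasets.
    """
    home = Path(new_home).expanduser().resolve()
    home.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing
    os.environ["HF_HOME"] = str(home)        # must happen before importing HF libraries
    return home / "datasets"


# Illustrative location only; replace with a folder on a disk that has enough space.
example_home = os.path.join(tempfile.gettempdir(), "hf_home_example")
datasets_cache = redirect_hf_cache(example_home)
print(f"HF_HOME is now {os.environ['HF_HOME']}")
print(f"Preprocessed datasets should be cached under {datasets_cache}")
```

Exporting the variable in the shell before calling run_mlm.py has the same effect and avoids touching the script at all.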


Thanks for the quick reply. It works like a charm.
