How to handle the cache system properly?

load_dataset() and save_to_disk()

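Even though both write data in the same Arrow format, they serve different purposes: the cache that load_dataset() creates is an internal speed optimization, while save_to_disk() is meant for long-term storage. Some people seem to be trying to reuse the cache files as if they were a saved dataset, which is not what the cache is designed for.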

Additionally, although I am unsure how robust it is in multi-process or multi-user environments, one way to avoid re-downloading is to point HF_HOME at a reasonably fast shared folder on the network, so that every machine uses the same cache. For remote environments, save_to_disk() and load_from_disk() appear to support services like S3 out of the box (through fsspec). However, this does come at a cost…
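For the HF_HOME approach, a minimal sketch could look like the following. The helper name use_shared_hf_home and the demo path are my own placeholders (a temp folder stands in for the shared network mount), and it assumes the default layout where the datasets cache lives under $HF_HOME/datasets. The variable needs to be set before any Hugging Face library is imported, since the libraries read it at import time.

```python
import os
import tempfile
from pathlib import Path


def use_shared_hf_home(shared_root: str) -> Path:
    """Point HF_HOME at a shared folder and return the expected datasets cache path."""
    hf_home = Path(shared_root).expanduser()
    hf_home.mkdir(parents=True, exist_ok=True)
    os.environ["HF_HOME"] = str(hf_home)
    # Assumption: HF_DATASETS_CACHE is not overridden, so the cache is $HF_HOME/datasets.
    return hf_home / "datasets"


# Hypothetical location; replace with your shared mount (e.g. an NFS path).
demo_root = os.path.join(tempfile.gettempdir(), "hf_home_demo")
datasets_cache = use_shared_hf_home(demo_root)
print(datasets_cache)

# Only import datasets / transformers after this point.
```

Setting the variable in the shell or job scheduler configuration instead of in Python achieves the same thing and avoids the import-order concern.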

By the way, for enterprise use cases, there is an option to consult dedicated support on Hugging Face. Whether this is suitable or not will depend on the scale of the project.