How to handle the cache system properly?

Thanks for the pointers, I’ve read them already.

I think my question could be rephrased: how can we manage large datasets in a large company? More precisely, I am looking for a way to avoid having multiple copies of the same dataset in different locations, and to minimize the number of downloads (ideally just one).

I’ve been experimenting since my initial question and I’ve come up with this workflow (see the sketch after the list):

  1. Download the dataset with snapshot_download() or git.
  2. One person loads the dataset with load_dataset() and exports it with save_to_disk().
  3. Other people who want to use the dataset can make a local copy of the export, do their work, and remove everything afterwards.
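
Here is a minimal sketch of that workflow. The dataset name and the shared path are just examples, not anything prescribed by the docs:

```python
from huggingface_hub import snapshot_download
from datasets import load_dataset, load_from_disk

# Step 1 (optional): mirror the raw repo once, e.g. onto shared storage.
snapshot_download(repo_id="imdb", repo_type="dataset")  # "imdb" is only an example

# Step 2 (done once, by one person): build the processed Arrow dataset
# and export it to a location everyone can read.
ds = load_dataset("imdb")
ds.save_to_disk("/shared/datasets/imdb")  # hypothetical shared path

# Step 3 (each user): copy the export locally (or read it in place),
# work on it, and delete the local copy afterwards.
ds = load_from_disk("/shared/datasets/imdb")
```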

Does that make sense?

One thing I don’t understand: load_dataset() and save_to_disk() both write the dataset in Arrow format. However, as far as I can tell, they do not apply the same optimizations or produce the same on-disk layout, so load_from_disk() cannot read the dataset directly from the cache. Is there a particular reason for that?
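
To illustrate the mismatch I mean, here is a small sketch (the path is a hypothetical example):

```python
from datasets import load_dataset, load_from_disk

# load_dataset() writes its Arrow cache under ~/.cache/huggingface/datasets/...
ds = load_dataset("imdb", split="train")

# save_to_disk() writes a second Arrow copy with its own metadata files
# (dataset_info.json, state.json) into a directory of your choosing.
ds.save_to_disk("/tmp/imdb_train")  # hypothetical path

# load_from_disk() only understands the save_to_disk() layout,
# not the cache directory that load_dataset() maintains.
reloaded = load_from_disk("/tmp/imdb_train")
```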

Best,

Julien
