How to handle the cache system properly?

Thanks for the pointers, I’ve read them already.

I think my question could be rephrased: how can we manage large datasets in a large company? More precisely, I am looking for a way to avoid having multiple copies of the same dataset in different locations, and to minimize the number of downloads (ideally just one).

I’ve been experimenting since my initial question and I’ve come up with this workflow (see the sketch after the list):

  1. Download the dataset with snapshot_download() or git.
  2. One person loads the dataset with load_dataset() and exports it with save_to_disk().
  3. Other people who want to use the dataset can make a local copy of the export, do their work, and remove everything afterwards.
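
Here is a minimal sketch of that workflow. The dataset name and the shared path are just examples, not anything prescribed by the docs:

```python
from huggingface_hub import snapshot_download
from datasets import load_dataset, load_from_disk

# Step 1 (optional): mirror the raw repo once, e.g. onto shared storage.
snapshot_download(repo_id="imdb", repo_type="dataset")  # "imdb" is only an example

# Step 2 (done once, by one person): build the processed Arrow dataset
# and export it to a location everyone can read.
ds = load_dataset("imdb")
ds.save_to_disk("/shared/datasets/imdb")  # hypothetical shared path

# Step 3 (each user): copy the export locally (or read it in place),
# work on it, and delete the local copy afterwards.
ds = load_from_disk("/shared/datasets/imdb")
```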

Does that make sense?

One thing I don’t understand: load_dataset() and save_to_disk() both write the dataset in Arrow format. However, as far as I can tell, they do not apply the same optimizations or produce the same on-disk layout, so load_from_disk() cannot read the dataset directly from the cache. Is there a particular reason for that?
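
To illustrate the mismatch I mean, here is a small sketch (the path is a hypothetical example):

```python
from datasets import load_dataset, load_from_disk

# load_dataset() writes its Arrow cache under ~/.cache/huggingface/datasets/...
ds = load_dataset("imdb", split="train")

# save_to_disk() writes a second Arrow copy with its own metadata files
# (dataset_info.json, state.json) into a directory of your choosing.
ds.save_to_disk("/tmp/imdb_train")  # hypothetical path

# load_from_disk() only understands the save_to_disk() layout,
# not the cache directory that load_dataset() maintains.
reloaded = load_from_disk("/tmp/imdb_train")
```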

Best,

Julien
