Hello,
I’ve been struggling with large datasets on our HPC cluster, and I don’t think I fully understand how to use the datasets library.
I want to use fineweb-edu, so I downloaded the dataset with snapshot_download
as shown below (I skip the sample subsets of the repo). The dataset is stored on an NFS server.
...
_ = snapshot_download(
    "HuggingFaceFW/fineweb-edu",
    repo_type="dataset",
    allow_patterns=["data/*"],
    local_dir=os.path.join(target_dir, "HuggingFaceFW/fineweb-edu"),
)
...
Now I want to load this dataset. As I understand it, load_dataset
will build an optimized, cached (Arrow) version of the dataset, so I specify the location of the cache with the HF_HOME
variable and load the dataset.
...
_ = load_dataset(os.path.join(target_dir, "HuggingFaceFW/fineweb-edu", "data"), num_proc=12)
...
The cached version is built correctly, so far so good. Now I want to load this cached version without having access to the original parquet files. How can I do that? I’ve read the documentation several times, and the best option seems to be the arrow
builder, i.e. load_dataset("arrow", data_files=data_files),
but that builds yet another cached version of the dataset. I’ve tried to disable caching, without success.
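For reference, this is roughly what I tried (the glob pattern is only an illustration of how one might collect the Arrow files that load_dataset left in the cache):

import glob
from datasets import load_dataset

# Illustrative pattern: point data_files at the .arrow shards that
# load_dataset wrote into the cache under HF_HOME.
data_files = sorted(
    glob.glob("/path/to/hf_home/datasets/**/*.arrow", recursive=True)
)

ds = load_dataset("arrow", data_files=data_files)
# This loads, but it writes yet another cached copy of the dataset.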
I am sure I am missing something; could someone point me in the right direction?
Thanks,
Julien
Thanks for the pointers, I’ve read them already.
I think my question could be rephrased as: how can we manage large datasets in a large company? More precisely, I am looking for a way to avoid having multiple copies of the same dataset in different locations and to minimize the number of downloads (ideally a single one).
I’ve been experimenting since my initial question and I’ve come up with this workflow:
- Download the dataset with snapshot_download() or git.
- One person loads the dataset with load_dataset() and exports it with save_to_disk().
- Other people who want to use the dataset make a local copy of the export, do their work, and remove everything afterwards (sketched below).
Does that make sense?
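Concretely, I picture steps 2 and 3 like this (paths are placeholders):

from datasets import load_dataset, load_from_disk

# Step 2: one person builds the cache once and exports a self-contained copy.
ds = load_dataset("/nfs/datasets/HuggingFaceFW/fineweb-edu/data", num_proc=12)
ds.save_to_disk("/nfs/shared/fineweb-edu-arrow")

# Step 3: everyone else works from the exported copy (or a local copy of it),
# without needing the original parquet files or the builder cache.
ds = load_from_disk("/nfs/shared/fineweb-edu-arrow")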
One thing I don’t understand is that load_dataset() and save_to_disk() both save the dataset in Arrow format. However, as far as I understand, they do not perform the same optimizations, and therefore load_from_disk() cannot load the dataset directly from the cache. Is there any particular reason for that?
Best,
Julien
Regarding load_dataset() and save_to_disk(): even though both are saved in the same Arrow format, they differ in purpose. save_to_disk() is meant for long-term storage, while the load_dataset() cache is an internal optimization for speed. Some people do seem to try to reuse the cache files directly anyway.
Additionally, while I am unsure how robust it is in multi-process or multi-user environments, one possible way to avoid re-downloading is to simply set HF_HOME
to a reasonably fast shared folder on the network. For remote environments, services like S3 seem to be supported out of the box, although that does come at a cost…
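As a minimal sketch of the shared-HF_HOME idea (the path is just an illustration; as far as I know, HF_HOME has to be set before datasets is imported):

import os

# Point every user's downloads and cache at the same shared folder.
os.environ["HF_HOME"] = "/shared/fast-storage/hf_home"

from datasets import load_dataset

# Later load_dataset() calls then reuse the shared download/cache
# instead of re-downloading and re-preparing the dataset for every user.
ds = load_dataset("HuggingFaceFW/fineweb-edu", num_proc=12)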
By the way, for enterprise use cases there is also the option of consulting Hugging Face’s dedicated support. Whether that is suitable will depend on the scale of the project.