How to disable caching in load_dataset()?

varadhbhatnagar · January 30, 2024, 7:23am

I am doing the following three steps for a large number of iterations:

Loading a parquet file using load_dataset().
Tokenize it using dataset.map() and HuggingFace tokenizers.
Saving the tokenised dataset on disk in arrow format.

I have the following questions:
a) I don’t want to save anything to the ~/.cache/huggingface/datasets/ as I am saving the final result at a separate location for further use. I tried using load_dataset(..., cache_dir=None) and setting datasets.disable_caching() but none seem to work. From some other threads, I understood that caching can be disabled in dataset.map() and dataset.filter() but not in load_dataset(). How do I disable all types of caching?

b) I plan to train a GPT like transformer model on this tokenised data using the HF ecosystem. I want to conserve disk space but at the same time not make loading extremely slow downstream. What is better arrow vs parquet format in Step 3 above?

@lhoestq @mariosasko @albertvillanova

lhoestq · February 12, 2024, 2:37pm

Hi !

a) I don’t want to save anything to the ~/.cache/huggingface/datasets/ as I am saving the final result at a separate location for further use. I tried using load_dataset(..., cache_dir=None) and setting datasets.disable_caching() but none seem to work. From some other threads, I understood that caching can be disabled in dataset.map() and dataset.filter() but not in load_dataset() . How do I disable all types of caching?

Indeed currently disable_caching uses a temp directory when saving intermediate map results, but load_dataset still writes the original dataset in ~/.cache/huggingface/datasets/

b) I plan to train a GPT like transformer model on this tokenised data using the HF ecosystem. I want to conserve disk space but at the same time not make loading extremely slow downstream. What is better arrow vs parquet format in Step 3 above?

It depends on the dataset size and your training setup, but usually it’s fine using Arrow. For bigger datasets you may use Parquet instead and use streaming=True

jahb57 · February 21, 2024, 4:09pm

I assume someone who is trying to disable the cache would not like the undocumented side effect that load_dataset ignores it. Is this a bug?

lhoestq · February 21, 2024, 5:17pm

I agree it’s kinda confusing though, we might change that at one point and make load_dataset use a temporary directory in that case, though it’s maybe not a trivial change

Btw it’s mentioned here right now: disable_daching() docs

SophieOstmeier · May 9, 2024, 4:41pm

I am getting very different performance results, when running to consecutive runs with the same code (first run: 99% accuracy, second run: 60%). I am using the datasets package and dataset cifar10. There is no error message or anything. After two months of trying to pin down the bug, I am guessing it is a caching problem. Is there a way to visualize the caching structure of datasets?
I also noticed that every function of datasets has a caching dir, which make it really hard for me to get a sense of how the caching structure is.
Do you have any advice how to make runs reproducible?

bobo8 · July 10, 2024, 6:43am

any news on this ? kindly need to disable cache as well in load_dataset since the cache will use triple volumes compare to the raw data.

lhoestq · July 10, 2024, 2:46pm

Btw if you would like to save disk space, please consider loading the dataset in Streaming mode (streaming=True in load_dataset).

If it can help, note that it’s possible to write a streaming dataset locally into a Dataset object using

ds = Dataset.from_generator(streaming_dataset.__iter__)

Topic		Replies	Views
How to load parquet to datasets without caching? 🤗Datasets	1	3377	June 24, 2022
Using local dataset without changing cache 🤗Datasets	2	452	September 6, 2023
Load dataset from a specific cache file 🤗Datasets	3	1257	February 26, 2024
Map result saved to a different folder than custom HF_DATASETS_CACHE 🤗Datasets	1	675	June 14, 2022
Pipeline with custom dataset tokenizer: when to save/load manually 🤗Datasets	18	5630	September 18, 2020

How to disable caching in load_dataset()?

Related topics