I am doing the following three steps for a large number of iterations (a rough sketch of the loop follows the list):

- Loading a `parquet` file using `load_dataset()`.
- Tokenising it using `dataset.map()` and HuggingFace tokenizers.
- Saving the tokenised dataset on disk in `arrow` format.
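The file paths, tokenizer name, shard count, and column name here are placeholders, not my actual setup:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder tokenizer

def tokenize_fn(batch):
    # assumes the parquet files have a "text" column
    return tokenizer(batch["text"])

num_shards = 100  # placeholder iteration count

for i in range(num_shards):
    ds = load_dataset("parquet", data_files=f"data/shard_{i:05d}.parquet", split="train")
    ds = ds.map(tokenize_fn, batched=True, remove_columns=["text"])
    ds.save_to_disk(f"tokenised/shard_{i:05d}")  # arrow format
```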
I have the following questions:
a) I don’t want to save anything to `~/.cache/huggingface/datasets/`, since I am saving the final result at a separate location for further use. I tried using `load_dataset(..., cache_dir=None)` and setting `datasets.disable_caching()`, but neither seems to work. From some other threads, I understood that caching can be disabled in `dataset.map()` and `dataset.filter()`, but not in `load_dataset()`. How do I disable all types of caching?
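For reference, this is roughly what I tried (the path is a placeholder):

```python
import datasets
from datasets import load_dataset

datasets.disable_caching()  # seems to affect map()/filter() only

ds = load_dataset(
    "parquet",
    data_files="data/shard_00000.parquet",  # placeholder
    cache_dir=None,  # None just falls back to the default cache location
    split="train",
)
# the prepared arrow files still show up under ~/.cache/huggingface/datasets/
```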
b) I plan to train a GPT-like transformer model on this tokenised data using the HF ecosystem. I want to conserve disk space, but at the same time not make loading extremely slow downstream. Which format is better for Step 3 above: `arrow` or `parquet`?
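Concretely, I'm choosing between these two calls for Step 3 (`ds` is the tokenised dataset from `map()` above; paths are placeholders):

```python
# Option 1: arrow — reload later with datasets.load_from_disk(); the files are
# memory-mapped so loading is fast, but they are uncompressed and larger on disk
ds.save_to_disk("tokenised/shard_00000")

# Option 2: parquet — compressed, so smaller on disk; reload with
# load_dataset("parquet", data_files="tokenised/shard_00000.parquet")
ds.to_parquet("tokenised/shard_00000.parquet")
```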