How to disable caching in load_dataset()?

I am doing the following three steps for a large number of iterations (sketched in code below):

  1. Loading a parquet file using load_dataset().
  2. Tokenising it using dataset.map() and a Hugging Face tokenizer.
  3. Saving the tokenised dataset to disk in Arrow format.
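As a rough sketch, the pipeline looks something like this (the file paths, the tokenizer name, and the "text" column are placeholders, not my actual setup):

    from datasets import load_dataset
    from transformers import AutoTokenizer

    # 1. Load a parquet file
    ds = load_dataset("parquet", data_files="data/shard.parquet", split="train")

    # 2. Tokenise with a Hugging Face tokenizer
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    ds = ds.map(lambda batch: tokenizer(batch["text"]), batched=True)

    # 3. Save the tokenised dataset to disk in Arrow format
    ds.save_to_disk("output/tokenised")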

I have the following questions:
a) I don’t want to save anything to ~/.cache/huggingface/datasets/, as I am saving the final result at a separate location for further use. I tried using load_dataset(..., cache_dir=None) and calling datasets.disable_caching(), but neither seems to work. From some other threads, I understood that caching can be disabled in dataset.map() and dataset.filter(), but not in load_dataset(). How do I disable all types of caching?
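Concretely, the calls I tried look roughly like this (the parquet path is a placeholder):

    import datasets

    # what I tried: the global flag plus an explicit cache_dir argument
    datasets.disable_caching()
    ds = datasets.load_dataset("parquet", data_files="data/shard.parquet",
                               split="train", cache_dir=None)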

b) I plan to train a GPT-like transformer model on this tokenised data using the HF ecosystem. I want to conserve disk space but at the same time not make loading extremely slow downstream. Which is better for Step 3 above: Arrow or Parquet?

@lhoestq @mariosasko @albertvillanova


Hi!

a) I don’t want to save anything to ~/.cache/huggingface/datasets/, as I am saving the final result at a separate location for further use. I tried using load_dataset(..., cache_dir=None) and calling datasets.disable_caching(), but neither seems to work. From some other threads, I understood that caching can be disabled in dataset.map() and dataset.filter(), but not in load_dataset(). How do I disable all types of caching?

Indeed, currently disable_caching() uses a temporary directory when saving intermediate map() results, but load_dataset() still writes the original dataset to ~/.cache/huggingface/datasets/.
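One possible workaround (not an official API, just a sketch with placeholder paths) is to point cache_dir at a throwaway temporary directory and let it be deleted once the processed result has been saved elsewhere:

    import tempfile
    from datasets import load_dataset

    with tempfile.TemporaryDirectory() as tmp_cache:
        # load_dataset still writes Arrow files, but into the throwaway directory
        ds = load_dataset("parquet", data_files="data/shard.parquet",
                          split="train", cache_dir=tmp_cache)
        ds = ds.map(lambda batch: batch, batched=True)  # tokenisation etc. goes here
        ds.save_to_disk("output/tokenised")
    # the temporary cache is deleted when the with-block exits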

b) I plan to train a GPT-like transformer model on this tokenised data using the HF ecosystem. I want to conserve disk space but at the same time not make loading extremely slow downstream. Which is better for Step 3 above: Arrow or Parquet?

It depends on the dataset size and your training setup, but Arrow is usually fine. For bigger datasets you may use Parquet instead, together with streaming=True.
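For reference, both options are one-liners (a sketch with placeholder paths, assuming ds is the tokenised Dataset from the steps above):

    from datasets import load_dataset, load_from_disk

    # Arrow: uncompressed, reloads quickly via memory-mapping
    ds.save_to_disk("output/tokenised_arrow")
    reloaded = load_from_disk("output/tokenised_arrow")

    # Parquet: compressed (smaller on disk) and can be streamed at training time
    ds.to_parquet("output/tokenised.parquet")
    streamed = load_dataset("parquet", data_files="output/tokenised.parquet",
                            split="train", streaming=True)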


I assume someone who is trying to disable the cache would not like the undocumented side effect that load_dataset ignores it. Is this a bug?

I agree it’s kind of confusing. We might change that at some point and make load_dataset() use a temporary directory in that case, though it’s probably not a trivial change.

Btw it’s mentioned here right now: disable_caching() docs

I am getting very different results across two consecutive runs of the same code (first run: 99% accuracy, second run: 60%). I am using the datasets package and the cifar10 dataset. There is no error message or anything. After two months of trying to pin down the bug, I am guessing it is a caching problem. Is there a way to visualize the caching structure of datasets?
I also noticed that every datasets function has its own cache directory, which makes it really hard for me to get a sense of what the caching structure looks like.
Do you have any advice on how to make runs reproducible?
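For what it’s worth, the closest I have found to "visualizing" the cache is listing the cache files behind each Dataset object and clearing the ones produced by map()/filter() (a rough sketch; I am not sure this catches everything):

    from datasets import load_dataset

    ds = load_dataset("cifar10", split="train")

    # list the Arrow files backing this dataset (one dict per file, with a "filename" key)
    for cache_file in ds.cache_files:
        print(cache_file["filename"])

    # delete the cache files created by map()/filter() for this dataset,
    # so the next run recomputes them from scratch
    removed = ds.cleanup_cache_files()
    print(f"removed {removed} cache file(s)")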

Any news on this? I also need to disable the cache in load_dataset(), since the cache uses roughly triple the volume of the raw data.

Btw if you would like to save disk space, please consider loading the dataset in Streaming mode (streaming=True in load_dataset).

If it can help, note that it’s possible to write a streaming dataset locally into a Dataset object using:

ds = Dataset.from_generator(streaming_dataset.__iter__)
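For example, a minimal end-to-end sketch (the parquet path and output path are placeholders):

    from datasets import Dataset, load_dataset

    # stream the source instead of caching the full dataset locally
    streaming_dataset = load_dataset("parquet", data_files="data/shard.parquet",
                                     split="train", streaming=True)

    # materialise the streamed examples into a regular (Arrow-backed) Dataset
    ds = Dataset.from_generator(streaming_dataset.__iter__)
    ds.save_to_disk("output/materialised")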