How to disable caching in load_dataset()?

Hi !

a) I don’t want to save anything to ~/.cache/huggingface/datasets/, as I am saving the final result at a separate location for further use. I tried load_dataset(..., cache_dir=None) and datasets.disable_caching(), but neither seems to work. From other threads, I understood that caching can be disabled in dataset.map() and dataset.filter(), but not in load_dataset(). How do I disable all types of caching?

Indeed, currently disable_caching() makes map() write its intermediate results to a temporary directory, but load_dataset() still writes the original dataset to ~/.cache/huggingface/datasets/.

b) I plan to train a GPT-like transformer model on this tokenised data using the HF ecosystem. I want to conserve disk space without making loading extremely slow downstream. Which is better for Step 3 above: the Arrow or the Parquet format?

It depends on the dataset size and your training setup, but Arrow is usually fine. For bigger datasets you may want Parquet instead, combined with streaming=True.
