How does the cache work?

After loading ‘imagenet-1k’ with load_dataset("imagenet-1k"), I found two huge folders in HF_DATASETS_CACHE:

  • datasets/imagenet-1k, which takes about 155 GB
  • datasets/downloads, which takes about 154 GB

Also, when I run ds['train'].size_in_bytes / 1024**3 I get approximately 310 GB, which I guess is the size of the dataset.
But the total size of the files in the imagenet-1k repo is about 160 GB.
If I run ds.cleanup_cache_files() it doesn’t remove anything.
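
For reference, this is roughly the snippet I used to check (the numbers above come from these prints):

```python
from datasets import load_dataset

ds = load_dataset("imagenet-1k")

# Reported dataset size in GB (~310 for me)
print(ds["train"].size_in_bytes / 1024**3)

# Returns the number of removed cache files per split; all zeros in my case
print(ds.cleanup_cache_files())
```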

So my question is: which files are the actual dataset files I need to keep, and which can I freely delete?
I don’t want to store the dataset at double its size on my PC.

/downloads contains the downloaded data files, and /imagenet-1k contains the .arrow files generated from them (the images are JPEGs, so it’s hard to compress them further in this conversion from TAR to Arrow). Hence, the total size is twice the original dataset’s size.
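
If you want to double-check which files have to stay, the loaded dataset lists the .arrow files backing it; nothing under /downloads appears in that list. A quick sketch:

```python
from datasets import load_dataset

ds = load_dataset("imagenet-1k")

# The memory-mapped .arrow files under datasets/imagenet-1k; these are
# the only files needed to reload the dataset from the cache.
for split, dataset in ds.items():
    for cache_file in dataset.cache_files:
        print(split, cache_file["filename"])
```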

Deleting /downloads should work.
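
For example, assuming the default cache layout (datasets.config.DOWNLOADED_DATASETS_PATH should point at the /downloads folder):

```python
import shutil
from datasets import config

# Remove the raw downloaded archives; the .arrow cache stays intact.
# Note: other datasets whose raw files live here would need re-downloading.
shutil.rmtree(config.DOWNLOADED_DATASETS_PATH, ignore_errors=True)
```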

PS: Calling ds.cleanup_cache_files() deletes all of the dataset’s cached .arrow files except the ones listed in ds.cache_files (those are the memory-mapped ones).
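
So it only frees derived cache files, e.g. the ones written by .map(). A minimal (illustrative, and slow on ImageNet) sketch of that behavior:

```python
from datasets import load_dataset

ds = load_dataset("imagenet-1k", split="train")

mapped = ds.map(lambda example: example)  # writes a new cache file
del mapped  # release the memory-map on the derived file

print(ds.cleanup_cache_files())  # deletes the derived file, keeps ds.cache_files
```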
