After loading imagenet-1k with load_dataset("imagenet-1k"), I found two main huge folders in HF_DATASETS_CACHE:
- datasets/imagenet-1k, which is about 155 GB
- datasets/downloads, which is about 154 GB
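
For reference, this is roughly how I measured those folder sizes (a minimal sketch, assuming the default cache location; HF_DATASETS_CACHE may point elsewhere on your machine):

```python
from pathlib import Path

# sum up file sizes under each cache subfolder
# (default cache location assumed; adjust if HF_DATASETS_CACHE is set)
cache = Path.home() / ".cache" / "huggingface" / "datasets"
for sub in ("imagenet-1k", "downloads"):
    total = sum(f.stat().st_size for f in (cache / sub).rglob("*") if f.is_file())
    print(sub, f"{total / 1024**3:.1f} GB")
```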
Also, when I run ds['train'].size_in_bytes / 1024**3, I get approximately 310, which I guess is the size of the dataset in GB. But the total size of the files in the imagenet-1k repo is only about 160 GB.
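
To be concrete, here is how I'm reading those numbers (the comment about what size_in_bytes counts is my guess, not something I've verified):

```python
from datasets import load_dataset

ds = load_dataset("imagenet-1k")
for split in ds:
    # my guess: size_in_bytes counts both the downloaded archives and the
    # generated Arrow tables, which would explain roughly 155 GB + 154 GB
    print(split, f"{ds[split].size_in_bytes / 1024**3:.1f} GB")
```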
If I run ds.cleanup_cache_files(), it doesn’t remove anything.
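
This is the cleanup call I tried; as far as I understand, it reports per split how many cache files it removed:

```python
removed = ds.cleanup_cache_files()
print(removed)  # e.g. {'train': 0, 'validation': 0, 'test': 0} -- nothing removed
```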
So my question is: which of these are the actual dataset files that I should keep, and which can I freely delete? I don’t want to store the dataset at double its size on my PC.