Help! HuggingFace dataset.map() creates unreachable temp files that fill up disks

OS: Ubuntu 20.04 LTS
When I used HuggingFace dataset.map() to process large datasets, its speed degraded quickly, my disk filled up, and the process eventually crashed. I tried deleting ~/.cache/huggingface, but that only reclaimed a small fraction of my disk space (3 GB). I searched the internet but could not find a relevant answer.
I used the map() function to process an image dataset:

processed_dataset = dataset.map(
    function=image_feature_extraction_text_tokenization,
    batched=True,
    fn_kwargs={"max_target_length": 256},
    batch_size=1024,
    num_proc=4,
)

Now the computer is unusable, but I need it for other jobs later, so I would appreciate any help freeing up the space.


map() creates one temporary file per process, and they are all stored under ~/.cache/huggingface, so deleting that directory should free all the space they occupy.
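If you want a more targeted cleanup than deleting the whole directory, the datasets library also provides Dataset.cleanup_cache_files(), which removes the cache files backing a single dataset. A minimal sketch ("your_dataset" is a placeholder for whatever you originally loaded):

from datasets import load_dataset

# "your_dataset" is a placeholder; load the same dataset you ran map() on.
dataset = load_dataset("your_dataset", split="train")

# Delete the cache files backing this dataset, including any leftover
# map() outputs, and report how many files were removed.
num_removed = dataset.cleanup_cache_files()
print(f"removed {num_removed} cache files")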

I think you can avoid the crash by setting the batch_size and writer_batch_size parameters to lower values to reduce RAM usage (the defaults are 1000 examples).
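For example, reusing the snippet from the question with smaller values (128 is only an illustrative choice; tune it to your machine):

processed_dataset = dataset.map(
    function=image_feature_extraction_text_tokenization,
    batched=True,
    fn_kwargs={"max_target_length": 256},
    batch_size=128,  # down from 1024: fewer examples held in RAM per batch
    writer_batch_size=128,  # down from the default 1000: flush to disk more often
    num_proc=4,
)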