The .map() function creates a cache file 100 times larger than the original dataset file.
Can this behaviour be somehow avoided?
Note that it happens when using `num_proc=48`.
I also encountered this today with `datasets==2.2.1` and I'm looking for a solution as well. I processed a 3.7 GB JSON file with the map function using `num_proc=32`, and more than 300 GB of Arrow files were created in the `/tmp` directory. Of course, I got a "No space left on device" error.
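I haven't found a definitive fix, but a few `map()` options may reduce or avoid the on-disk cache, depending on the workload. A minimal sketch of what I've been trying (`my_function` and `data.json` are placeholders for your own processing function and file; whether `keep_in_memory` is practical depends on whether the processed dataset fits in RAM):

```python
from datasets import load_dataset, disable_caching

# Optionally disable the on-disk cache globally, so map() results
# are not written out as cache files (only viable if the processed
# dataset fits in memory).
disable_caching()

dataset = load_dataset("json", data_files="data.json", split="train")

processed = dataset.map(
    my_function,             # placeholder: your processing function
    num_proc=32,
    keep_in_memory=True,     # keep results in RAM instead of writing Arrow cache files
    writer_batch_size=100,   # smaller Arrow writer batches -> smaller temp buffers
    remove_columns=dataset.column_names,  # drop input columns if the output replaces them
)

# Clean up any cache files left behind by earlier map() calls.
dataset.cleanup_cache_files()
```

If the cache is unavoidable for your use case, pointing it at a larger disk (e.g. via the `HF_DATASETS_CACHE` environment variable or the `cache_file_name` argument of `map()`) at least avoids filling up `/tmp`.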