The .map() function creates a cache file 100 times larger than the original dataset file.
Can this behaviour be somehow avoided?
Note that it happens when using `num_proc=48`.
I also encountered this today with `datasets==2.2.1` and I'm looking for a solution as well. I processed a 3.7 GB JSON file with the map function using `num_proc=32`, and more than 300 GB of Arrow files were created in the `/tmp` directory. Of course, I got a "No space left on device" error.
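I haven't found a definitive fix, but a few `map()` options may reduce or avoid the on-disk cache, depending on the workload. A minimal sketch of what I've been trying (`my_function` and `data.json` are placeholders for your own processing function and file; whether `keep_in_memory` is practical depends on whether the processed dataset fits in RAM):

```python
from datasets import load_dataset, disable_caching

# Optionally disable the on-disk cache globally, so map() results
# are not written out as cache files (only viable if the processed
# dataset fits in memory).
disable_caching()

dataset = load_dataset("json", data_files="data.json", split="train")

processed = dataset.map(
    my_function,             # placeholder: your processing function
    num_proc=32,
    keep_in_memory=True,     # keep results in RAM instead of writing Arrow cache files
    writer_batch_size=100,   # smaller Arrow writer batches -> smaller temp buffers
    remove_columns=dataset.column_names,  # drop input columns if the output replaces them
)

# Clean up any cache files left behind by earlier map() calls.
dataset.cleanup_cache_files()
```

If the cache is unavoidable for your use case, pointing it at a larger disk (e.g. via the `HF_DATASETS_CACHE` environment variable or the `cache_file_name` argument of `map()`) at least avoids filling up `/tmp`.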