Map() function freezes on large dataset

Hi, I’m trying to use map on a dataset of about 100 GB, and it hangs every time. I have tried many parameter combinations, but it always hangs.
I cannot even fall back to a for loop: the values of the dictionary are not modified in the loop. The for loop doesn’t hang, though; it just has no effect.

I think the problem is in the I/O operations done in the map function, but I don’t know exactly what the problem is.

This is the last error after hanging for 15 mins at 10%

Hi! This error seems to come from an IPython bug, not from datasets. It would help if you could verify this by running the code outside the IPython environment.

@mariosasko I did, it ran to completion but I think there is a huge overhead.
I can iterate over the dataset in 10 mins, and it takes 30 mins to save the dataset to disk, but the map function took about 4 hrs to finish.
I don’t know exactly what the problem is, but I think it’s a disk bottleneck, as map tries to read data from the disk and write to the cache simultaneously.

You can probably make the map faster by reducing the number of processes to, e.g., num_proc=os.cpu_count().
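
A minimal sketch of that suggestion, in case it helps. The helper names capped_num_proc and map_with_capped_workers are hypothetical (not part of datasets), and batched=True with batch_size=1000 is my own assumption rather than something tested on this dataset; the only library call used is Dataset.map with its batched, batch_size and num_proc arguments.

```python
import os


def capped_num_proc(requested):
    """Clamp a requested worker count to the number of CPU cores (at least 1)."""
    cores = os.cpu_count() or 1  # os.cpu_count() may return None
    return max(1, min(requested, cores))


def map_with_capped_workers(dataset, fn, requested_workers, batch_size=1000):
    """Call dataset.map with at most one worker process per CPU core.

    `dataset` is assumed to be a datasets.Dataset. Because batched=True,
    `fn` receives a dict of lists (one batch) instead of a single example.
    """
    return dataset.map(
        fn,
        batched=True,
        batch_size=batch_size,
        num_proc=capped_num_proc(requested_workers),
    )
```

Usage would look like map_with_capped_workers(ds, my_fn, 64), where ds and my_fn are your own dataset and batch function. Whether this actually removes the overhead depends on where the bottleneck is; if the disk is the limit, fewer processes than cores may work better.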

@mariosasko Great, thanks!