Map() function freezes on large dataset

Hi, I’m trying to use `map` on a dataset of about 100 GB, and it hangs every time. I’ve tried many parameter combinations, but it always hangs.
I can’t even use a for loop: the values of the dictionary are not modified inside the loop. The for loop doesn’t hang, though, it just has no effect.

I think the problem is in the I/O operations done inside the map function, but I don’t know exactly what it is.

This is the last error after hanging for 15 minutes at 10%:

Hi! This error seems to come from an IPython bug, not from `datasets`. It would help if you could verify this by running the code outside the IPython environment.

@mariosasko I did, and it ran to completion, but I think there is a huge overhead.
I can iterate over the dataset in 10 minutes, and it takes 30 minutes to save the dataset to disk, but the map function took about 4 hours to finish.
I don’t know exactly what the problem is, but I think it’s a disk bottleneck, since map tries to read data from disk and save to the cache simultaneously.

You can probably make the map faster by reducing the number of processes to, e.g., num_proc=os.cpu_count().

@mariosasko Great, thanks!

Is there a way to find out more precisely what the bottleneck is? My num_proc is at cpu_count, but it still takes unreasonably long to save at the end of processing (or when writer_batch_size is reached).

Is it possible to disable saving to disk entirely and only keep the map result in memory? `keep_in_memory` sounds like it does that, but it doesn’t seem to actually do it.

For me, it for some reason only happens when I create PIL image columns during the mapping function.
If I convert the PIL images to bytes instead, the map doesn’t freeze. Same exact dataset.
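The conversion I use is roughly this (a sketch; the helper name is made up, and it assumes Pillow is installed): serializing each image to PNG bytes so the Arrow writer stores plain binary instead of Python objects.

```python
import io
from PIL import Image

def pil_to_bytes(img):
    # Serialize the PIL image to PNG bytes so the column holds
    # raw binary data instead of PIL objects.
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    return buf.getvalue()

# Tiny example image standing in for a real dataset row.
img = Image.new("RGB", (4, 4), color=(255, 0, 0))
data = pil_to_bytes(img)
```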

@lhoestq do you have ideas?

Could you share the code of the function you pass to `map`? And also the version where you convert the PIL images to bytes?

Can you also share some information about your operating system (Windows/macOS/Linux?) and your versions of pyarrow and datasets?

# Disable disk caching
from datasets import set_caching_enabled
set_caching_enabled(False)

Regarding num_proc: what environment (OS)? And what functionality are you actually using in your map call(s)? Iterating over a dataset is one thing: that’s simply reading it, which is exactly what a dataset is made for. During mapping you actually make alterations.

Simplest solution: after you have done all your mapping, save your fully processed, usable dataset to disk, and load it back, already mapped and ready to use, whenever you need it.

That’s where NVMe can finally really shine -laugh-