Hi, I’m trying to use `map` on a dataset of about 100 GB, and it hangs every time. I’ve tried many parameter combinations, but it always hangs.
I can’t even use a for loop: the dictionary values aren’t modified inside the loop. The for loop doesn’t hang, though, it just has no effect.
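For what it’s worth, that part is expected: indexing a `Dataset` builds a fresh Python dict from the underlying Arrow storage on every access, so mutating the returned dict never touches the stored data. A minimal pure-Python analog (the class below is illustrative, not the real implementation):

```python
class ArrowLikeStore:
    """Toy stand-in for column-oriented storage, like a Dataset."""

    def __init__(self, columns):
        self._columns = columns  # column name -> list of values

    def __len__(self):
        return len(next(iter(self._columns.values())))

    def __getitem__(self, i):
        # A brand-new dict is built on every access, like Dataset rows
        return {name: col[i] for name, col in self._columns.items()}


ds = ArrowLikeStore({"text": ["a", "b"], "label": [0, 1]})

for i in range(len(ds)):
    row = ds[i]
    row["label"] = 99  # mutates the temporary dict only

print(ds[0]["label"])  # → 0, storage is unchanged
```

That’s why `map` (which builds a new dataset from the function’s return values) is the supported way to transform rows.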
I think the problem is in the I/O operations done inside the map function, but I don’t know exactly what it is.
This is the last error after it hung for 15 minutes at 10%.
@mariosasko I did; it ran to completion, but I think there is huge overhead.
I can iterate over the dataset in 10 minutes, and saving it to disk takes 30 minutes, but the map function took about 4 hours to finish.
I don’t know exactly what the problem is, but I think it’s a disk bottleneck, since map reads data from the disk and writes to the cache simultaneously.
Is there a way to find out more precisely what the bottleneck is? My num_proc is at cpu_count, but it still takes unreasonably long to save at the end of processing (or whenever writer_batch_size is reached).
Is it possible to disable saving to disk entirely and keep the map result only in memory? keep_in_memory sounds like it does that, but it doesn’t seem to actually do it.
For me, it for some reason only happens when I create PIL image columns in the map function.
If I instead convert the PIL images to bytes, map doesn’t freeze. Same exact dataset.
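In case anyone wants to reproduce that workaround, serializing to bytes inside the map function looks roughly like this (the column names `image` / `image_bytes` are made up for illustration; it assumes an existing PIL image column):

```python
import io

from PIL import Image


def image_to_bytes(img):
    # Serialize a PIL image to PNG bytes so the mapped column stores
    # plain bytes rather than live PIL objects
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    return buf.getvalue()


def preprocess(example):
    # Hypothetical column names, for illustration only
    return {"image_bytes": image_to_bytes(example["image"])}


sample = {"image": Image.new("RGB", (2, 2), color=(255, 0, 0))}
out = preprocess(sample)
print(out["image_bytes"][:8])  # PNG signature: b"\x89PNG\r\n\x1a\n"
```

You would then pass `preprocess` to `ds.map(...)` (typically with `remove_columns=["image"]` so the original PIL column is dropped).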
# Disable disk caching
from datasets import set_caching_enabled
set_caching_enabled(False)
# Note: set_caching_enabled is deprecated; on recent versions use datasets.disable_caching()
Regarding num_proc: what environment (OS) are you on, and what are you actually doing in your map call(s)? Iterating a dataset is one thing: that’s simply reading it, which is exactly what a dataset is made for. During mapping you actually make alterations.
Simplest solution: after you have done all your mapping, save the fully processed, usable dataset to disk, and load it, already mapped and ready to use, from disk whenever you need it.