Map() function freezes on large dataset

BelalElhossany · March 7, 2023, 9:27am

Hi, I’m trying to use map on a dataset of about 100GB, and it hangs every time. I tried many parameter combinations, but it always hangs.
I cannot even use a for loop instead: the values of the dictionary are not modified in the loop. The for loop doesn’t hang, though; it just has no effect.

I think the problem is in the I/O operations done in the map function, but I don’t know exactly what the problem is.

This is the last error after hanging for 15 mins at 10%

mariosasko · March 15, 2023, 1:42pm

Hi! This error seems to come from an IPython bug, not from datasets. It would help if you could verify this by running the code outside the IPython environment.

BelalElhossany · March 15, 2023, 2:01pm

@mariosasko I did, and it ran to completion, but I think there is a huge overhead.
I can iterate over the dataset in 10 mins, and it takes 30 mins to save the dataset to disk, but the map function took about 4 hrs to finish.
I don’t know exactly what the problem is, but I think it’s a disk bottleneck, as map tries to read data from the disk and write to the cache simultaneously.

mariosasko · March 15, 2023, 2:09pm

You can probably make the map faster by reducing the number of processes to, e.g., num_proc=os.cpu_count().
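A minimal sketch of that suggestion, assuming `dataset` is a `datasets.Dataset` and `fn` is your own mapping function. The helper names `pick_num_proc` and `run_map` are hypothetical and only illustrate capping the worker count at the number of logical CPUs:

```python
import os


def pick_num_proc(requested: int) -> int:
    """Cap the requested worker count at the logical CPU count (minimum 1)."""
    available = os.cpu_count() or 1  # os.cpu_count() can return None
    return max(1, min(requested, available))


def run_map(dataset, fn, requested_procs: int):
    """Call dataset.map with a capped num_proc.

    `dataset` is assumed to be a datasets.Dataset; `fn` is your mapping function.
    """
    num_proc = pick_num_proc(requested_procs)
    # num_proc=None keeps everything in the main process (no multiprocessing).
    return dataset.map(fn, num_proc=num_proc if num_proc > 1 else None)
```

If the disk really is the bottleneck, going even lower than the CPU count may be worth trying, since more workers also means more concurrent reads and writes.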

BelalElhossany · March 15, 2023, 2:17pm

@mariosasko Great, thanks!

whoismikha · September 8, 2023, 6:28pm

Is there a way to find out more precisely what the bottleneck is? My num_proc is at cpu_count, but it still takes unreasonably long to save at the end of processing (or when writer_batch_size is reached).

Is it possible to disable the saving to disk entirely and only keep the map result in memory? keep_in_memory sounds like it does that but doesn’t seem to actually do it.
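One rough way to narrow this down, sketched under the assumption that the mapping function can be called on plain dicts: time the function alone on a small sample and compare it with the per-example time of the full map. The helper name `seconds_per_example` is hypothetical.

```python
import time


def seconds_per_example(fn, examples) -> float:
    """Average wall-clock seconds spent inside fn per example (0.0 if empty)."""
    count = 0
    start = time.perf_counter()
    for example in examples:
        fn(dict(example))  # pass a copy so the sample itself is not modified
        count += 1
    elapsed = time.perf_counter() - start
    return elapsed / count if count else 0.0
```

For example, `seconds_per_example(fn, dataset.select(range(1000)))` measures the function together with reading the sample. If that number is small compared with what the full map spends per example, the time is probably going into serializing and writing the results rather than into the function itself.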

whoismikha · September 8, 2023, 6:31pm

For me, for some reason, it only happens when I create PIL image columns in the mapping function.
If I convert the PIL images to bytes instead, the map doesn’t freeze. Exact same dataset.

@lhoestq do you have ideas?
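For readers following along, the bytes workaround could look roughly like the sketch below. This is not the poster's actual code: the column names `image` and `image_bytes` and the choice of PNG are assumptions for illustration only.

```python
import io


def image_to_png_bytes(image) -> bytes:
    """Encode a PIL-style image (anything with .save(fp, format=...)) as PNG bytes."""
    buffer = io.BytesIO()
    image.save(buffer, format="PNG")
    return buffer.getvalue()


def encode_example(example: dict) -> dict:
    """Mapping function that returns a bytes column instead of a PIL image column.

    "image" and "image_bytes" are hypothetical column names.
    """
    return {"image_bytes": image_to_png_bytes(example["image"])}
```

It would be passed as `dataset.map(encode_example, remove_columns=["image"])`, so that only the bytes column is written.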

lhoestq · September 10, 2023, 11:13am

Could you share the code of the function you pass to map? And also the version where you convert PIL Images to bytes?

Can you also share some information about your operating system (Windows/macOS/Linux?) and your versions of pyarrow and datasets?

ReatKay · September 10, 2023, 4:49pm

# Disable disk caching
# (disable_caching() replaces the deprecated set_caching_enabled(False))
from datasets import disable_caching
disable_caching()

Regarding num_proc: what environment (OS) are you on? And what functionality are you actually using in your map call(s)? Iterating over a dataset is one thing; that is simply reading, which is exactly what a dataset is made for. During mapping you actually alter the data.

Simplest solution: once all your mapping is done, save the fully processed dataset to disk, and load it already mapped from disk when you need it.
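A sketch of that save-once, load-later pattern, assuming the processed data is a `datasets.Dataset`. The helper `get_processed_dataset` and its `build` and `loader` arguments are hypothetical; `save_to_disk` and `load_from_disk` are the library calls it relies on.

```python
import os


def get_processed_dataset(path, build, loader=None):
    """Load the processed dataset from `path` if present, else build and save it.

    `build` is your own zero-argument function that runs all the map calls.
    `loader` defaults to datasets.load_from_disk (imported lazily).
    """
    if loader is None:
        from datasets import load_from_disk
        loader = load_from_disk
    if os.path.isdir(path):
        return loader(path)  # already mapped: just read it back
    dataset = build()  # the expensive mapping happens only on the first run
    dataset.save_to_disk(path)
    return dataset
```

The mapping cost is then paid once, and later runs only pay for reading the saved dataset.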

Finally NVMe can really shine -laugh-
