Dataset map function takes forever to run!

I’m trying to pre-process my dataset for the Donut model, and even after the mapping itself completes, it keeps running for about 100 minutes -.-. I ran this with num_proc=2; I’m not sure if setting it to all CPU cores would make much of a difference.

Any idea of how to fix this?

2 Likes

Hi! What does processor.tokenizer.is_fast return? If the returned value is True, it’s better not to use the num_proc parameter in map, to benefit from the tokenizer’s own parallelism. The “fast” tokenizers are written in Rust and process data in parallel by default, but this does not work well in multi-process Python code, so we disable the “fast” tokenizers’ parallelism when num_proc>1 to avoid deadlocks.
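For reference, a quick way to check this (a minimal sketch; the base Donut checkpoint here is an assumption, substitute your own):

```python
from transformers import DonutProcessor

# Load the Donut processor; the checkpoint name is just an example.
processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base")

# True means the tokenizer is Rust-backed and parallelizes internally.
print(processor.tokenizer.is_fast)
```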

Also, setting the return_tensors parameter to “np” should make the transform faster, as PyArrow natively supports NumPy 1-D arrays, which avoids the torch → NumPy conversion step.
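Continuing the sketch above, a hypothetical preprocessing function (the “text” column name and max_length value are assumptions for illustration):

```python
def preprocess(batch):
    # NumPy output is stored by PyArrow directly, skipping the
    # intermediate torch -> NumPy conversion inside map().
    return processor.tokenizer(
        batch["text"],
        truncation=True,
        padding="max_length",
        max_length=512,
        return_tensors="np",
    )

# No num_proc here: let the fast tokenizer parallelize internally.
dataset = dataset.map(preprocess, batched=True)
```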

2 Likes

Thanks @mariosasko. Yes, the tokenizer is fast. The reason I ran this with num_proc>1 is that without it the code ate up all my RAM (32 GB) and the kernel kept dying.

Was this problem ever solved? I played around with the different settings mentioned in this thread, and it changes nothing.

dataset.map stalls at around 80% of the dataset and then gets steadily slower and slower until it pretty much stops at 97% complete.

I tried cutting out a smaller piece of the dataset to test with, but this behavior persists on the smaller set too.

At the end it seems only one process is running and the rest are idle.

After 27 hours I just Ctrl-C’d out of it.

@Pensive Even with num_proc=None (or num_proc=1)? If that’s the case, can you interrupt the map (with num_proc=1) and paste the printed error stack trace?

With None or 1 it is slow from the get-go. If I interrupt it, the trace is somewhere in multiprocess.context.TimeoutError.

It is not even consistent.

If I put num_proc=30, a lot of the time it stalls right away and is slow like with None or 1.

Sometimes it starts pretty quickly, runs to 80%, slows down, and then is extremely slow and never finishes.

When I interrupted at that point, the trace had a line about being in the semaphore function.

Traceback (most recent call last):
  File "preprocess.py", line 81, in <module>
    dataset = dataset.map(partial(tokenize_fn, tokenizer), batched=False, num_proc=30, remove_columns=["text", "meta"])
  File "/home/developer/mambaforge/envs/FinGPT/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 592, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/developer/mambaforge/envs/FinGPT/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 557, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/developer/mambaforge/envs/FinGPT/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 3189, in map
    for rank, done, content in iflatmap_unordered(
  File "/home/developer/mambaforge/envs/FinGPT/lib/python3.8/site-packages/datasets/utils/py_utils.py", line 1394, in iflatmap_unordered
    [async_result.get(timeout=0.05) for async_result in async_results]
  File "/home/developer/mambaforge/envs/FinGPT/lib/python3.8/site-packages/datasets/utils/py_utils.py", line 1394, in <listcomp>
    [async_result.get(timeout=0.05) for async_result in async_results]
  File "/home/developer/mambaforge/envs/FinGPT/lib/python3.8/site-packages/multiprocess/pool.py", line 767, in get
    raise TimeoutError
multiprocess.context.TimeoutError

KeyboardInterrupt
  File "/home/developer/mambaforge/envs/FinGPT/lib/python3.8/site-packages/multiprocess/process.py", line 315, in _bootstrap
    self.run()
  File "/home/developer/mambaforge/envs/FinGPT/lib/python3.8/site-packages/multiprocess/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/developer/mambaforge/envs/FinGPT/lib/python3.8/site-packages/multiprocess/pool.py", line 114, in worker
    task = get()
  File "/home/developer/mambaforge/envs/FinGPT/lib/python3.8/site-packages/multiprocess/queues.py", line 358, in get
    with self._rlock:
  File "/home/developer/mambaforge/envs/FinGPT/lib/python3.8/site-packages/multiprocess/synchronize.py", line 101, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
Map (num_proc=16): 82%|███████████████████████████████████████████████████████████████████████████████████▍ | 7614/9305 [00:28<00:06, 262.96 examples/s]
Traceback (most recent call last):
  File "/home/developer/mambaforge/envs/FinGPT/lib/python3.8/site-packages/datasets/utils/py_utils.py", line 1380, in iflatmap_unordered
    yield queue.get(timeout=0.05)
  File "<string>", line 2, in get
  File "/home/developer/mambaforge/envs/FinGPT/lib/python3.8/site-packages/multiprocess/managers.py", line 835, in _callmethod
    kind, result = conn.recv()
  File "/home/developer/mambaforge/envs/FinGPT/lib/python3.8/site-packages/multiprocess/connection.py", line 253, in recv
    buf = self._recv_bytes()
  File "/home/developer/mambaforge/envs/FinGPT/lib/python3.8/site-packages/multiprocess/connection.py", line 417, in _recv_bytes
    buf = self._recv(4)
  File "/home/developer/mambaforge/envs/FinGPT/lib/python3.8/site-packages/multiprocess/connection.py", line 382, in _recv
    chunk = read(handle, remaining)
KeyboardInterrupt

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "preprocess.py", line 81, in <module>
    dataset = dataset.map(partial(tokenize_fn, tokenizer), batched=False, num_proc=16, remove_columns=["text", "meta"])
  File "/home/developer/mambaforge/envs/FinGPT/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 592, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/developer/mambaforge/envs/FinGPT/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 557, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/developer/mambaforge/envs/FinGPT/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 3189, in map
    for rank, done, content in iflatmap_unordered(
  File "/home/developer/mambaforge/envs/FinGPT/lib/python3.8/site-packages/datasets/utils/py_utils.py", line 1394, in iflatmap_unordered
    [async_result.get(timeout=0.05) for async_result in async_results]
  File "/home/developer/mambaforge/envs/FinGPT/lib/python3.8/site-packages/datasets/utils/py_utils.py", line 1394, in <listcomp>
    [async_result.get(timeout=0.05) for async_result in async_results]
  File "/home/developer/mambaforge/envs/FinGPT/lib/python3.8/site-packages/multiprocess/pool.py", line 767, in get
    raise TimeoutError
multiprocess.context.TimeoutError

1 Like

Hey @mariosasko @Pensive, was this problem ever solved?

I know what the problem is and how to overcome it, but it is NOT solved in dataset.map itself.

1 Like

So, how can I overcome it?
I just met the same problem with the latest library version. I’ve tried everything I could find, but nothing has worked :confused:

The problem lies in dataset.map’s reliance on a multiprocessing process pool. The standard process pool uses “fork” to create new processes, and depending on timing, locks sometimes get copied in a locked state when a process is forked, thus causing deadlocks. You would have to use “forkserver” instead to prevent this from happening.
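A minimal sketch of that idea. It assumes the multiprocess library (which datasets uses to drive map workers) honors the globally set start method; the dataset path and function are placeholders:

```python
import multiprocess
from datasets import load_dataset

def tokenize_fn(example):
    # Placeholder per-example processing.
    return example

if __name__ == "__main__":
    # "forkserver" (or "spawn") starts workers from a clean process,
    # so locks held in the parent are never inherited in a locked state.
    multiprocess.set_start_method("forkserver", force=True)

    dataset = load_dataset("json", data_files="data.jsonl", split="train")
    dataset = dataset.map(tokenize_fn, num_proc=8)
```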

I experienced a similar slowdown while processing a very large dataset (LAION-2B). The script started processing very fast but eventually hung at around 40%. I discovered the issue was related to having too many open files (the limit on my instance was 512) while running 256 parallel processes (probably with more than 2 files open per process). After scaling down to 128 processes, the script worked fine, whereas raising the open-file limit with ulimit didn’t work. I hope this is helpful.
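If you suspect the same limit, here is a small sketch (Unix-only, using the standard resource module) to inspect, and within the hard cap raise, the per-process open-file limit before picking num_proc:

```python
import resource

# Current soft/hard limits on open file descriptors for this process.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-file limit: soft={soft}, hard={hard}")

# Raise the soft limit up to the hard cap (going beyond needs root).
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
```

That said, raising the limit didn’t help in my case, so reducing num_proc may be the more reliable fix.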

1 Like

Same here: we have an audio dataset with 9 rows. When we call dataset.map(…) with 2 processes, the map function starts to execute but stalls immediately at the first line of code inside the function, even if it is just a print statement.

We use the latest transformers version, v4.42.3, and datasets version 2.20.0.

@mariosasko any ideas?

I had the same problem when mapping an audio dataset with num_proc>1. I traced the function and found it gets stuck as soon as a big audio array (“big” meaning even a 1-second array) is accessed (even just printed or assigned to a variable inside the function), whereas accessing or modifying another column that doesn’t contain a big audio array (like the transcript text) works properly in multiprocess mode.
After days of struggling, I accidentally found that fetching one sample of the dataset with next(iter(dataset)) just before dataset.map() makes it work properly with multiple processes, as shown in the sketch below.
I’d also recommend running your Python code with torchrun yourcode.py; python yourcode.py runs the code too, but some worker processes print SIGTERM logs.
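A sketch of that workaround; the audiofolder loading and column names are hypothetical, and the key line is the next(iter(dataset)) before map:

```python
from datasets import load_dataset

# Hypothetical audio dataset loaded from a local folder.
dataset = load_dataset("audiofolder", data_dir="audio/", split="train")

# Workaround from this thread: touch one decoded example in the
# parent process before the map worker processes are created.
_ = next(iter(dataset))

def add_duration(example):
    # Accessing the decoded audio array is what used to stall with num_proc>1.
    audio = example["audio"]
    example["duration"] = len(audio["array"]) / audio["sampling_rate"]
    return example

dataset = dataset.map(add_duration, num_proc=2)
```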

1 Like

Super, many thanks @Hannan!