Dataset map function takes forever to run!

I’m trying to pre-process my dataset for the Donut model, and although the mapping does complete, it runs for about 100 minutes -.-. I ran this with num_proc=2; I’m not sure whether setting it to all CPU cores would make much of a difference.

Any idea of how to fix this?

Hi! What does processor.tokenizer.is_fast return? If the returned value is True, it’s better not to use the num_proc parameter in map, so that you benefit from the tokenizer’s own parallelism. The “fast” tokenizers are written in Rust and process data in parallel by default, but this does not work well in multi-process Python code, so we disable the “fast” tokenizers’ parallelism when num_proc>1 to avoid deadlocks.

Also, setting the return_tensors parameter to "np" should make the transform faster, as PyArrow natively supports 1-D NumPy arrays, which avoids the torch-to-NumPy conversion step.
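
As a rough illustration of the two suggestions above, here is a minimal sketch. The helper names (map_kwargs_for, tokenize_batch, preprocess), the "text" column, and the max_length value are assumptions made for the example, not part of the original poster's code; the Donut processor would additionally handle images.

```python
def map_kwargs_for(tokenizer, requested_num_proc=None):
    """Pick Dataset.map keyword arguments (hypothetical helper).

    With a "fast" (Rust) tokenizer we leave num_proc unset so the
    tokenizer's own thread-level parallelism stays enabled; with a
    slow tokenizer we fall back to the requested process count.
    """
    use_procs = None if getattr(tokenizer, "is_fast", False) else requested_num_proc
    return {"batched": True, "num_proc": use_procs}


def tokenize_batch(batch, tokenizer, max_length=128):
    """Tokenize a batch of rows and return NumPy arrays.

    Assumes the dataset has a "text" column. Padding to a fixed length
    is needed so that return_tensors="np" can build rectangular arrays.
    """
    encoded = tokenizer(
        batch["text"],
        padding="max_length",
        truncation=True,
        max_length=max_length,
        return_tensors="np",
    )
    return dict(encoded)


def preprocess(dataset, tokenizer, requested_num_proc=None):
    """Run the mapping on a datasets.Dataset with the kwargs chosen above."""
    return dataset.map(
        tokenize_batch,
        fn_kwargs={"tokenizer": tokenizer},
        **map_kwargs_for(tokenizer, requested_num_proc),
    )
```

Calling preprocess(dataset, processor.tokenizer, requested_num_proc=2) would then only use worker processes when the tokenizer is not a fast one. Batched mapping is used here because fast tokenizers parallelize across the items of a batch.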

Thanks @mariosasko. Yes, the tokenizer is fast. The reason I ran this with num_proc>1 is that without it the code ate up all my RAM (32 GB) and the kernel kept dying.

Was this problem ever solved? I played around with different settings mentioned in this thread and it changes nothing.

dataset.map stalls at around 80% of the dataset, then gets steadily slower until it pretty much stops at 97% complete.

I tried to cut a smaller piece of the dataset to try, but this behavior persists on a smaller set too.

At the end it seems only one process is running and the rest are idle.

After 27 hours I just Ctrl-C’d out of it.

@Pensive Even with num_proc=None (or num_proc=1)? If that’s the case, can you interrupt the map (with num_proc=1) and paste the printed error stack trace?

With None or 1 it is slow from the get-go. If I interrupt it, it is somewhere in multiprocess.context.TimeoutError.

It is not even consistent.

If I set num_proc=30, a lot of the time it stalls right away and is as slow as with None or 1.

Sometimes it starts quickly, runs to 80%, slows down, and then becomes extremely slow and never finishes.

When I interrupted at that point, the trace had a line about being in the semaphore function.

Traceback (most recent call last):
File “preprocess.py”, line 81, in <module>
dataset = dataset.map(partial(tokenize_fn, tokenizer), batched=False, num_proc=30, remove_columns=[“text”, “meta”])
File “/home/developer/mambaforge/envs/FinGPT/lib/python3.8/site-packages/datasets/arrow_dataset.py”, line 592, in wrapper
out: Union[“Dataset”, “DatasetDict”] = func(self, *args, **kwargs)
File “/home/developer/mambaforge/envs/FinGPT/lib/python3.8/site-packages/datasets/arrow_dataset.py”, line 557, in wrapper
out: Union[“Dataset”, “DatasetDict”] = func(self, *args, **kwargs)
File “/home/developer/mambaforge/envs/FinGPT/lib/python3.8/site-packages/datasets/arrow_dataset.py”, line 3189, in map
for rank, done, content in iflatmap_unordered(
File “/home/developer/mambaforge/envs/FinGPT/lib/python3.8/site-packages/datasets/utils/py_utils.py”, line 1394, in iflatmap_unordered
[async_result.get(timeout=0.05) for async_result in async_results]
File “/home/developer/mambaforge/envs/FinGPT/lib/python3.8/site-packages/datasets/utils/py_utils.py”, line 1394, in <listcomp>
[async_result.get(timeout=0.05) for async_result in async_results]
File “/home/developer/mambaforge/envs/FinGPT/lib/python3.8/site-packages/multiprocess/pool.py”, line 767, in get
raise TimeoutError
multiprocess.context.TimeoutError

KeyboardInterrupt
File “/home/developer/mambaforge/envs/FinGPT/lib/python3.8/site-packages/multiprocess/process.py”, line 315, in _bootstrap
self.run()
File “/home/developer/mambaforge/envs/FinGPT/lib/python3.8/site-packages/multiprocess/process.py”, line 108, in run
self._target(*self._args, **self._kwargs)
File “/home/developer/mambaforge/envs/FinGPT/lib/python3.8/site-packages/multiprocess/pool.py”, line 114, in worker
task = get()
File “/home/developer/mambaforge/envs/FinGPT/lib/python3.8/site-packages/multiprocess/queues.py”, line 358, in get
with self._rlock:
File “/home/developer/mambaforge/envs/FinGPT/lib/python3.8/site-packages/multiprocess/synchronize.py”, line 101, in __enter__
return self._semlock.__enter__()
KeyboardInterrupt
Map (num_proc=16): 82%|███████████████████████████████████████████████████████████████████████████████████▍ | 7614/9305 [00:28<00:06, 262.96 examples/s]
Traceback (most recent call last):
File “/home/developer/mambaforge/envs/FinGPT/lib/python3.8/site-packages/datasets/utils/py_utils.py”, line 1380, in iflatmap_unordered
yield queue.get(timeout=0.05)
File “<string>”, line 2, in get
File “/home/developer/mambaforge/envs/FinGPT/lib/python3.8/site-packages/multiprocess/managers.py”, line 835, in _callmethod
kind, result = conn.recv()
File “/home/developer/mambaforge/envs/FinGPT/lib/python3.8/site-packages/multiprocess/connection.py”, line 253, in recv
buf = self._recv_bytes()
File “/home/developer/mambaforge/envs/FinGPT/lib/python3.8/site-packages/multiprocess/connection.py”, line 417, in _recv_bytes
buf = self._recv(4)
File “/home/developer/mambaforge/envs/FinGPT/lib/python3.8/site-packages/multiprocess/connection.py”, line 382, in _recv
chunk = read(handle, remaining)
KeyboardInterrupt

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File “preprocess.py”, line 81, in <module>
dataset = dataset.map(partial(tokenize_fn, tokenizer), batched=False, num_proc=16, remove_columns=[“text”, “meta”])
File “/home/developer/mambaforge/envs/FinGPT/lib/python3.8/site-packages/datasets/arrow_dataset.py”, line 592, in wrapper
out: Union[“Dataset”, “DatasetDict”] = func(self, *args, **kwargs)
File “/home/developer/mambaforge/envs/FinGPT/lib/python3.8/site-packages/datasets/arrow_dataset.py”, line 557, in wrapper
out: Union[“Dataset”, “DatasetDict”] = func(self, *args, **kwargs)
File “/home/developer/mambaforge/envs/FinGPT/lib/python3.8/site-packages/datasets/arrow_dataset.py”, line 3189, in map
for rank, done, content in iflatmap_unordered(
File “/home/developer/mambaforge/envs/FinGPT/lib/python3.8/site-packages/datasets/utils/py_utils.py”, line 1394, in iflatmap_unordered
[async_result.get(timeout=0.05) for async_result in async_results]
File “/home/developer/mambaforge/envs/FinGPT/lib/python3.8/site-packages/datasets/utils/py_utils.py”, line 1394, in <listcomp>
[async_result.get(timeout=0.05) for async_result in async_results]
File “/home/developer/mambaforge/envs/FinGPT/lib/python3.8/site-packages/multiprocess/pool.py”, line 767, in get
raise TimeoutError
multiprocess.context.TimeoutError

Hey @mariosasko @Pensive, has this problem been solved?

I know what the problem is and how to work around it, but it is NOT solved in dataset.map itself.

So, how can I overcome it?
I just ran into the same problem with the latest library version. I’ve tried everything I could find, but nothing has worked.

The problem lies in dataset.map’s dependence on a multiprocessing process pool. On Linux, the standard process pool uses “fork” to create new processes. Depending on timing, when processes are created with “fork”, locks sometimes get copied in a locked state, which causes deadlocks. You would have to use “forkserver” instead to prevent this from happening.
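
To make the “forkserver” idea concrete, here is a minimal sketch using only Python’s standard multiprocessing module, outside of dataset.map. The function names (pick_start_method, count_tokens, parallel_map) and the whitespace “tokenizer” are stand-ins invented for the example; this is not how the datasets library is implemented internally.

```python
import multiprocessing


def pick_start_method():
    """Prefer "forkserver"; fall back to "spawn" where it is unavailable."""
    methods = multiprocessing.get_all_start_methods()
    return "forkserver" if "forkserver" in methods else "spawn"


def count_tokens(text):
    """Stand-in for a real tokenize function: count whitespace-separated tokens."""
    return len(text.split())


def parallel_map(fn, items, num_workers=2, timeout=60):
    """Apply fn to items in a pool whose workers are not plain forks of this process.

    Workers come from a clean server process, so they do not inherit locks
    that happened to be held in the parent at fork time. The timeout makes
    a stuck pool raise an error instead of hanging forever.
    """
    ctx = multiprocessing.get_context(pick_start_method())
    with ctx.Pool(processes=num_workers) as pool:
        return pool.map_async(fn, items).get(timeout=timeout)


if __name__ == "__main__":
    # The guard is required: with "forkserver" or "spawn", worker processes
    # re-import this module, and fn must be defined at module top level.
    texts = ["a b c", "hello world", ""]
    print(parallel_map(count_tokens, texts))
```

Whether dataset.map itself picks up a changed start method is something I have not verified. The datasets library builds its pool from the multiprocess package (as the traceback above shows), so calling multiprocess.set_start_method("forkserver") before map is one thing to try, but treat that as an assumption rather than a confirmed fix.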