Hi all, I am using filter/map on a dataset in a script, and the script hangs after the filter/map operations complete (the tqdm progress bar always reaches 100%). I am calling filter like this:
my_dataset = my_dataset.filter(lambda example: example['image_id'] not in some_set, num_proc=32)
I tracked down where the code is hanging using faulthandler.dump_traceback(), whose output is:
Thread 0x00002b0014ec1700 (most recent call first):
  File ".../lib/python3.8/threading.py", line 306 in wait
  File ".../lib/python3.8/threading.py", line 558 in wait
  File "~/virtualenv/lib/python3.8/site-packages/tqdm/_monitor.py", line 60 in run
  File ".../lib/python3.8/threading.py", line 932 in _bootstrap_inner
  File ".../lib/python3.8/threading.py", line 890 in _bootstrap
Thread 0x00002b0af28f1740 (most recent call first):
  File "~/virtualenv/lib/python3.8/site-packages/multiprocess/popen_fork.py", line 27 in poll
  File "~/virtualenv/lib/python3.8/site-packages/multiprocess/popen_fork.py", line 47 in wait
  File "~/virtualenv/lib/python3.8/site-packages/multiprocess/process.py", line 149 in join
  File "~/virtualenv/lib/python3.8/site-packages/multiprocess/pool.py", line 729 in _terminate_pool
  File "~/virtualenv/lib/python3.8/site-packages/multiprocess/util.py", line 224 in __call__
  File "~/virtualenv/lib/python3.8/site-packages/multiprocess/pool.py", line 654 in terminate
  File "~/virtualenv/lib/python3.8/site-packages/multiprocess/pool.py", line 736 in __exit__
  File "~/virtualenv/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 3105 in map
  File "~/virtualenv/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 528 in wrapper
  File "~/virtualenv/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 563 in wrapper
  File "~/virtualenv/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 3531 in filter
  File "~/virtualenv/lib/python3.8/site-packages/datasets/fingerprint.py", line 511 in wrapper
  File "~/virtualenv/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 528 in wrapper
  File "my_script.py", line 145 in encode  -> line where filter is called
(I believe thread 0x00002b0014ec1700 is tqdm's monitor thread. When I disable tqdm, I only get the stack trace from a single thread, analogous to thread 0x00002b0af28f1740 shown above.)
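In case it helps anyone reproduce the diagnosis: here is a minimal sketch of how I capture these stack traces with the standard-library faulthandler module (the exact signal and timeout values are just examples, not what my script necessarily uses):

```python
import faulthandler
import signal

# Dump a traceback of every thread when the process receives SIGUSR1,
# so a hung script can be inspected from another terminal with
# `kill -USR1 <pid>`.
faulthandler.register(signal.SIGUSR1, all_threads=True)

# Alternatively, dump all thread stacks automatically after a timeout,
# repeating every `timeout` seconds until cancelled.
faulthandler.dump_traceback_later(timeout=600, repeat=True)

# ... run the filter/map pipeline here ...

faulthandler.cancel_dump_traceback_later()
```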
So based on the stack trace of the second thread (0x00002b0af28f1740), the script hangs in datasets/arrow_dataset.py (at commit fd893098627230cc734f6009ad04cf885c979ac4 of huggingface/datasets), i.e. when the pool object is destroyed by the context manager as it goes out of scope: Pool.__exit__ calls terminate(), which then blocks while joining a worker process.
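For reference, the teardown path shown in the traceback corresponds to what a plain stdlib pool does on context-manager exit (a simplified sketch, not datasets' actual code, which uses the multiprocess fork of multiprocessing):

```python
from multiprocessing import Pool

# On leaving the `with` block, Pool.__exit__ calls terminate(), which
# joins every worker process. A hang at that point means some worker
# (or a resource it holds) never exits.
with Pool(processes=4) as pool:
    results = pool.map(str.upper, ["a", "b"])

print(results)  # -> ['A', 'B']
```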
I would appreciate any help on why this occurs and how to fix it. I am using Python 3.8, datasets 2.11, and pyarrow 10.0.1.
Thanks in advance!