Datasets filter/map hangs when using num_proc (multiprocessing)

Hi all, I am trying to use filter/map on a dataset in a script, and I am finding that the script hangs after the filter/map operations complete (the tqdm progress bar always reaches 100%).

I am calling filter like this:

    my_dataset = my_dataset.filter(lambda example: example['image_id'] not in some_set, num_proc=32)
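For context, here is a minimal self-contained sketch of the setup (the toy dataset and ids below are placeholders; my real script works on a much larger image dataset):

    # Placeholder repro: build a small dataset with an image_id column
    # and filter out the ids that appear in a Python set.
    from datasets import Dataset

    my_dataset = Dataset.from_dict({"image_id": list(range(100_000))})
    some_set = set(range(0, 100_000, 2))  # ids to drop

    # The progress bar reaches 100%, but the script hangs afterwards.
    my_dataset = my_dataset.filter(
        lambda example: example["image_id"] not in some_set,
        num_proc=32,
    )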

I tracked down where the code hangs using faulthandler.dump_traceback(); its output is:

Thread 0x00002b0014ec1700 (most recent call first):
  File ".../lib/python3.8/threading.py", line 306 in wait
  File ".../lib/python3.8/threading.py", line 558 in wait
  File "~/virtualenv/lib/python3.8/site-packages/tqdm/_monitor.py", line 60 in run
  File ".../lib/python3.8/threading.py", line 932 in _bootstrap_inner
  File ".../lib/python3.8/threading.py", line 890 in _bootstrap

Thread 0x00002b0af28f1740 (most recent call first):
  File "~/virtualenv/lib/python3.8/site-packages/multiprocess/popen_fork.py", line 27 in poll
  File "~/virtualenv/lib/python3.8/site-packages/multiprocess/popen_fork.py", line 47 in wait
  File "~/virtualenv/lib/python3.8/site-packages/multiprocess/process.py", line 149 in join
  File "~/virtualenv/lib/python3.8/site-packages/multiprocess/pool.py", line 729 in _terminate_pool
  File "~/virtualenv/lib/python3.8/site-packages/multiprocess/util.py", line 224 in __call__
  File "~/virtualenv/lib/python3.8/site-packages/multiprocess/pool.py", line 654 in terminate
  File "~/virtualenv/lib/python3.8/site-packages/multiprocess/pool.py", line 736 in __exit__
  File "~/virtualenv/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 3105 in map
  File "~/virtualenv/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 528 in wrapper
  File "~/virtualenv/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 563 in wrapper
  File "~/virtualenv/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 3531 in filter
  File "~/virtualenv/lib/python3.8/site-packages/datasets/fingerprint.py", line 511 in wrapper
  File "~/virtualenv/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 528 in wrapper
  File "my_script.py", line 145 in encode -> line where filter is called

(I believe thread 0x00002b0014ec1700 is tqdm's monitor thread. When I disable tqdm, I only get a stack trace from a single thread, analogous to thread 0x00002b0af28f1740 above.)
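In case anyone wants to reproduce the dump, here is a sketch of one way to obtain it from a hung process (my script does something equivalent, though the exact wiring may differ):

    import faulthandler
    import signal

    # Dump every thread's stack to stderr when the process receives
    # SIGUSR1, e.g. via `kill -USR1 <pid>` once the script hangs.
    faulthandler.register(signal.SIGUSR1, all_threads=True)

    # Alternative: dump unconditionally if the process is still running
    # after 10 minutes, without killing it.
    # faulthandler.dump_traceback_later(timeout=600, exit=False)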

So, based on the stack trace from the second thread 0x00002b0af28f1740, the script stops at line 3105 of datasets/arrow_dataset.py (at commit fd893098627230cc734f6009ad04cf885c979ac4 of huggingface/datasets), i.e. when the Pool object is destroyed by the context manager upon going out of scope.
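To illustrate the mechanism, here is a schematic sketch (not the actual datasets code) of the pattern the trace points at: the work runs inside a Pool context manager, and leaving the with-block is what triggers the terminate/join chain:

    import multiprocess  # the dill-based multiprocessing fork that datasets uses

    def square(x):
        return x * x

    with multiprocess.Pool(4) as pool:
        results = pool.map(square, range(10))  # all tasks complete here
    # Leaving the with-block calls pool.terminate(), which runs
    # _terminate_pool() and join()s the worker processes -- the join()
    # in popen_fork.py is where my script blocks, even though every
    # task has already finished.
    print(results)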

I would appreciate any help understanding why this occurs and how to fix it. I am using Python 3.8, datasets 2.11, and pyarrow 10.0.1.

Thanks in advance!