Map fails for more than 4 processes

jayjha · October 14, 2023, 6:10pm

Hi, I want to apply a company internal transformer based NLP model on the rows in a hf dataset. It works fine when i use map with num_proc <=4. For values grater than it errors out with the following stacktrace:

Internal error

Traceback (most recent call last):
File “/flow/metaflow/metaflow/cli.py”, line 1172, in main
start(auto_envvar_prefix=“METAFLOW”, obj=state)
File “/flow/metaflow/metaflow/_vendor/click/core.py”, line 829, in call
return self.main(args, kwargs)
File “/flow/metaflow/metaflow/_vendor/click/core.py”, line 782, in main
rv = self.invoke(ctx)
File “/flow/metaflow/metaflow/_vendor/click/core.py”, line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File “/flow/metaflow/metaflow/_vendor/click/core.py”, line 1066, in invoke
return ctx.invoke(self.callback, ctx.params)
File “/flow/metaflow/metaflow/_vendor/click/core.py”, line 610, in invoke
return callback(args, kwargs)
File “/flow/metaflow/metaflow/_vendor/click/decorators.py”, line 21, in new_func
return f(get_current_context(), args, kwargs)
File “/flow/metaflow/metaflow/cli.py”, line 581, in step
task.run_step(
File “/flow/metaflow/metaflow/task.py”, line 583, in run_step
self._exec_step_function(step_func)
File “/flow/metaflow/metaflow/task.py”, line 57, in _exec_step_function
step_function()
File “flow.py”, line 196, in hit_model
result_dataset = dataset.map(process_row,num_proc=16, batch_size=30)
File “/usr/local/lib/python3.8/site-packages/datasets/arrow_dataset.py”, line 592, in wrapper
out: Union[“Dataset”, “DatasetDict”] = func(self, args, kwargs)
File “/usr/local/lib/python3.8/site-packages/datasets/arrow_dataset.py”, line 557, in wrapper
out: Union[“Dataset”, “DatasetDict”] = func(self, args, kwargs)
File “/usr/local/lib/python3.8/site-packages/datasets/arrow_dataset.py”, line 3189, in map
for rank, done, content in iflatmap_unordered(
File “/usr/local/lib/python3.8/site-packages/datasets/utils/py_utils.py”, line 1387, in iflatmap_unordered
raise RuntimeError(
RuntimeError: One of the subprocesses has abruptly died during map operation.To debug the error, disable multiprocessing.
Below is the code which is a step in a metaflow pipeline. we have 32 cpus and 22gigs of memory.

        dataset = load_dataset('/root/dir/',num_proc=32)

        MODEL = Model()

        def process_row(row):
            try:
                result = MODEL.run(row)
            except:
                result = {}
            return result

        result_dataset =  dataset.map(process_row,num_proc=16, batch_size=30)

mariosasko · October 17, 2023, 10:50pm

It’s impossible to recover the exception from a failed multiprocess process, so set num_proc=None and run the map on a smaller subset to get an error causing one (or more) of the processes to die.

concurrent.futures.ProcessPoolExecutor can recover such errors but is limited in terms of what it can serialize, so we cannot make the switch.

jayjha · October 27, 2023, 4:49pm

Hi,
The code runs fine for num_proc=1 or if I spin up really big machine and not use all the processes.
I think my problem is similar to this issue discussed where the memory keeps on increasing. Was this ever resolved? is there a new suggested solution other than this?

rokayabn · December 12, 2023, 10:21am

hi did you find any solution ?

Dakhoo · January 22, 2024, 1:04pm

I have the same problem

boringtaskai · May 10, 2024, 4:21pm

I have the same issue

blackhole33 · January 23, 2025, 6:48am

change num_proc value, that is, reduce !

ddnick07 · April 9, 2025, 5:48am

i had facing another error after changed dataset_num_proc = 2 to 1 and the error message was,
BackendCompilerFailed: backend=‘inductor’ raised:
PermissionError: [WinError 5] Access is denied: ‘C:\Users\Bespoke\AppData\Local\Temp\torchinductor_bespoke\triton\0\tmp.8bdc634d-2535-4980-b1ee-7f7a15d6bbca’ → ‘C:\Users\Bespoke\AppData\Local\Temp\torchinductor_bespoke\triton\0\C42IDn1tsawzf3xVScRF1c7d9fi0Dt73-uD1a5lcLck’

Set TORCH_LOGS=“+dynamo” and TORCHDYNAMO_VERBOSE=1 for more information

Topic		Replies	Views
Map with num_proc over 1 fails 🤗Datasets	1	167	April 24, 2024
Issue of multiprocessing in map function 🤗Datasets	2	334	March 18, 2024
Dataset map function takes forever to run! 🤗Datasets	16	6714	August 15, 2024
Map multiprocessing Issue 🤗Datasets	31	17664	July 16, 2024
Using num_proc>1 in Dataset.map hangs 🤗Datasets	8	4011	August 19, 2024

Map fails for more than 4 processes

Related topics