Issue of multiprocessing in map function

Prajwal-143 · March 15, 2024, 6:21am

tokenizer = Wav2Vec2CTCTokenizer(r"D:\Work\Speech to text\Dataset\tamil_voice\Processed csv\vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|")
feature_extractor = Wav2Vec2FeatureExtractor(feature_size=1, sampling_rate=16000, padding_value=0.0, do_normalize=True, return_attention_mask=True)
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

def prepare_dataset(batch):
    audio = batch["path"]

    # batched output is "un-batched"
    batch["input_values"] = processor(audio["array"], sampling_rate=audio["sampling_rate"]).input_values[0]
    batch["input_length"] = len(batch["input_values"])

    with processor.as_target_processor():
        batch["labels"] = processor(batch["sentence"]).input_ids
    return batch

common_voice_train = common_voice_train.map(prepare_dataset, remove_columns=common_voice_train.column_names, num_proc=4)

When I call the function, I get the error below. It occurs only when I set num_proc > 1.

How can I solve this error?

Error:

RemoteTraceback Traceback (most recent call last)
RemoteTraceback:
"""
Traceback (most recent call last):
  File "C:\Users\Lenovo\anaconda3\envs\torch\lib\site-packages\multiprocess\pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "C:\Users\Lenovo\anaconda3\envs\torch\lib\site-packages\datasets\utils\py_utils.py", line 623, in _write_generator_to_queue
    for i, result in enumerate(func(**kwargs)):
  File "C:\Users\Lenovo\anaconda3\envs\torch\lib\site-packages\datasets\arrow_dataset.py", line 3458, in _map_single
    example = apply_function_on_filtered_inputs(example, i, offset=offset)
  File "C:\Users\Lenovo\anaconda3\envs\torch\lib\site-packages\datasets\arrow_dataset.py", line 3361, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
  File "C:\Users\Lenovo\AppData\Local\Temp\ipykernel_6536\430615538.py", line 5, in prepare_dataset
NameError: name 'processor' is not defined
"""

The above exception was the direct cause of the following exception:

NameError Traceback (most recent call last)
Cell In[23], line 1
----> 1 common_voice_train = common_voice_train.map(prepare_dataset, remove_columns=common_voice_train.column_names,num_proc= 4)
2 common_voice_test = common_voice_test.map(prepare_dataset, remove_columns=common_voice_test.column_names)

File ~\anaconda3\envs\torch\lib\site-packages\datasets\arrow_dataset.py:593, in transmit_tasks.<locals>.wrapper(*args, **kwargs)
    591 self: "Dataset" = kwargs.pop("self")
    592 # apply actual function
--> 593 out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
    594 datasets: List["Dataset"] = list(out.values()) if isinstance(out, dict) else [out]
    595 for dataset in datasets:
    596 # Remove task templates if a column mapping of the template is no longer valid

File ~\anaconda3\envs\torch\lib\site-packages\datasets\arrow_dataset.py:558, in transmit_format.<locals>.wrapper(*args, **kwargs)
    551 self_format = {
    552 "type": self._format_type,
    553 "format_kwargs": self._format_kwargs,
    554 "columns": self._format_columns,
    555 "output_all_columns": self._output_all_columns,
    556 }
    557 # apply actual function
--> 558 out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
    559 datasets: List["Dataset"] = list(out.values()) if isinstance(out, dict) else [out]
    560 # re-apply format to the output

File ~\anaconda3\envs\torch\lib\site-packages\datasets\arrow_dataset.py:3197, in Dataset.map(self, function, with_indices, with_rank, input_columns, batched, batch_size, drop_last_batch, remove_columns, keep_in_memory, load_from_cache_file, cache_file_name, writer_batch_size, features, disable_nullable, fn_kwargs, num_proc, suffix_template, new_fingerprint, desc)
   3191 logger.info(f"Spawning {num_proc} processes")
   3192 with hf_tqdm(
   3193 unit=" examples",
   3194 total=pbar_total,
   3195 desc=(desc or "Map") + f" (num_proc={num_proc})",
   3196 ) as pbar:
-> 3197 for rank, done, content in iflatmap_unordered(
   3198 pool, Dataset._map_single, kwargs_iterable=kwargs_per_job
   3199 ):
   3200 if done:
   3201 shards_done += 1

File ~\anaconda3\envs\torch\lib\site-packages\datasets\utils\py_utils.py:663, in iflatmap_unordered(pool, func, kwargs_iterable)
    660 finally:
    661 if not pool_changed:
    662 # we get the result in case there's an error to raise
--> 663 [async_result.get(timeout=0.05) for async_result in async_results]

File ~\anaconda3\envs\torch\lib\site-packages\datasets\utils\py_utils.py:663, in <listcomp>(.0)
    660 finally:
    661 if not pool_changed:
    662 # we get the result in case there's an error to raise
--> 663 [async_result.get(timeout=0.05) for async_result in async_results]

File ~\anaconda3\envs\torch\lib\site-packages\multiprocess\pool.py:774, in ApplyResult.get(self, timeout)
    772 return self._value
    773 else:
--> 774 raise self._value

NameError: name 'processor' is not defined

mariosasko · March 15, 2024, 4:56pm

This issue looks similar to "I'm trying to fine tune the openai/whisper model from huggingface using jupyter notebook and i keep getting this error" (Issue #6276 in huggingface/datasets on GitHub). Can you try my suggested fix?

Prajwal-143 · March 18, 2024, 4:40am

Yeah, I tried but it didn’t work.
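
For readers hitting the same NameError: the traceback shows the worker process running prepare_dataset without a global named processor. One workaround sometimes used for this pattern (not confirmed in this thread, and not necessarily the fix referenced above) is to make the processor an explicit argument of the function instead of a notebook global. The sketch below assumes this; StandInProcessor is a hypothetical placeholder so the example runs without model files, and the label step is left out for brevity.

```python
from typing import Any, Dict, List


class StandInProcessor:
    """Hypothetical stand-in for Wav2Vec2Processor (assumption, for illustration only)."""

    def __call__(self, array: List[float], sampling_rate: int) -> Dict[str, Any]:
        # Mimic the batched return shape: one list of values per input example.
        return {"input_values": [list(array)]}


def prepare_dataset(batch: Dict[str, Any], processor: Any) -> Dict[str, Any]:
    # `processor` is a parameter here, so the function does not rely on a
    # global that may exist only in the notebook's main process.
    audio = batch["path"]

    # Batched output is "un-batched" by taking the first element.
    features = processor(audio["array"], sampling_rate=audio["sampling_rate"])
    batch["input_values"] = features["input_values"][0]
    batch["input_length"] = len(batch["input_values"])
    return batch


# A single made-up example row, shaped like the one in the question.
example = {"path": {"array": [0.1, 0.2, 0.3], "sampling_rate": 16000}, "sentence": "hello"}
result = prepare_dataset(dict(example), processor=StandInProcessor())
print(result["input_length"])
```

With 🤗 Datasets, the extra argument would be supplied through the fn_kwargs parameter of Dataset.map (it appears in the map signature in the traceback above), for example common_voice_train.map(prepare_dataset, fn_kwargs={"processor": processor}, remove_columns=common_voice_train.column_names, num_proc=4). Whether this resolves the error in a given Windows and Jupyter setup is an assumption to verify.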
