# Map multiprocessing Issue

I’m getting this error when I try to map-tokenize a large custom data set. It looks like a multiprocessing issue: running it with one process or with a smaller set seems to work. I’ve tried different batch_size values and still get the same error. I also tried sharding it into smaller data sets, but that didn’t help. Thoughts? Thanks!

dataset['test'].map(lambda e: tokenizer(e['texts']), batched = True, batch_size = 1000, num_proc = 8)

error Traceback (most recent call last)
in
----> 1 dataset['test'].map(lambda e: tokenizer(e['texts']), batched = True, batch_size = 1000, num_proc = 8)

/home/venv/lib/python3.6/site-packages/datasets/arrow_dataset.py in map(self, function, with_indices, input_columns, batched, batch_size, drop_last_batch, remove_columns, keep_in_memory, load_from_cache_file, cache_file_name, writer_batch_size, features, disable_nullable, fn_kwargs, num_proc, suffix_template, new_fingerprint)
1316 logger.info("Spawning {} processes".format(num_proc))
1317 results = [pool.apply_async(self.__class__._map_single, kwds=kwds) for kwds in kwds_per_shard]
→ 1318 transformed_shards = [r.get() for r in results]
1319 logger.info("Concatenating {} shards from multiprocessing".format(num_proc))
1320 result = concatenate_datasets(transformed_shards)

/home/venv/lib/python3.6/site-packages/datasets/arrow_dataset.py in <listcomp>(.0)
1316 logger.info("Spawning {} processes".format(num_proc))
1317 results = [pool.apply_async(self.__class__._map_single, kwds=kwds) for kwds in kwds_per_shard]
→ 1318 transformed_shards = [r.get() for r in results]
1319 logger.info("Concatenating {} shards from multiprocessing".format(num_proc))
1320 result = concatenate_datasets(transformed_shards)

/home/venv/lib/python3.6/site-packages/multiprocess/pool.py in get(self, timeout)
642 return self._value
643 else:
→ 644 raise self._value
645
646 def _set(self, i, obj):

422 break
423 try:
425 except Exception as e:

/home/venv/lib/python3.6/site-packages/multiprocess/connection.py in send(self, obj)
207 self._check_closed()
208 self._check_writable()
→ 209 self._send_bytes(_ForkingPickler.dumps(obj))
210
211 def recv_bytes(self, maxlength=None):

/home/venv/lib/python3.6/site-packages/multiprocess/connection.py in _send_bytes(self, buf)
394 n = len(buf)
395 # For wire compatibility with 3.2 and lower
→ 396 header = struct.pack("!i", n)
397 if n > 16384:
398 # The payload is large so Nagle's algorithm won't be triggered

error: 'i' format requires -2147483648 <= number <= 2147483647
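
A note on the last line of this traceback: `_send_bytes` packs the payload length into a signed 32-bit header with `struct.pack("!i", n)`, so on this Python version a pickled object of 2 GiB or more cannot be sent between processes. The sketch below reproduces only that limit. The helper names `pack_length_header` and `fits_in_header` are made up for illustration and are not part of `multiprocess`.

```python
import struct

INT32_MAX = 2**31 - 1  # largest length a signed 32-bit header can carry


def pack_length_header(n: int) -> bytes:
    """Pack a payload length the same way the traceback's line 396 does."""
    return struct.pack("!i", n)


def fits_in_header(n: int) -> bool:
    """Return True if a payload of n bytes can be announced in the header."""
    try:
        pack_length_header(n)
        return True
    except struct.error:
        return False


print(fits_in_header(16384))          # a small payload
print(fits_in_header(INT32_MAX))      # exactly at the limit
print(fits_in_header(INT32_MAX + 1))  # one byte past the limit
```

If this reading is right, whatever was pickled for a worker was larger than that limit, which would fit the observation that the error disappears with a smaller data set or with a single process.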

Hi there, I got a (maybe) similar issue caused by the multiprocessing in map. Instead of opening a new thread, I thought I would use this one. Note that the error occurs only if I specify num_proc > 1, i.e. use multi-processing:

Code:

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)

datasets = datasets.map(
    batched=True,
    batch_size=1000,
    num_proc=2,  # psutil.cpu_count()
    remove_columns=['text'],
)

datasets


Error:

Token indices sequence length is longer than the specified maximum sequence length for this model (8395 > 512). Running this sequence through the model will result in indexing errors
---------------------------------------------------------------------------
RemoteTraceback                           Traceback (most recent call last)
RemoteTraceback:
"""
Traceback (most recent call last):
File "c:\Users\s_scho53\Desktop\L09_Desktop\_FiLMo\.venv\lib\site-packages\multiprocess\pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "c:\Users\s_scho53\Desktop\L09_Desktop\_FiLMo\.venv\lib\site-packages\datasets\arrow_dataset.py", line 203, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "c:\Users\s_scho53\Desktop\L09_Desktop\_FiLMo\.venv\lib\site-packages\datasets\fingerprint.py", line 337, in wrapper
out = func(self, *args, **kwargs)
File "c:\Users\s_scho53\Desktop\L09_Desktop\_FiLMo\.venv\lib\site-packages\datasets\arrow_dataset.py", line 1695, in _map_single
batch = apply_function_on_filtered_inputs(
File "c:\Users\s_scho53\Desktop\L09_Desktop\_FiLMo\.venv\lib\site-packages\datasets\arrow_dataset.py", line 1608, in apply_function_on_filtered_inputs
function(*fn_args, effective_indices, **fn_kwargs) if with_indices else function(*fn_args, **fn_kwargs)
File "<ipython-input-18-25a1ecec1896>", line 9, in <lambda>
NameError: name 'tokenizer' is not defined
"""

The above exception was the direct cause of the following exception:

NameError                                 Traceback (most recent call last)
<ipython-input-18-25a1ecec1896> in <module>
6 tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
7
----> 8 datasets = datasets.map(
10     batched=True,

c:\Users\s_scho53\Desktop\L09_Desktop\_FiLMo\.venv\lib\site-packages\datasets\dataset_dict.py in map(self, function, with_indices, input_columns, batched, batch_size, remove_columns, keep_in_memory, load_from_cache_file, cache_file_names, writer_batch_size, features, disable_nullable, fn_kwargs, num_proc)
430             cache_file_names = {k: None for k in self}
431         return DatasetDict(
--> 432             {
433                 k: dataset.map(
434                     function=function,

c:\Users\s_scho53\Desktop\L09_Desktop\_FiLMo\.venv\lib\site-packages\datasets\dataset_dict.py in <dictcomp>(.0)
431         return DatasetDict(
432             {
--> 433                 k: dataset.map(
434                     function=function,
435                     with_indices=with_indices,

c:\Users\s_scho53\Desktop\L09_Desktop\_FiLMo\.venv\lib\site-packages\datasets\arrow_dataset.py in map(self, function, with_indices, input_columns, batched, batch_size, drop_last_batch, remove_columns, keep_in_memory, load_from_cache_file, cache_file_name, writer_batch_size, features, disable_nullable, fn_kwargs, num_proc, suffix_template, new_fingerprint)
1483                 logger.info("Spawning {} processes".format(num_proc))
1484                 results = [pool.apply_async(self.__class__._map_single, kwds=kwds) for kwds in kwds_per_shard]
-> 1485                 transformed_shards = [r.get() for r in results]
1486                 logger.info("Concatenating {} shards from multiprocessing".format(num_proc))
1487                 result = concatenate_datasets(transformed_shards)

c:\Users\s_scho53\Desktop\L09_Desktop\_FiLMo\.venv\lib\site-packages\datasets\arrow_dataset.py in <listcomp>(.0)
1483                 logger.info("Spawning {} processes".format(num_proc))
1484                 results = [pool.apply_async(self.__class__._map_single, kwds=kwds) for kwds in kwds_per_shard]
-> 1485                 transformed_shards = [r.get() for r in results]
1486                 logger.info("Concatenating {} shards from multiprocessing".format(num_proc))
1487                 result = concatenate_datasets(transformed_shards)

c:\Users\s_scho53\Desktop\L09_Desktop\_FiLMo\.venv\lib\site-packages\multiprocess\pool.py in get(self, timeout)
769             return self._value
770         else:
--> 771             raise self._value
772
773     def _set(self, i, obj):

NameError: name 'tokenizer' is not defined


I am grateful for any help!

Best solution I have found so far for my issue is to make it explicitly load the data set into RAM to avoid pickling. This does the trick, but does require a lot of RAM.

dataset = dataset.flatten_indices()


Not sure if it will solve your problem!

Hi, thanks for the reply! Unfortunately, this will not be an option as the dataset is indeed very large…

Can you shard it, load each shard individually into RAM, and then multiprocess just that shard?

Hi! It looks like something is not correctly pickled for multiprocessing.
Can you try to update dill and multiprocess?
Can you try to use dill.dumps on your tokenizer or your dataset and see if it raises an error?

Are you sure your tokenizer is defined?
Can you re-run your notebook and try again?

If the issue persists, I’d be happy to help you figure out what’s wrong. In that case, feel free to share a script or a Google Colab that reproduces the issue.
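
The `dill.dumps` check suggested above amounts to a serialize/deserialize round trip. Here is a minimal sketch of that idea. It uses the standard-library `pickle` so that it runs without extra packages; `round_trips` is a hypothetical helper name, not a `datasets` or `dill` function.

```python
import pickle


def round_trips(obj, dumps=pickle.dumps, loads=pickle.loads):
    """Serialize and deserialize obj; return (ok, error_or_None)."""
    try:
        loads(dumps(obj))
        return True, None
    except Exception as exc:  # pickling failures come in several types
        return False, exc


# Plain data survives the round trip.
print(round_trips({"text": ["a", "b"]}))

# The stdlib pickler refuses lambdas; dill is more permissive here.
ok, err = round_trips(lambda e: e)
print(ok, type(err).__name__)
```

With `dill` installed, `round_trips(tokenizer, dill.dumps, dill.loads)` runs the same check with the serializer that `multiprocess` relies on. Note that a clean round trip in the parent process does not guarantee that a worker process can resolve every global name the mapped function refers to.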

Tried to re-run, cleaned the .cache folder, all to the same end. The error remains. When I start the script the CPU peaks for a brief moment, until the error occurs. Let me try to provide you some more context:

txt_files = [str(i) for i in PATH_DATA.glob('all_*')]

datasets = load_dataset(
    path='text',
    data_files=txt_files[0],
    split=None,
)

#datasets['train'] = datasets['train'].select(range(1000))
datasets['train'], datasets['test'] = datasets['train'].train_test_split(test_size=0.02, shuffle=True, seed=1234).values()
datasets['train'], datasets['val']  = datasets['train'].train_test_split(test_size=0.02, shuffle=True, seed=1234).values()

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)

datasets = datasets.map(
    batched=True,
    batch_size=1000,
    num_proc=psutil.cpu_count(),
    remove_columns=['text'],
)


Output of the dataset after splitting:

DatasetDict({
train: Dataset({
features: ['text'],
num_rows: 210759
})
test: Dataset({
features: ['text'],
num_rows: 4390
})
val: Dataset({
features: ['text'],
num_rows: 4302
})
})


Note that I only load text files from txt_files. MODEL_NAME equals 'roberta-base'. I am not quite sure what you are referring to with dill.dumps, though. Sorry, I am still relatively new to the ecosystem…

EDIT1: As indicated, if I set num_proc=1 the code runs fine.

EDIT2: By the way, I run into the same issue if I use other lambda functions that involve anything external to the mapping procedure. For example, I tried to debug it by simply returning a dict with a np.ndarray() as required, without actually performing any pre-processing on the initial text. However, it then told me that name 'numpy' is not defined.

EDIT3: Uninstalling and reinstalling dill and multiprocess did not help. Also, dill.dumps() works fine on both the tokenizer and the dataset, or at least it does not throw an error but returns a lot of cryptic stuff.

EDIT4: I can’t reproduce this error in a colab environment. Likely something local?

This looks like an issue with the environment.
Does this work if you run the code from the command line instead of Jupyter/IPython?

Thanks for the reply! When running it in the command line within the .venv the same error occurs… Should I try to run it outside the virtual env as well?

EDIT: It returns the same error if I run it from the CLI outside the virtual env:

>>> tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
>>> tokenizer

>>> datasets = datasets.map(
...     batched=True,
...     batch_size=10,
...     num_proc=2,#psutil.cpu_count(),
...     remove_columns=['text'],
... )
Token indices sequence length is longer than the specified maximum sequence length for this model (4078 > 512). Running this sequence through the model will result in indexing errors
#1:   0%|                                                                                                                | 0/50 [00:00<?, ?ba/s]
#0:   0%|                                                                                                                | 0/50 [00:00<?, ?ba/s]
multiprocess.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "C:\Python\lib\site-packages\multiprocess\pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "C:\Python\lib\site-packages\datasets\arrow_dataset.py", line 203, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "C:\Python\lib\site-packages\datasets\fingerprint.py", line 337, in wrapper
out = func(self, *args, **kwargs)
File "C:\Python\lib\site-packages\datasets\arrow_dataset.py", line 1695, in _map_single
batch = apply_function_on_filtered_inputs(
File "C:\Python\lib\site-packages\datasets\arrow_dataset.py", line 1608, in apply_function_on_filtered_inputs
function(*fn_args, effective_indices, **fn_kwargs) if with_indices else function(*fn_args, **fn_kwargs)
File "<stdin>", line 2, in <lambda>
NameError: name 'tokenizer' is not defined
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python\lib\site-packages\datasets\dataset_dict.py", line 432, in map
{
File "C:\Python\lib\site-packages\datasets\dataset_dict.py", line 433, in <dictcomp>
k: dataset.map(
File "C:\Python\lib\site-packages\datasets\arrow_dataset.py", line 1485, in map
transformed_shards = [r.get() for r in results]
File "C:\Python\lib\site-packages\datasets\arrow_dataset.py", line 1485, in <listcomp>
transformed_shards = [r.get() for r in results]
File "C:\Python\lib\site-packages\multiprocess\pool.py", line 771, in get
raise self._value
NameError: name 'tokenizer' is not defined


Could you try defining an actual function instead of using a lambda?

Also tried this as well, and again I ran into the very same error…

EDIT: So I transferred all the code from the Jupyter Notebook into a plain .py script and ran the relevant part again, now actually receiving a different error message. Not sure, though, whether this helps to get closer to the issue:

Using custom data configuration default-8f1dcd17d8b8834c
Reusing dataset text (C:\Users\s_scho53\.cache\huggingface\datasets\text\default-8f1dcd17d8b8834c\0.0.0\e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5)
Token indices sequence length is longer than the specified maximum sequence length for this model (4078 > 512). Running this sequence through the model will result in indexing errors
Using custom data configuration default-8f1dcd17d8b8834c
Reusing dataset text (C:\Users\s_scho53\.cache\huggingface\datasets\text\default-8f1dcd17d8b8834c\0.0.0\e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5)
Using custom data configuration default-8f1dcd17d8b8834c
Reusing dataset text (C:\Users\s_scho53\.cache\huggingface\datasets\text\default-8f1dcd17d8b8834c\0.0.0\e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5)
Token indices sequence length is longer than the specified maximum sequence length for this model (4078 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (4078 > 512). Running this sequence through the model will result in indexing errors
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "C:\Python\lib\site-packages\multiprocess\spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "C:\Python\lib\site-packages\multiprocess\spawn.py", line 125, in _main
prepare(preparation_data)
File "C:\Python\lib\site-packages\multiprocess\spawn.py", line 236, in prepare
_fixup_main_from_path(data['init_main_from_path'])
File "C:\Python\lib\site-packages\multiprocess\spawn.py", line 287, in _fixup_main_from_path
main_content = runpy.run_path(main_path,
File "C:\Python\lib\runpy.py", line 265, in run_path
return _run_module_code(code, init_globals, run_name,
File "C:\Python\lib\runpy.py", line 97, in _run_module_code
_run_code(code, mod_globals, init_globals,
File "C:\Python\lib\runpy.py", line 87, in _run_code
exec(code, run_globals)
File "c:\Users\s_scho53\Desktop\L09_Desktop\_FiLMo\Untitled-1.py", line 41, in <module>
datasets = datasets.map(
File "C:\Python\lib\site-packages\datasets\dataset_dict.py", line 432, in map
{
File "C:\Python\lib\site-packages\datasets\dataset_dict.py", line 433, in <dictcomp>
k: dataset.map(
File "C:\Python\lib\site-packages\datasets\arrow_dataset.py", line 1452, in map
with Pool(num_proc, initargs=(RLock(),), initializer=tqdm.set_lock) as pool:
File "C:\Python\lib\site-packages\multiprocess\context.py", line 119, in Pool
File "C:\Python\lib\site-packages\multiprocess\pool.py", line 212, in __init__
self._repopulate_pool()
File "C:\Python\lib\site-packages\multiprocess\pool.py", line 303, in _repopulate_pool
return self._repopulate_pool_static(self._ctx, self.Process,
File "C:\Python\lib\site-packages\multiprocess\pool.py", line 326, in _repopulate_pool_static
w.start()
File "C:\Python\lib\site-packages\multiprocess\process.py", line 121, in start
self._popen = self._Popen(self)
File "C:\Python\lib\site-packages\multiprocess\context.py", line 327, in _Popen
return Popen(process_obj)
File "C:\Python\lib\site-packages\multiprocess\popen_spawn_win32.py", line 45, in __init__
prep_data = spawn.get_preparation_data(process_obj._name)
File "C:\Python\lib\site-packages\multiprocess\spawn.py", line 154, in get_preparation_data
_check_not_importing_main()
File "C:\Python\lib\site-packages\multiprocess\spawn.py", line 134, in _check_not_importing_main
raise RuntimeError('''
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.

This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:

    if __name__ == '__main__':
        freeze_support()
        ...

The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.


This usually happens if you run a script without using

if __name__ == "__main__":
    ...


This is because you don’t want your processes to be created when your module is imported.
Can you try again using this?

from datasets import load_dataset

if __name__ == "__main__":
    ...


@lhoestq thanks for the reply! Indeed, this fixes the RuntimeError in the previous reply. However, it again results in the all too familiar error from before…

Using custom data configuration default-8f1dcd17d8b8834c
Reusing dataset text (C:\Users\s_scho53\.cache\huggingface\datasets\text\default-8f1dcd17d8b8834c\0.0.0\e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5)
Token indices sequence length is longer than the specified maximum sequence length for this model (4078 > 512). Running this sequence through the model will result in indexing errors
#1:   0%|                                                                                                                                                                                                                    | 0/50 [00:00<?, ?ba/s]
#0:   0%|                                                                                                                                                                                                                    | 0/50 [00:00<?, ?ba/s]
multiprocess.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "C:\Python\lib\site-packages\multiprocess\pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "C:\Python\lib\site-packages\datasets\arrow_dataset.py", line 203, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "C:\Python\lib\site-packages\datasets\fingerprint.py", line 337, in wrapper
out = func(self, *args, **kwargs)
File "C:\Python\lib\site-packages\datasets\arrow_dataset.py", line 1695, in _map_single
batch = apply_function_on_filtered_inputs(
File "C:\Python\lib\site-packages\datasets\arrow_dataset.py", line 1608, in apply_function_on_filtered_inputs
function(*fn_args, effective_indices, **fn_kwargs) if with_indices else function(*fn_args, **fn_kwargs)
File "c:/Users/s_scho53/Desktop/L09_Desktop/_FiLMo/Untitled-1.py", line 40, in encode
NameError: name 'tokenizer' is not defined
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "c:/Users/s_scho53/Desktop/L09_Desktop/_FiLMo/Untitled-1.py", line 42, in <module>
datasets = datasets.map(
File "C:\Python\lib\site-packages\datasets\dataset_dict.py", line 432, in map
{
File "C:\Python\lib\site-packages\datasets\dataset_dict.py", line 433, in <dictcomp>
k: dataset.map(
File "C:\Python\lib\site-packages\datasets\arrow_dataset.py", line 1485, in map
transformed_shards = [r.get() for r in results]
File "C:\Python\lib\site-packages\datasets\arrow_dataset.py", line 1485, in <listcomp>
transformed_shards = [r.get() for r in results]
File "C:\Python\lib\site-packages\multiprocess\pool.py", line 771, in get
raise self._value
NameError: name 'tokenizer' is not defined


Can you share the script that you used? I’ll try to reproduce on my side.

Wrote you a DM


In case anyone ever encounters a similar or related issue in the future: what eventually solved it for me was moving the definition of the tokenizer outside of the if __name__ == '__main__': block, i.e. to module level. This enabled the proper serialization and unpickling of the tokenizer in every sub-process. Thanks to @lhoestq for proposing the fix!
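
For anyone wondering why that placement matters, here is a small sketch under stated assumptions. On Windows, worker processes are spawned and re-run the main script under the name `__mp_main__` (the `spawn.py` and `runpy` frames in the traceback earlier in the thread show this step), so anything defined only inside the `__main__` guard does not exist in the worker. The sketch imitates that import step in-process instead of starting real workers. The two script strings, the toy `tokenizer` and the helpers `load_like_spawned_worker` and `try_encode` are all hypothetical stand-ins, not code from the thread.

```python
import textwrap

# Tokenizer defined only inside the main guard: workers never see it.
BROKEN_SCRIPT = textwrap.dedent('''
    def encode(batch):
        return tokenizer(batch["text"])

    if __name__ == "__main__":
        tokenizer = lambda texts: [t.split() for t in texts]
''')

# Tokenizer defined at module level: every worker defines it on import.
FIXED_SCRIPT = textwrap.dedent('''
    tokenizer = lambda texts: [t.split() for t in texts]

    def encode(batch):
        return tokenizer(batch["text"])

    if __name__ == "__main__":
        pass  # the datasets.map(encode, num_proc=...) call would go here
''')


def load_like_spawned_worker(source: str) -> dict:
    """Execute source with __name__ set to "__mp_main__", as a spawned worker would."""
    namespace = {"__name__": "__mp_main__"}
    exec(compile(source, "<script>", "exec"), namespace)
    return namespace


def try_encode(source: str, batch: dict) -> str:
    """Call encode() from the worker's view of the script and report what happens."""
    encode = load_like_spawned_worker(source)["encode"]
    try:
        return f"ok: {encode(batch)}"
    except NameError as err:
        return f"NameError: {err}"


batch = {"text": ["hello world"]}
print(try_encode(BROKEN_SCRIPT, batch))
print(try_encode(FIXED_SCRIPT, batch))
```

The first call reports the same `NameError` seen in the thread, and the second succeeds. The same reasoning would apply to imports such as `numpy` that the mapped function depends on.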

Do you all know of a good demo that explains how this all works? I have about 7 million tweets I’m classifying that I uploaded to datasets as CSVs. I’m not sure exactly how to incorporate the uploaded datasets with my classifier or pipeline. Thanks!

If you are using a pipeline from transformers you can just pass the text column to the pipeline

results = my_classification_pipeline(my_dataset["text_column"])

Hi, could you please elaborate on how you fixed the issue? I have the exact same problem now and I cannot fix it.