Map multiprocessing Issue

I’m getting this issue when I try to map-tokenize a large custom data set. It looks like a multiprocessing issue: running it with one process, or with a smaller data set, seems to work. I’ve tried different batch_size values and still get the same errors. I also tried sharding it into smaller data sets, but that didn’t help. Thoughts? Thanks!

dataset['test'].map(lambda e: tokenizer(e['texts']), batched = True, batch_size = 1000, num_proc = 8)


error Traceback (most recent call last)
in
----> 1 dataset['test'].map(lambda e: tokenizer(e['texts']), batched = True, batch_size = 1000, num_proc = 8)

/home/venv/lib/python3.6/site-packages/datasets/arrow_dataset.py in map(self, function, with_indices, input_columns, batched, batch_size, drop_last_batch, remove_columns, keep_in_memory, load_from_cache_file, cache_file_name, writer_batch_size, features, disable_nullable, fn_kwargs, num_proc, suffix_template, new_fingerprint)
   1316 logger.info("Spawning {} processes".format(num_proc))
   1317 results = [pool.apply_async(self.__class__._map_single, kwds=kwds) for kwds in kwds_per_shard]
-> 1318 transformed_shards = [r.get() for r in results]
   1319 logger.info("Concatenating {} shards from multiprocessing".format(num_proc))
   1320 result = concatenate_datasets(transformed_shards)

/home/venv/lib/python3.6/site-packages/datasets/arrow_dataset.py in <listcomp>(.0)
   1316 logger.info("Spawning {} processes".format(num_proc))
   1317 results = [pool.apply_async(self.__class__._map_single, kwds=kwds) for kwds in kwds_per_shard]
-> 1318 transformed_shards = [r.get() for r in results]
   1319 logger.info("Concatenating {} shards from multiprocessing".format(num_proc))
   1320 result = concatenate_datasets(transformed_shards)

/home/venv/lib/python3.6/site-packages/multiprocess/pool.py in get(self, timeout)
    642 return self._value
    643 else:
--> 644 raise self._value
    645
    646 def _set(self, i, obj):

/home/venv/lib/python3.6/site-packages/multiprocess/pool.py in _handle_tasks(taskqueue, put, outqueue, pool, cache)
    422 break
    423 try:
--> 424 put(task)
    425 except Exception as e:
    426 job, idx = task[:2]

/home/venv/lib/python3.6/site-packages/multiprocess/connection.py in send(self, obj)
    207 self._check_closed()
    208 self._check_writable()
--> 209 self._send_bytes(_ForkingPickler.dumps(obj))
    210
    211 def recv_bytes(self, maxlength=None):

/home/venv/lib/python3.6/site-packages/multiprocess/connection.py in _send_bytes(self, buf)
    394 n = len(buf)
    395 # For wire compatibility with 3.2 and lower
--> 396 header = struct.pack("!i", n)
    397 if n > 16384:
    398 # The payload is large so Nagle's algorithm won't be triggered

error: 'i' format requires -2147483648 <= number <= 2147483647
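
One possible reading of this error, offered as a sketch rather than a diagnosis: the last frame packs the byte length of the pickled task into a signed 32-bit header, so a payload of 2 GiB or more cannot be sent to a worker on this Python version. The snippet below reproduces only that packing step with the standard library; the helper name fits_in_header is made up for illustration.

```python
import struct

INT32_MAX = 2**31 - 1  # largest value the "!i" (signed 32-bit) format accepts


def fits_in_header(payload_size: int) -> bool:
    """Return True if a payload length can be packed with struct.pack("!i", n),
    the call shown in the last frame of the traceback."""
    try:
        struct.pack("!i", payload_size)
        return True
    except struct.error:
        return False


print(fits_in_header(16_384))         # True
print(fits_in_header(INT32_MAX))      # True
print(fits_in_header(INT32_MAX + 1))  # False: same struct.error as in the traceback
```

If that reading is right, anything that shrinks what gets pickled for each worker should avoid the error, which would fit the observation that a smaller data set works.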

Hi there, I got a (maybe) similar issue caused by the multiprocessing in map. Instead of opening a new thread, I thought I would use this one. Note that the error occurs only if I specify num_proc > 1, i.e. use multi-processing:

Code:

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)

datasets = datasets.map(
    lambda sequence: tokenizer(sequence['text'], return_special_tokens_mask=True),
    batched=True,
    batch_size=1000,
    num_proc=2, #psutil.cpu_count()
    remove_columns=['text'],
)

datasets

Error:

Token indices sequence length is longer than the specified maximum sequence length for this model (8395 > 512). Running this sequence through the model will result in indexing errors
---------------------------------------------------------------------------
RemoteTraceback                           Traceback (most recent call last)
RemoteTraceback: 
"""
Traceback (most recent call last):
  File "c:\Users\s_scho53\Desktop\L09_Desktop\_FiLMo\.venv\lib\site-packages\multiprocess\pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "c:\Users\s_scho53\Desktop\L09_Desktop\_FiLMo\.venv\lib\site-packages\datasets\arrow_dataset.py", line 203, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "c:\Users\s_scho53\Desktop\L09_Desktop\_FiLMo\.venv\lib\site-packages\datasets\fingerprint.py", line 337, in wrapper
    out = func(self, *args, **kwargs)
  File "c:\Users\s_scho53\Desktop\L09_Desktop\_FiLMo\.venv\lib\site-packages\datasets\arrow_dataset.py", line 1695, in _map_single
    batch = apply_function_on_filtered_inputs(
  File "c:\Users\s_scho53\Desktop\L09_Desktop\_FiLMo\.venv\lib\site-packages\datasets\arrow_dataset.py", line 1608, in apply_function_on_filtered_inputs
    function(*fn_args, effective_indices, **fn_kwargs) if with_indices else function(*fn_args, **fn_kwargs)
  File "<ipython-input-18-25a1ecec1896>", line 9, in <lambda>
NameError: name 'tokenizer' is not defined
"""

The above exception was the direct cause of the following exception:

NameError                                 Traceback (most recent call last)
<ipython-input-18-25a1ecec1896> in <module>
      6 tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
      7 
----> 8 datasets = datasets.map(
      9     lambda sequence: tokenizer(sequence['text'], return_special_tokens_mask=True),
     10     batched=True,

c:\Users\s_scho53\Desktop\L09_Desktop\_FiLMo\.venv\lib\site-packages\datasets\dataset_dict.py in map(self, function, with_indices, input_columns, batched, batch_size, remove_columns, keep_in_memory, load_from_cache_file, cache_file_names, writer_batch_size, features, disable_nullable, fn_kwargs, num_proc)
    430             cache_file_names = {k: None for k in self}
    431         return DatasetDict(
--> 432             {
    433                 k: dataset.map(
    434                     function=function,

c:\Users\s_scho53\Desktop\L09_Desktop\_FiLMo\.venv\lib\site-packages\datasets\dataset_dict.py in <dictcomp>(.0)
    431         return DatasetDict(
    432             {
--> 433                 k: dataset.map(
    434                     function=function,
    435                     with_indices=with_indices,

c:\Users\s_scho53\Desktop\L09_Desktop\_FiLMo\.venv\lib\site-packages\datasets\arrow_dataset.py in map(self, function, with_indices, input_columns, batched, batch_size, drop_last_batch, remove_columns, keep_in_memory, load_from_cache_file, cache_file_name, writer_batch_size, features, disable_nullable, fn_kwargs, num_proc, suffix_template, new_fingerprint)
   1483                 logger.info("Spawning {} processes".format(num_proc))
   1484                 results = [pool.apply_async(self.__class__._map_single, kwds=kwds) for kwds in kwds_per_shard]
-> 1485                 transformed_shards = [r.get() for r in results]
   1486                 logger.info("Concatenating {} shards from multiprocessing".format(num_proc))
   1487                 result = concatenate_datasets(transformed_shards)

c:\Users\s_scho53\Desktop\L09_Desktop\_FiLMo\.venv\lib\site-packages\datasets\arrow_dataset.py in <listcomp>(.0)
   1483                 logger.info("Spawning {} processes".format(num_proc))
   1484                 results = [pool.apply_async(self.__class__._map_single, kwds=kwds) for kwds in kwds_per_shard]
-> 1485                 transformed_shards = [r.get() for r in results]
   1486                 logger.info("Concatenating {} shards from multiprocessing".format(num_proc))
   1487                 result = concatenate_datasets(transformed_shards)

c:\Users\s_scho53\Desktop\L09_Desktop\_FiLMo\.venv\lib\site-packages\multiprocess\pool.py in get(self, timeout)
    769             return self._value
    770         else:
--> 771             raise self._value
    772 
    773     def _set(self, i, obj):

NameError: name 'tokenizer' is not defined

I am grateful for any help! :slight_smile:

Best solution I have found so far for my issue is to make it explicitly load the data set into RAM to avoid pickling. This does the trick, but does require a lot of RAM.

dataset = dataset.flatten_indices()

Not sure if it will solve your problem!

Hi, thanks for the reply! Unfortunately, this will not be an option as the dataset is indeed very large…

Can you shard it, load each shard individually into RAM, and then multiprocess just that shard?

Hi! It looks like something is not being pickled correctly for multiprocessing.
Can you try updating dill and multiprocess?
Can you also try calling dill.dumps on your tokenizer and on your dataset to see whether it raises an error?

Are you sure your tokenizer is defined ?
Can you re-run your notebook and try again ?

If the issues persist, I’d be happy to help you figure what’s wrong. In this case feel free to share a script or a google colab that reproduces this issue :slight_smile:
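
For readers unsure what the dill.dumps check means: serializing ("pickling") an object turns it into bytes that can be sent to a worker process, and the check simply tries that and sees whether it fails. The sketch below uses the standard-library pickle module as a stand-in so that it runs without extra packages. dill offers dumps and loads functions with the same names but can serialize more kinds of objects (lambdas, for instance), so its results will differ. The helper name can_round_trip is hypothetical.

```python
import pickle


def can_round_trip(obj) -> bool:
    """Try to serialize obj to bytes and load it back; report whether it worked."""
    try:
        pickle.loads(pickle.dumps(obj))
        return True
    except Exception:  # pickle raises several different exception types on failure
        return False


print(can_round_trip({"text": ["a", "b"]}))  # True: plain data pickles fine
print(can_round_trip(lambda e: e["text"]))   # False with the stdlib pickle module
```

With dill, the equivalent check would be dill.dumps(tokenizer) and dill.dumps(dataset); a long bytes value coming back means serialization itself succeeded.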

Tried to re-run, cleaned the .cache folder, all to the same end. The error remains. When I start the script the CPU peaks for a brief moment, until the error occurs. Let me try to provide you some more context:

txt_files = [str(i) for i in PATH_DATA.glob('all_*')]

datasets = load_dataset(
    path='text',
    data_files=txt_files[0],
    split=None,
)

#datasets['train'] = datasets['train'].select(range(1000))
datasets['train'], datasets['test'] = datasets['train'].train_test_split(test_size=0.02, shuffle=True, seed=1234).values()
datasets['train'], datasets['val']  = datasets['train'].train_test_split(test_size=0.02, shuffle=True, seed=1234).values()

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)

datasets = datasets.map(
    lambda sequence: tokenizer(sequence['text'], return_special_tokens_mask=True),
    batched=True,
    batch_size=1000,
    num_proc=psutil.cpu_count(),
    remove_columns=['text'],
)

Output of the dataset after splitting:

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 210759
    })
    test: Dataset({
        features: ['text'],
        num_rows: 4390
    })
    val: Dataset({
        features: ['text'],
        num_rows: 4302
    })
})

Note that I only load text files from txt_files. MODEL_NAME equals 'roberta-base'. I am not quite sure what you are referring to with dill.dumps, though. Sorry, I am still relatively new to the ecosystem…

EDIT1: As indicated, if I set num_proc=1 the code runs fine.

EDIT2: By the way, I run into the same issue if I use other lambda functions that involve anything external to the mapping procedure. For example, I tried to debug it by simply returning a dict with an np.ndarray() as required, without actually performing any pre-processing on the initial text. However, it then told me that name 'numpy' is not defined.

EDIT3: Uninstalling and reinstalling dill and multiprocess did not help. Also, dill.dumps() works fine on both the tokenizer and the dataset, or at least it does not throw an error but returns a lot of cryptic stuff.

EDIT4: I can’t reproduce this error in a colab environment. Likely something local? :thinking:

This looks like an issue with the environment.
Does this work if you run the code from the command line instead of jupyter/ipython ?

Thanks for the reply! When running it in the command line within the .venv the same error occurs… Should I try to run it outside the virtual env as well?

EDIT: It returns the same error if I run it in cli outside the virtual env:

>>> tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
>>> tokenizer
PreTrainedTokenizerFast(name_or_path='roberta-base', vocab_size=50265, model_max_len=512, is_fast=True, padding_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'sep_token': '</s>', 'pad_token': '<pad>', 'cls_token': '<s>', 'mask_token': AddedToken("<mask>", rstrip=False, lstrip=True, single_word=False, normalized=False)})

>>> datasets = datasets.map(
...     lambda x: tokenizer(x['text'], return_special_tokens_mask=True),
...     batched=True,
...     batch_size=10,
...     num_proc=2,#psutil.cpu_count(),
...     remove_columns=['text'],
... )
Token indices sequence length is longer than the specified maximum sequence length for this model (4078 > 512). Running this sequence through the model will result in indexing errors
#1:   0%|                                                                                                                | 0/50 [00:00<?, ?ba/s]
#0:   0%|                                                                                                                | 0/50 [00:00<?, ?ba/s]
multiprocess.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "C:\Python\lib\site-packages\multiprocess\pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "C:\Python\lib\site-packages\datasets\arrow_dataset.py", line 203, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "C:\Python\lib\site-packages\datasets\fingerprint.py", line 337, in wrapper
    out = func(self, *args, **kwargs)
  File "C:\Python\lib\site-packages\datasets\arrow_dataset.py", line 1695, in _map_single
    batch = apply_function_on_filtered_inputs(
  File "C:\Python\lib\site-packages\datasets\arrow_dataset.py", line 1608, in apply_function_on_filtered_inputs
    function(*fn_args, effective_indices, **fn_kwargs) if with_indices else function(*fn_args, **fn_kwargs)
  File "<stdin>", line 2, in <lambda>
NameError: name 'tokenizer' is not defined
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python\lib\site-packages\datasets\dataset_dict.py", line 432, in map
    {
  File "C:\Python\lib\site-packages\datasets\dataset_dict.py", line 433, in <dictcomp>
    k: dataset.map(
  File "C:\Python\lib\site-packages\datasets\arrow_dataset.py", line 1485, in map
    transformed_shards = [r.get() for r in results]
  File "C:\Python\lib\site-packages\datasets\arrow_dataset.py", line 1485, in <listcomp>
    transformed_shards = [r.get() for r in results]
  File "C:\Python\lib\site-packages\multiprocess\pool.py", line 771, in get
    raise self._value
NameError: name 'tokenizer' is not defined

Could you try by defining an actual function instead of using a lambda ?

Also tried this as well, and again I ran into the very same error…

EDIT: So I transferred all the code from the Jupyter Notebook into a plain .py script and ran the relevant part again, and now I get a different error message. Not sure, though, whether this helps get closer to the issue :smiley:

Using custom data configuration default-8f1dcd17d8b8834c
Reusing dataset text (C:\Users\s_scho53\.cache\huggingface\datasets\text\default-8f1dcd17d8b8834c\0.0.0\e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5)
Token indices sequence length is longer than the specified maximum sequence length for this model (4078 > 512). Running this sequence through the model will result in indexing errors
Using custom data configuration default-8f1dcd17d8b8834c
Reusing dataset text (C:\Users\s_scho53\.cache\huggingface\datasets\text\default-8f1dcd17d8b8834c\0.0.0\e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5)
Using custom data configuration default-8f1dcd17d8b8834c
Reusing dataset text (C:\Users\s_scho53\.cache\huggingface\datasets\text\default-8f1dcd17d8b8834c\0.0.0\e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5)
Token indices sequence length is longer than the specified maximum sequence length for this model (4078 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (4078 > 512). Running this sequence through the model will result in indexing errors
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Python\lib\site-packages\multiprocess\spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "C:\Python\lib\site-packages\multiprocess\spawn.py", line 125, in _main
    prepare(preparation_data)
  File "C:\Python\lib\site-packages\multiprocess\spawn.py", line 236, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "C:\Python\lib\site-packages\multiprocess\spawn.py", line 287, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
  File "C:\Python\lib\runpy.py", line 265, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "C:\Python\lib\runpy.py", line 97, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "C:\Python\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "c:\Users\s_scho53\Desktop\L09_Desktop\_FiLMo\Untitled-1.py", line 41, in <module>
    datasets = datasets.map(
  File "C:\Python\lib\site-packages\datasets\dataset_dict.py", line 432, in map
    {
  File "C:\Python\lib\site-packages\datasets\dataset_dict.py", line 433, in <dictcomp>
    k: dataset.map(
  File "C:\Python\lib\site-packages\datasets\arrow_dataset.py", line 1452, in map
    with Pool(num_proc, initargs=(RLock(),), initializer=tqdm.set_lock) as pool:
  File "C:\Python\lib\site-packages\multiprocess\context.py", line 119, in Pool
    return Pool(processes, initializer, initargs, maxtasksperchild,
  File "C:\Python\lib\site-packages\multiprocess\pool.py", line 212, in __init__
    self._repopulate_pool()
  File "C:\Python\lib\site-packages\multiprocess\pool.py", line 303, in _repopulate_pool
    return self._repopulate_pool_static(self._ctx, self.Process,
  File "C:\Python\lib\site-packages\multiprocess\pool.py", line 326, in _repopulate_pool_static
    w.start()
  File "C:\Python\lib\site-packages\multiprocess\process.py", line 121, in start
    self._popen = self._Popen(self)
  File "C:\Python\lib\site-packages\multiprocess\context.py", line 327, in _Popen
    return Popen(process_obj)
  File "C:\Python\lib\site-packages\multiprocess\popen_spawn_win32.py", line 45, in __init__
    prep_data = spawn.get_preparation_data(process_obj._name)
  File "C:\Python\lib\site-packages\multiprocess\spawn.py", line 154, in get_preparation_data
    _check_not_importing_main()
  File "C:\Python\lib\site-packages\multiprocess\spawn.py", line 134, in _check_not_importing_main
    raise RuntimeError('''
RuntimeError:
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.

This usually happens if you use a script without using

if __name__ == "__main__":
    ...

This is because, with the spawn start method, each child process imports your script again when it starts; without the guard, that import would try to create the processes all over again.
Can you try again using this?

from datasets import load_dataset

if __name__ == "__main__":
    dataset = load_dataset(...)
    ...
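
To see why the guard matters, here is a small simulation rather than a real multi-process run. It assumes, as the spawn.py frames in the traceback suggest, that a spawned child re-executes the main script under a module name other than "__main__" ("__mp_main__" in CPython's multiprocessing). The script text and the run_as helper are invented for illustration.

```python
# A tiny "script" that records which of its parts get executed.
SCRIPT = '''
events.append("module-level code ran")

if __name__ == "__main__":
    events.append("guarded code ran")
'''


def run_as(module_name: str) -> list:
    """Execute SCRIPT as if it were a module called module_name."""
    events = []
    exec(SCRIPT, {"__name__": module_name, "events": events})
    return events


print(run_as("__main__"))     # ['module-level code ran', 'guarded code ran']
print(run_as("__mp_main__"))  # ['module-level code ran']
```

Anything that starts worker processes, such as a map call with num_proc greater than 1, therefore belongs under the guard, so that only the parent process runs it.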

@lhoestq thanks for the reply! Indeed, this fixes the RuntimeError from the previous reply. However, it again results in the all too familiar error from before… :triumph:

Using custom data configuration default-8f1dcd17d8b8834c
Reusing dataset text (C:\Users\s_scho53\.cache\huggingface\datasets\text\default-8f1dcd17d8b8834c\0.0.0\e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5)
Token indices sequence length is longer than the specified maximum sequence length for this model (4078 > 512). Running this sequence through the model will result in indexing errors
#1:   0%|                                                                                                                                                                                                                    | 0/50 [00:00<?, ?ba/s]
#0:   0%|                                                                                                                                                                                                                    | 0/50 [00:00<?, ?ba/s]
multiprocess.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "C:\Python\lib\site-packages\multiprocess\pool.py", line 125, in worker      
    result = (True, func(*args, **kwds))
  File "C:\Python\lib\site-packages\datasets\arrow_dataset.py", line 203, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "C:\Python\lib\site-packages\datasets\fingerprint.py", line 337, in wrapper  
    out = func(self, *args, **kwargs)
  File "C:\Python\lib\site-packages\datasets\arrow_dataset.py", line 1695, in _map_single
    batch = apply_function_on_filtered_inputs(
  File "C:\Python\lib\site-packages\datasets\arrow_dataset.py", line 1608, in apply_function_on_filtered_inputs
    function(*fn_args, effective_indices, **fn_kwargs) if with_indices else function(*fn_args, **fn_kwargs)
  File "c:/Users/s_scho53/Desktop/L09_Desktop/_FiLMo/Untitled-1.py", line 40, in encode
    return tokenizer(x['text'], return_special_tokens_mask=True)
NameError: name 'tokenizer' is not defined
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "c:/Users/s_scho53/Desktop/L09_Desktop/_FiLMo/Untitled-1.py", line 42, in <module>
    datasets = datasets.map(
  File "C:\Python\lib\site-packages\datasets\dataset_dict.py", line 432, in map
    {
  File "C:\Python\lib\site-packages\datasets\dataset_dict.py", line 433, in <dictcomp>
    k: dataset.map(
  File "C:\Python\lib\site-packages\datasets\arrow_dataset.py", line 1485, in map
    transformed_shards = [r.get() for r in results]
  File "C:\Python\lib\site-packages\datasets\arrow_dataset.py", line 1485, in <listcomp>
    transformed_shards = [r.get() for r in results]
  File "C:\Python\lib\site-packages\multiprocess\pool.py", line 771, in get
    raise self._value
NameError: name 'tokenizer' is not defined

Can you share the script that you used ? I’ll try to reproduce on my side

Wrote you a DM :slight_smile:

In case anyone ever encounters a similar or related issue in the future: what eventually solved it for me was moving the definition of the tokenizer outside of the if __name__ == '__main__': block, i.e. to module level. This enabled the proper serialization and unpickling of the tokenizer in every sub-process. Thanks to @lhoestq for proposing the fix!
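
A minimal sketch of the two script layouts, simulated in a single process so that it runs without datasets or transformers installed. It assumes the spawned worker re-executes the script with a module name other than "__main__", so names assigned under the guard never exist there. make_tokenizer is a hypothetical stand-in for AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True), and encode_in_simulated_worker is invented for this illustration.

```python
# Layout that fails: the tokenizer only exists in the parent process.
BROKEN = '''
def encode(batch):
    return tokenizer(batch["text"])

if __name__ == "__main__":
    tokenizer = make_tokenizer()
'''

# Layout that works: the tokenizer is created at module level, so every
# process that executes the script defines it.
FIXED = '''
tokenizer = make_tokenizer()

def encode(batch):
    return tokenizer(batch["text"])

if __name__ == "__main__":
    pass  # load the dataset and call datasets.map(encode, num_proc=...) here
'''


def make_tokenizer():
    """Hypothetical stand-in for a real tokenizer: whitespace-splits each text."""
    return lambda texts: {"input_ids": [t.split() for t in texts]}


def encode_in_simulated_worker(source: str, batch: dict) -> dict:
    """Execute source the way a spawned child would see it, then call encode()."""
    namespace = {"__name__": "__mp_main__", "make_tokenizer": make_tokenizer}
    exec(source, namespace)
    return namespace["encode"](batch)


batch = {"text": ["hello world"]}
print(encode_in_simulated_worker(FIXED, batch))  # {'input_ids': [['hello', 'world']]}
try:
    encode_in_simulated_worker(BROKEN, batch)
except NameError as err:
    print("broken layout:", err)  # broken layout: name 'tokenizer' is not defined
```

The same reasoning would explain the earlier "name 'numpy' is not defined" message: imports and other globals that the mapped function relies on also need to be executed at module level in every worker.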