Hi All,
I have been struggling to make the map tokenization run in parallel, but I haven't been able to get it working.
Could you please suggest how to approach this?
Here is the example code.
training_dataset = dataset.map(
    lambda example, idx: tokenize(
        example,
        idx,
        vocab,
        in_df.columns,
        decoder_dataset,
        in_out_idx,
        output_max_length,
    ),
    remove_columns=dataset.column_names,
    with_indices=True,
    num_proc=40,
)
num_proc only makes sense for slow tokenizers. If tokenizer.is_fast returns True, you should use map in batched mode and set num_proc=None to parallelize the processing: fast tokenizers automatically tokenize a batch of samples in parallel, and because they are written in Rust, their internal parallelism does not mix with Python multiprocessing.
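For reference, with a fast tokenizer the batched call could look roughly like this (just a sketch; the checkpoint name and the "text" column are placeholders, not from the snippet above):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # loads the fast (Rust) tokenizer by default
assert tokenizer.is_fast

# Batched map: the Rust tokenizer parallelizes each batch internally,
# so num_proc is left at its default of None.
training_dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True),
    batched=True,
    remove_columns=dataset.column_names,
)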
Hi @mariosasko,
Thanks for the reply.
However, the current tokenizer is a custom one written purely in Python. Is there any way to parallelize the mapping process in this case?
Yes, num_proc (but not too high; e.g., os.cpu_count() is a good number) combined with batched=True should yield the best performance in that scenario.
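In case it helps, here is a minimal sketch of that combination using the pure-Python tokenize from the first snippet (it assumes tokenize returns the token ids for a single row and that naming the output column input_ids is acceptable):

import os

def tokenize_batch(batch, indices):
    # Rebuild per-row example dicts from the batched columnar format,
    # then run the pure-Python tokenize() on each row.
    rows = [dict(zip(batch.keys(), values)) for values in zip(*batch.values())]
    return {
        "input_ids": [
            tokenize(example, idx, vocab, in_df.columns,
                     decoder_dataset, in_out_idx, output_max_length)
            for example, idx in zip(rows, indices)
        ]
    }

training_dataset = dataset.map(
    tokenize_batch,
    batched=True,
    with_indices=True,
    num_proc=os.cpu_count(),  # one worker per core, as suggested above
    remove_columns=dataset.column_names,
)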
Thanks @mariosasko,
Let me give it a try and get back to you.
If you try that and it doesn't work, try passing the custom tokenizer in via fn_kwargs. The .map function does not play nicely with globally defined variables. For example:
# The keys in fn_kwargs are passed to the mapped function as keyword
# arguments, so the lambda has to accept them by name.
my_dataset = my_dataset.map(
    lambda example, idx, custom_tokenizer, input_vocab, input_df,
           decoder, in_out_index, max_length: custom_tokenizer(
        example,
        idx,
        input_vocab,
        input_df.columns,
        decoder,
        in_out_index,
        max_length,
    ),
    remove_columns=my_dataset.column_names,
    with_indices=True,
    num_proc=40,
    fn_kwargs={
        'custom_tokenizer': tokenizer,
        'input_vocab': vocab,
        'input_df': in_df,
        'decoder': decoder_dataset,
        'in_out_index': in_out_idx,
        'max_length': output_max_length,
    })
I would personally use a function defined with def instead and feed the parameters in through fn_kwargs.
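Roughly what that could look like with the same names as above (still a sketch, not tested against the actual custom tokenizer):

def tokenize_fn(example, idx, custom_tokenizer, input_vocab, input_df,
                decoder, in_out_index, max_length):
    # Everything arrives through fn_kwargs, so each worker process gets its
    # own pickled copy instead of relying on module-level globals.
    return custom_tokenizer(example, idx, input_vocab, input_df.columns,
                            decoder, in_out_index, max_length)

my_dataset = my_dataset.map(
    tokenize_fn,
    remove_columns=my_dataset.column_names,
    with_indices=True,
    num_proc=40,
    fn_kwargs={
        'custom_tokenizer': tokenizer,
        'input_vocab': vocab,
        'input_df': in_df,
        'decoder': decoder_dataset,
        'in_out_index': in_out_idx,
        'max_length': output_max_length,
    })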