Num_proc is not working with map

Hi All,

I have been struggling to make the map tokenization run in parallel, but I haven't been able to get it working.

Could you please advise me on this?

Here is the example code:

training_dataset = dataset.map(
    lambda example, idx: tokenize(
        example,
        idx,
        vocab,
        in_df.columns,
        decoder_dataset,
        in_out_idx,
        output_max_length,
    ),
    remove_columns=dataset.column_names,
    with_indices=True,
    num_proc=40,
)

num_proc only makes sense for slow tokenizers. If tokenizer.is_fast returns True, you should use map in batched mode and set num_proc=None: fast tokenizers automatically tokenize a batch of samples in parallel, since they are written in Rust and handle parallelism internally, which does not combine well with Python multiprocessing.
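As a minimal sketch, assuming a Hugging Face fast tokenizer and a dataset with a "text" column (both are assumptions, not taken from your post), batched mode would look roughly like this:

from transformers import AutoTokenizer

# Hypothetical Rust-backed fast tokenizer; tokenizer.is_fast is True here.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Each batch is tokenized in parallel inside the Rust tokenizer,
# so num_proc is left at its default of None.
tokenized_dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=dataset.column_names,
)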

Hi @mariosasko,

Thanks for the reply.

However, this tokenizer is a custom one written purely in Python. Is there any way to parallelize the mapping process in this case?

Yes, in that scenario num_proc (not too high; os.cpu_count() is a good upper bound) combined with batched=True should yield the best performance.
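For example, a sketch along these lines, assuming a hypothetical tokenize_batch wrapper that applies your custom tokenizer to a whole batch (the wrapper is not from your code):

import os

# In batched mode the mapped function receives a dict of lists (and a list
# of indices when with_indices=True), so the custom tokenizer needs a
# batch-aware wrapper such as the hypothetical tokenize_batch below.
training_dataset = dataset.map(
    lambda batch, indices: tokenize_batch(batch, indices, vocab),
    with_indices=True,
    batched=True,
    num_proc=os.cpu_count(),
    remove_columns=dataset.column_names,
)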

Thanks @mariosasko,
Let me give it a try and get back to you.

If you try that and it doesn't work, try passing the custom tokenizer and its arguments in via fn_kwargs. The .map function does not play nicely with globally defined variables. For example:

my_dataset = my_dataset.map(
    # fn_kwargs entries are passed to the function as keyword arguments,
    # so the lambda takes them as named parameters.
    lambda example, idx, custom_tokenizer, input_vocab, input_df, decoder, in_out_index, max_length: custom_tokenizer(
        example,
        idx,
        input_vocab,
        input_df.columns,
        decoder,
        in_out_index,
        max_length,
    ),
    remove_columns=my_dataset.column_names,
    with_indices=True,
    num_proc=40,
    fn_kwargs={
        'custom_tokenizer': tokenize,
        'input_vocab': vocab,
        'input_df': in_df,
        'decoder': decoder_dataset,
        'in_out_index': in_out_idx,
        'max_length': output_max_length,
    },
)

I would personally use a def function instead and feed the parameters in through fn_kwargs.
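A minimal sketch of that def-based variant, reusing the same hypothetical names from the snippet above:

def tokenize_example(example, idx, custom_tokenizer, input_vocab, input_df,
                     decoder, in_out_index, max_length):
    # Receives the fn_kwargs entries as keyword arguments.
    return custom_tokenizer(example, idx, input_vocab, input_df.columns,
                            decoder, in_out_index, max_length)

my_dataset = my_dataset.map(
    tokenize_example,
    remove_columns=my_dataset.column_names,
    with_indices=True,
    num_proc=40,
    fn_kwargs={
        'custom_tokenizer': tokenize,
        'input_vocab': vocab,
        'input_df': in_df,
        'decoder': decoder_dataset,
        'in_out_index': in_out_idx,
        'max_length': output_max_length,
    },
)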