num_proc is not working with map

Hi All,

I have been struggling to parallelize tokenization with map, but I can't get it to work.

Could you please advise me on this?

Here is the example code.

training_dataset = dataset.map(
    lambda example, idx: tokenize(
        example,
        idx,
        vocab,
        in_df.columns,
        decoder_dataset,
        in_out_idx,
        output_max_length,
    ),
    remove_columns=dataset.column_names,
    with_indices=True,
    num_proc=40,
)

num_proc only makes sense for slow tokenizers. If tokenizer.is_fast returns True, use map in batched mode and set num_proc=None: fast tokenizers are written in Rust and already tokenize a batch of samples in parallel, and their internal parallelism does not combine well with Python multiprocessing.
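For reference, here is a minimal sketch of what batched mode changes: the mapped function receives a dict of lists (one list per column) rather than a single example. The column name "text" and the toy whitespace tokenizer below are illustrative stand-ins, not your actual tokenize() function:

```python
# Sketch of a batched map function. In batched mode, `batch` is a
# dict mapping column names to lists of values for the whole batch.
def tokenize_batch(batch):
    # batch["text"] is a list of strings; with a fast tokenizer you
    # would pass the whole list at once, e.g. tokenizer(batch["text"]).
    # A toy whitespace tokenizer is used here just to show the shapes.
    return {"tokens": [text.split() for text in batch["text"]]}

# With the datasets library this would be wired up roughly as:
# training_dataset = dataset.map(tokenize_batch, batched=True, num_proc=None)
```

The returned dict must again map column names to lists of the same length as the input batch.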

Hi @mariosasko,

Thanks for the reply.

However, my current tokenizer is a custom one written purely in Python. Is there any way to parallelize the mapping process in this case?

Yes. Setting num_proc (not too high; os.cpu_count() is a good upper bound) together with batched=True should yield the best performance in that scenario.
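A rough sketch of that setup, assuming a custom pure-Python tokenizer (the tokenize_batch body below is a hypothetical placeholder for your own tokenize() logic, and "text" is an illustrative column name):

```python
import os

# Hypothetical batched version of a custom pure-Python tokenizer.
# `batch` is a dict of lists; `idxs` is the list of example indices
# that with_indices=True would supply.
def tokenize_batch(batch, idxs):
    # Placeholder tokenization: map each character to its code point.
    return {"input_ids": [[ord(c) for c in text] for text in batch["text"]]}

num_proc = os.cpu_count()  # a reasonable upper bound for num_proc

# With the datasets library this would be wired up roughly as:
# training_dataset = dataset.map(
#     tokenize_batch,
#     batched=True,
#     with_indices=True,
#     num_proc=num_proc,
# )
```

Each worker process then handles a shard of the dataset, and batching amortizes the per-call Python overhead inside each worker.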

Thanks @mariosasko,
Let me give it a try and get back to you.