Hi all,
I have been struggling to make the map tokenization parallel, but I haven't been able to get it working.
Could you please advise me on this?
Here is the example code:
training_dataset = dataset.map(
    lambda example, idx: tokenize(
        example,
        idx,
        vocab,
        in_df.columns,
        decoder_dataset,
        in_out_idx,
        output_max_length,
    ),
    remove_columns=dataset.column_names,
    with_indices=True,
    num_proc=40,
)
num_proc only makes sense for slow tokenizers. If tokenizer.is_fast returns True, you should use map in batched mode (fast tokenizers automatically tokenize a batch of samples in parallel) and set num_proc=None to parallelize the processing (fast tokenizers are written in Rust, and their internal parallelism is not compatible with Python's multiprocessing).
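For reference, a minimal sketch of that batched fast-tokenizer pattern; the checkpoint name and the "text" column are placeholders, not taken from this thread:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
assert tokenizer.is_fast  # Rust-backed tokenizer; batches are parallelized internally

def tokenize_batch(batch):
    # With batched=True, `batch` is a dict mapping column names to lists of values.
    return tokenizer(batch["text"], truncation=True)

training_dataset = dataset.map(
    tokenize_batch,
    batched=True,    # the fast tokenizer processes the whole batch in parallel
    num_proc=None,   # no Python multiprocessing for fast tokenizers
    remove_columns=dataset.column_names,
)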
Hi @mariosasko,
Thanks for the reply.
However, this tokenizer is a custom one written purely in Python. Is there any way to parallelize the mapping process in this case?
Yes, num_proc (but not too high; e.g., os.cpu_count() is a good value) and batched=True should yield the best performance in that scenario.
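As a rough sketch of that suggestion (assuming the thread's tokenize() returns a dict of features per example), a batched wrapper could look like this:

import os

def tokenize_batch(batch, indices):
    # With batched=True, `batch` is a dict of column name -> list of values;
    # rebuild per-example dicts, tokenize each one, then re-assemble the
    # results into the dict-of-lists format map() expects.
    examples = [{k: batch[k][i] for k in batch} for i in range(len(indices))]
    encoded = [
        tokenize(ex, idx, vocab, in_df.columns, decoder_dataset,
                 in_out_idx, output_max_length)
        for ex, idx in zip(examples, indices)
    ]
    return {key: [enc[key] for enc in encoded] for key in encoded[0]}

training_dataset = dataset.map(
    tokenize_batch,
    with_indices=True,
    batched=True,               # amortizes per-call overhead across examples
    num_proc=os.cpu_count(),    # one worker per core, as suggested above
    remove_columns=dataset.column_names,
)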
Thanks @mariosasko,
Let me give it a try and get back to you.