Hi All,
I have been struggling to make the map tokenization run in parallel, but I haven't been able to get it working.
Could you please suggest how to approach this?
Here is the example code.
training_dataset = dataset.map(
    lambda example, idx: tokenize(
        example,
        idx,
        vocab,
        in_df.columns,
        decoder_dataset,
        in_out_idx,
        output_max_length,
    ),
    remove_columns=dataset.column_names,
    with_indices=True,
    num_proc=40,
)
num_proc only makes sense for slow tokenizers. If tokenizer.is_fast returns True, you should use map in batched mode and set num_proc=None to parallelize the processing: fast tokenizers automatically tokenize a batch of samples in parallel, and because they are written in Rust, their internal parallelism does not mix with Python multiprocessing.
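For reference, with a fast tokenizer the batched call could look roughly like this (just a sketch; the checkpoint name and the "text" column are placeholders, not from the snippet above):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # loads the fast (Rust) tokenizer by default
assert tokenizer.is_fast

# Batched map: the Rust tokenizer parallelizes each batch internally,
# so num_proc is left at its default of None.
training_dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True),
    batched=True,
    remove_columns=dataset.column_names,
)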
Hi @mariosasko,
Thanks for the reply.
However, the current tokenizer is a custom one written purely in Python. Is there any way to parallelize the mapping process in this case?
Yes, num_proc (but not too high; e.g., os.cpu_count() is a good number) combined with batched=True should yield the best performance in that scenario.
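In case it helps, here is a minimal sketch of that combination using the pure-Python tokenize from the first snippet (it assumes tokenize returns the token ids for a single row and that naming the output column input_ids is acceptable):

import os

def tokenize_batch(batch, indices):
    # Rebuild per-row example dicts from the batched columnar format,
    # then run the pure-Python tokenize() on each row.
    rows = [dict(zip(batch.keys(), values)) for values in zip(*batch.values())]
    return {
        "input_ids": [
            tokenize(example, idx, vocab, in_df.columns,
                     decoder_dataset, in_out_idx, output_max_length)
            for example, idx in zip(rows, indices)
        ]
    }

training_dataset = dataset.map(
    tokenize_batch,
    batched=True,
    with_indices=True,
    num_proc=os.cpu_count(),  # one worker per core, as suggested above
    remove_columns=dataset.column_names,
)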
Thanks @mariosasko,
Let me give it a try and get back to you.
If you try that and it doesn't work, try passing the custom tokenizer in via fn_kwargs. The .map function does not play nicely with globally defined variables. For example:
# The keys in fn_kwargs are passed to the mapped function as keyword
# arguments, so the lambda has to accept them by name.
my_dataset = my_dataset.map(
    lambda example, idx, custom_tokenizer, input_vocab, input_df,
           decoder, in_out_index, max_length: custom_tokenizer(
        example,
        idx,
        input_vocab,
        input_df.columns,
        decoder,
        in_out_index,
        max_length,
    ),
    remove_columns=my_dataset.column_names,
    with_indices=True,
    num_proc=40,
    fn_kwargs={
        'custom_tokenizer': tokenizer,
        'input_vocab': vocab,
        'input_df': in_df,
        'decoder': decoder_dataset,
        'in_out_index': in_out_idx,
        'max_length': output_max_length,
    })
I would personally use a function defined with def instead and feed the parameters in through fn_kwargs.
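Roughly what that could look like with the same names as above (still a sketch, not tested against the actual custom tokenizer):

def tokenize_fn(example, idx, custom_tokenizer, input_vocab, input_df,
                decoder, in_out_index, max_length):
    # Everything arrives through fn_kwargs, so each worker process gets its
    # own pickled copy instead of relying on module-level globals.
    return custom_tokenizer(example, idx, input_vocab, input_df.columns,
                            decoder, in_out_index, max_length)

my_dataset = my_dataset.map(
    tokenize_fn,
    remove_columns=my_dataset.column_names,
    with_indices=True,
    num_proc=40,
    fn_kwargs={
        'custom_tokenizer': tokenizer,
        'input_vocab': vocab,
        'input_df': in_df,
        'decoder': decoder_dataset,
        'in_out_index': in_out_idx,
        'max_length': output_max_length,
    })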