I am using BART and its `BartTokenizerFast` for a Seq2Seq application. Since my dataset is fixed (i.e., I'm not using any kind of data augmentation or transformation during training), I thought that the most sensible option would be:
- Tokenizing all the sequences in the dataset in a preprocessing step, without padding.
- Padding the sequences on the fly for each mini-batch, using `DataCollatorForSeq2Seq` as the dataloader's `collate_fn` (see the sketch after this list).
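For context, here is a minimal sketch of my setup (the dataset loading and column names below are placeholders, not my actual code):

```python
from datasets import load_dataset
from torch.utils.data import DataLoader
from transformers import (
    BartForConditionalGeneration,
    BartTokenizerFast,
    DataCollatorForSeq2Seq,
)

tokenizer = BartTokenizerFast.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

# Placeholder dataset with "source" and "target" text columns.
dataset = load_dataset("json", data_files="train.json")["train"]

def tokenize_fn(batch):
    # Tokenize once, up front, without padding.
    model_inputs = tokenizer(batch["source"], truncation=True)
    # text_target requires a recent transformers version;
    # older versions use tokenizer.as_target_tokenizer().
    labels = tokenizer(text_target=batch["target"], truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(
    tokenize_fn, batched=True, remove_columns=dataset.column_names
)

# Pad dynamically per mini-batch; num_workers > 0 is where the fork
# (and hence the warning) happens.
collator = DataCollatorForSeq2Seq(tokenizer, model=model)
loader = DataLoader(
    tokenized, batch_size=16, shuffle=True,
    num_workers=4, collate_fn=collator,
)
```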
However, when I do this, I get the warning:
```
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
```
I understand that this warning happens because I call the tokenizer before the dataloader's worker processes are forked, i.e., before iterating through the dataloader. My questions are:
- Am I wrong in assuming that pre-tokenization is, in my case, a better option than tokenizing each mini-batch in the dataloader?
- Is it safe to ignore the warning in this situation?
- Should I set `TOKENIZERS_PARALLELISM=true` or `TOKENIZERS_PARALLELISM=false`, or does neither option make any difference in my case? (See the snippet below for how I assume the variable would be set.)
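For reference, if setting the variable from Python is the recommended fix, I assume it has to happen before the tokenizer is first used, something like this (a minimal sketch, not my actual training script):

```python
import os

# Assumption: the variable must be set before the fast tokenizer is
# first used, so the underlying Rust tokenizers library reads it
# when it initializes.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

from transformers import BartTokenizerFast  # imported after setting the variable

tokenizer = BartTokenizerFast.from_pretrained("facebook/bart-base")
```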