Pre-tokenization vs. mini-batch tokenization and TOKENIZERS_PARALLELISM warning

I am using BART and its BartTokenizerFast for a Seq2Seq application. Since my dataset is fixed (i.e., I’m not using any kind of data augmentation or transformation during the training process), I thought that the most sensible option would be:

  • Tokenizing all the sequences in the dataset in a preprocessing step, without padding.
  • Padding the sequences online as needed for each mini-batch, using DataCollatorForSeq2Seq as the dataloader collate_fn.
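
Roughly, the setup looks like the sketch below. The checkpoint name, the raw_dataset variable, the "source"/"target" column names, and the max_length / batch_size / num_workers values are just placeholders for my actual setup, and the text_target argument assumes a reasonably recent version of transformers:

    from torch.utils.data import DataLoader
    from transformers import (BartForConditionalGeneration, BartTokenizerFast,
                              DataCollatorForSeq2Seq)

    tokenizer = BartTokenizerFast.from_pretrained("facebook/bart-base")
    model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

    # Preprocessing step: tokenize the whole dataset once, truncating but NOT padding.
    def preprocess(batch):
        model_inputs = tokenizer(batch["source"], truncation=True, max_length=1024)
        labels = tokenizer(text_target=batch["target"], truncation=True, max_length=1024)
        model_inputs["labels"] = labels["input_ids"]
        return model_inputs

    # raw_dataset is assumed to be a datasets.Dataset with "source" and "target" columns.
    tokenized = raw_dataset.map(preprocess, batched=True,
                                remove_columns=["source", "target"])

    # Online padding: the collator pads each mini-batch to its own longest sequence
    # (and pads labels with -100 so they are ignored by the loss).
    collator = DataCollatorForSeq2Seq(tokenizer, model=model)
    loader = DataLoader(tokenized, batch_size=8, shuffle=True,
                        num_workers=4, collate_fn=collator)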

However, when I do this, I get the warning:

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
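
I could set the variable like this, before anything from transformers / tokenizers is imported (or export it in the shell that launches the script), but I’m not sure which value is the right one here:

    import os

    # Must be set before the tokenizers library is imported or used for the first time.
    os.environ["TOKENIZERS_PARALLELISM"] = "false"  # or "true"? this is exactly what I'm unsure about

    from transformers import BartTokenizerFast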

I understand that this warning appears because I use the tokenizer before the DataLoader worker processes are forked, i.e., before I start iterating through the dataloader. My questions are:

  1. Am I wrong in assuming that pre-tokenization is, in my case, a better option than tokenizing each mini-batch in the dataloader?

  2. Is it safe to ignore the warning in this situation?

  3. Should I set TOKENIZERS_PARALLELISM=true or TOKENIZERS_PARALLELISM=false, or does the choice make no difference in my case?


I’m running into the same wall. Do you have a recommended approach?

@nielsr I’m also facing the same problem. Could you offer some guidance here?