I am using BART and its BartTokenizerFast for a Seq2Seq application. Since my dataset is fixed (i.e., I'm not applying any kind of data augmentation or transformation during training), I thought that the most sensible option would be:
- Tokenizing all the sequences in the dataset in a preprocessing step, without padding.
- Padding the sequences dynamically as needed for each mini-batch, using DataCollatorForSeq2Seq as the collate function of the dataloader.
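To make the setup concrete, here is a minimal plain-Python sketch of what I mean by "tokenize once, pad per batch" (the token ids and pad id are illustrative stand-ins; the real pipeline uses BartTokenizerFast and DataCollatorForSeq2Seq):

```python
PAD_ID = 1  # illustrative pad token id (BART uses 1)

# Pretend these are pre-tokenized, un-padded examples produced once,
# in a preprocessing step, before training starts.
dataset = [
    {"input_ids": [0, 31414, 232, 2]},
    {"input_ids": [0, 9064, 6, 42, 16, 10, 1181, 3645, 2]},
]

def collate(batch):
    # Pad each mini-batch only up to its own longest sequence
    # ("dynamic padding"), which is what the collator does for me.
    max_len = max(len(ex["input_ids"]) for ex in batch)
    return {
        "input_ids": [
            ex["input_ids"] + [PAD_ID] * (max_len - len(ex["input_ids"]))
            for ex in batch
        ],
        "attention_mask": [
            [1] * len(ex["input_ids"]) + [0] * (max_len - len(ex["input_ids"]))
            for ex in batch
        ],
    }

batch = collate(dataset)
```

The point of this scheme is that tokenization happens exactly once per example, while padding cost adapts to each batch instead of padding everything to a global maximum length.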
However, when I do this, I get the warning:
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... To disable this warning, you can either: - Avoid using `tokenizers` before the fork if possible - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
I understand that this warning happens because I call the tokenizer before the dataloader's worker processes are forked, i.e., before iterating through the dataloader. My questions are:
Am I wrong in assuming that pre-tokenization is, in my case, a better option than tokenizing each mini-batch in the dataloader?
Is it safe to ignore the warning in this situation?
Should I set TOKENIZERS_PARALLELISM=true or TOKENIZERS_PARALLELISM=false, or does the choice make no difference in my case?
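For context, if I do set the variable, I would set it at the very top of my training script, before any `transformers`/`tokenizers` import, as the warning suggests (the value "false" here is just a placeholder, since which value to use is exactly my question):

```python
import os

# Must be set before `tokenizers` is imported/used, otherwise the
# Rust thread pool may already exist when the dataloader forks.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```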