I am using BART and its `BartTokenizerFast` for a Seq2Seq application. Since my dataset is fixed (i.e., I'm not using any kind of data augmentation or transformation during training), I thought that the most sensible option would be:
- Tokenizing all the sequences in the dataset in a preprocessing step, without padding.
- Padding the sequences on the fly for each mini-batch, using `DataCollatorForSeq2Seq` as the dataloader's `collate_fn` (see the sketch after this list).
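For context, here is a minimal sketch of my setup (the dataset loading and column names below are placeholders, not my actual code):

```python
from datasets import load_dataset
from torch.utils.data import DataLoader
from transformers import (
    BartForConditionalGeneration,
    BartTokenizerFast,
    DataCollatorForSeq2Seq,
)

tokenizer = BartTokenizerFast.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

# Placeholder dataset with "source" and "target" text columns.
dataset = load_dataset("json", data_files="train.json")["train"]

def tokenize_fn(batch):
    # Tokenize once, up front, without padding.
    model_inputs = tokenizer(batch["source"], truncation=True)
    # text_target requires a recent transformers version;
    # older versions use tokenizer.as_target_tokenizer().
    labels = tokenizer(text_target=batch["target"], truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(
    tokenize_fn, batched=True, remove_columns=dataset.column_names
)

# Pad dynamically per mini-batch; num_workers > 0 is where the fork
# (and hence the warning) happens.
collator = DataCollatorForSeq2Seq(tokenizer, model=model)
loader = DataLoader(
    tokenized, batch_size=16, shuffle=True,
    num_workers=4, collate_fn=collator,
)
```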
However, when I do this, I get the warning:
```
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
```
I understand that this warning happens because I call the tokenizer before the dataloader's worker processes are forked, i.e., before iterating through the dataloader. My questions are:
- Am I wrong in assuming that pre-tokenization is, in my case, a better option than tokenizing each mini-batch in the dataloader?
- Is it safe to ignore the warning in this situation?
- Should I set `TOKENIZERS_PARALLELISM=true` or `TOKENIZERS_PARALLELISM=false`, or does neither option make any difference in my case? (See the snippet below for how I assume the variable would be set.)
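For reference, if setting the variable from Python is the recommended fix, I assume it has to happen before the tokenizer is first used, something like this (a minimal sketch, not my actual training script):

```python
import os

# Assumption: the variable must be set before the fast tokenizer is
# first used, so the underlying Rust tokenizers library reads it
# when it initializes.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

from transformers import BartTokenizerFast  # imported after setting the variable

tokenizer = BartTokenizerFast.from_pretrained("facebook/bart-base")
```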