Stuck on tokenization before training when using 3 GPUs, but not when using 2 GPUs

I intend to use run_mlm.py to train RoBERTa from scratch. I have 3 A100s on my machine, so I entered the following command:

CUDA_VISIBLE_DEVICES=0,1,2 python run_mlm.py \
    --model_type roberta \
    --config_overrides="num_hidden_layers=6,max_position_embeddings=514" \
    --tokenizer_name MyModel \
    --train_file ./data/corpus_dedup.txt \
    --max_seq_length 512 \
    --line_by_line True \
    --per_device_train_batch_size 64 \
    --do_train \
    --overwrite_output_dir True \
    --gradient_accumulation_steps 4 \
    --num_train_epochs 40 \
    --fp16 True \
    --output_dir MyModel \
    --save_total_limit 1

When I try to run the training with the 3-GPU configuration, it gets stuck for dozens of hours in the tokenization step before training, with the following message:

You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.

Additionally, when I try to run the training with only 2 GPUs (CUDA_VISIBLE_DEVICES=0,1, followed by the same parameters), the training runs normally :thinking:. What can be done about this? I would really like to use all the GPUs and have fewer training steps.
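
For context, my understanding is that the step that hangs is the dataset tokenization done inside run_mlm.py. Below is a minimal sketch of how that tokenization could be done offline instead, assuming a datasets-based workflow; the num_proc value and the save_to_disk path are just illustrative, the rest mirrors my command above:

# Rough sketch: tokenize the corpus once, outside run_mlm.py.
# Tokenizer name, corpus path, and max length mirror the command above;
# num_proc and the output path are only illustrative.
from datasets import load_dataset
from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("MyModel")
raw = load_dataset("text", data_files={"train": "./data/corpus_dedup.txt"})

def tokenize(batch):
    # line_by_line-style tokenization, truncated to max_seq_length=512
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = raw.map(
    tokenize,
    batched=True,
    num_proc=8,               # parallel tokenization workers (illustrative)
    remove_columns=["text"],
)
tokenized.save_to_disk("./data/corpus_dedup_tokenized")  # illustrative path

If I understand correctly, run_mlm.py also exposes a --preprocessing_num_workers option that parallelizes this same map step, but I would still prefer to keep using the script end to end if the 3-GPU hang can be avoided.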