I intend to use run_mlm.py to train RoBERTa from scratch. I have three A100s on my machine, so I ran the following command:
CUDA_VISIBLE_DEVICES=0,1,2 python run_mlm.py \
--model_type roberta \
--config_overrides="num_hidden_layers=6,max_position_embeddings=514" \
--tokenizer_name MyModel \
--train_file ./data/corpus_dedup.txt \
--max_seq_length 512 \
--line_by_line True \
--per_device_train_batch_size 64 \
--do_train \
--overwrite_output_dir True \
--gradient_accumulation_steps 4 \
--num_train_epochs 40 \
--fp16 True \
--output_dir MyModel \
--save_total_limit 1
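(In case it is relevant: my understanding is that launching with plain python makes the Trainer fall back to nn.DataParallel across the visible GPUs. A DistributedDataParallel launch would, I assume, look roughly like this, with the script arguments left unchanged; I have not verified this fixes anything:)

CUDA_VISIBLE_DEVICES=0,1,2 torchrun --nproc_per_node=3 run_mlm.py \
--model_type roberta \
... (same arguments as above)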
When I run the training with all three GPUs, it gets stuck for dozens of hours in the tokenization step before training starts, with the following message:
You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
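For reference, this is what I understand the warning to be contrasting; a minimal sketch with made-up texts, assuming my tokenizer files live in the MyModel directory from the command above:

from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("MyModel")
texts = ["first example line", "a second, somewhat longer example line"]

# Fast path: a single __call__ tokenizes and pads the whole batch at once
batch = tokenizer(texts, padding=True, truncation=True, max_length=512)

# Slow path the warning refers to: encode each text, then pad afterwards
ids = [tokenizer.encode(t, truncation=True, max_length=512) for t in texts]
batch = tokenizer.pad({"input_ids": ids}, padding=True)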
Additionally, when I try to train with only two GPUs (CUDA_VISIBLE_DEVICES=0,1, followed by the same parameters), training runs normally. What can be done about this? I would really like to use all three GPUs and have fewer training steps.
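For context, the smaller step count is what the third GPU buys me. My back-of-the-envelope arithmetic for the effective batch per optimizer step, assuming the Trainer multiplies these settings the way I think it does:

# Effective batch size per optimizer step with my settings
per_device_train_batch_size = 64
gradient_accumulation_steps = 4

for num_gpus in (2, 3):
    effective = per_device_train_batch_size * num_gpus * gradient_accumulation_steps
    print(num_gpus, "GPUs ->", effective, "sequences per update")
# 2 GPUs -> 512 sequences per update
# 3 GPUs -> 768 sequences per update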