I am using the transformers/examples/pytorch/language-modeling/run_mlm.py file to train a BERT Model from scratch on a SLURM Cluster.
I use “accelerate launch” to launch the distributed training across multiple GPUs. Training on a single machine works fine but takes too long, so I want to utilize multiple machines/nodes.
The accelerate/examples/slurm/submit_multinode.sh file offers an example on how to launch a training script with SLURM on multiple nodes with accelerate launch.
When I try to launch the run_mlm.py script in the same manner, the socket times out because the nodes never establish communication (increasing the timeout does not fix this):
File ".env/lib/python3.9/site-packages/torch/distributed/elastic/utils/store.py", line 64, in synchronize
    agent_data = get_all(store, rank, key_prefix, world_size)
File ".env/lib/python3.9/site-packages/torch/distributed/elastic/utils/store.py", line 34, in get_all
    data = store.get(f"{prefix}{idx}")
RuntimeError: Socket Timeout
I tried to work out what accelerate/examples/complete_nlp_example.py, which the example trains, does differently, but could not figure it out!
How would I need to configure the run_mlm.py script to make it runnable across multiple nodes via “accelerate launch”? That is, are there any special requirements for taking a training script from multi-GPU on a single node to multiple GPU nodes?
My shell script stays as close as possible to the submit_multinode.sh example; this is the launch command:
srun accelerate launch \
--multi_gpu \
--num_processes 8 \
--num_machines 4 \
--mixed_precision fp16 \
--dynamo_backend cudagraphs \
--main_process_ip $head_node_ip \
--main_process_port 29500 \
--machine_rank 0 \
"./transformers/examples/pytorch/language-modeling/run_mlm.py" \
--model_type bert \
--config_name ./config.json \
--tokenizer_name hyperinfer/legal-tokenizer \
--dataset_name hyperinfer/old_cases_and_laws \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 96 \
--learning_rate 1e-4 \
--dataloader_pin_memory \
--dataloader_num_workers 16 \
--fp16 \
--do_train \
--do_eval \
--output_dir ./output \
--overwrite_output_dir
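For reference, the relevant part of the submit_multinode.sh example I am mirroring looks roughly like this (paraphrased from memory; job directives, GPU count, and paths are placeholders for my setup, not the example's exact values):

```shell
#!/bin/bash
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1   # one accelerate launcher per node
#SBATCH --gres=gpu:2          # 2 GPUs per node -> 8 processes total

export GPUS_PER_NODE=2

# Resolve the head node of the allocation; every launcher rendezvouses there.
head_node_ip=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

# The example builds the launcher as a single env var because accelerate
# launch does not handle multiline arguments well through srun.
export LAUNCHER="accelerate launch \
    --num_processes $((SLURM_NNODES * GPUS_PER_NODE)) \
    --num_machines $SLURM_NNODES \
    --rdzv_backend c10d \
    --main_process_ip $head_node_ip \
    --main_process_port 29500 \
    "
export SCRIPT="./transformers/examples/pytorch/language-modeling/run_mlm.py"

# srun starts one launcher per node; each spawns GPUS_PER_NODE workers.
srun $LAUNCHER $SCRIPT --model_type bert  # plus the same script args as above
```

Note that the example relies on the c10d rendezvous backend rather than passing an explicit per-node rank, which is one of the places my script may differ.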
The model config.json:
{
"attention_probs_dropout_prob": 0.1,
"classifier_dropout": null,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"initializer_range": 0.02,
"intermediate_size": 3072,
"layer_norm_eps": 1e-12,
"max_position_embeddings": 1024,
"model_type": "bert",
"num_attention_heads": 12,
"num_hidden_layers": 12,
"pad_token_id": 0,
"position_embedding_type": "absolute",
"transformers_version": "4.35.2",
"type_vocab_size": 2,
"use_cache": true,
"vocab_size": 100000
}