Accelerate Multi-Node Training

I am using transformers/examples/pytorch/language-modeling/run_mlm.py to train a BERT model from scratch on a SLURM cluster.

I use “accelerate launch” to launch the distributed training across multiple GPUs. Training on a single machine works fine but takes too long, so I want to use multiple machines/nodes.

The accelerate/examples/slurm/submit_multinode.sh file offers an example of how to launch a training script on multiple nodes with SLURM and accelerate launch.
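
As far as I can tell, that example boils down to something like the following sketch (node counts, GPU counts, and the script path are placeholders, and I am paraphrasing rather than quoting the file):

#!/bin/bash
#SBATCH --job-name=multinode-accelerate
#SBATCH --nodes=4                 # number of machines
#SBATCH --ntasks-per-node=1       # one "accelerate launch" per node
#SBATCH --gres=gpu:2              # GPUs per node (placeholder)

# Resolve the IP of the first node in the allocation so that every node
# can rendezvous with it.
nodes=( $(scontrol show hostnames "$SLURM_JOB_NODELIST") )
head_node=${nodes[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)

# One launcher per node; with the c10d rendezvous backend the node ranks are
# assigned at rendezvous time, so the same command can run on every node.
srun accelerate launch \
  --multi_gpu \
  --num_processes $((SLURM_NNODES * 2)) \
  --num_machines "$SLURM_NNODES" \
  --rdzv_backend c10d \
  --main_process_ip "$head_node_ip" \
  --main_process_port 29500 \
  path/to/training_script.py --script-args ...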

When I try to launch the run_mlm.py script in the same manner, the socket times out due to no communication between the nodes (increasing the timeout does not fix it):

File ".env/lib/python3.9/site-packages/torch/distributed/elastic/utils/store.py", line 64, in synchronize
    data = store.get(f"{prefix}{idx}")
  File ".env/lib/python3.9/site-packages/torch/distributed/elastic/utils/store.py", line 34, in get_all
RuntimeError: Socket Timeout
    agent_data = get_all(store, rank, key_prefix, world_size)
RuntimeError: Socket Timeout
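
A quick sanity check along these lines at least confirms that every node sees the same head-node IP and can reach it (a rough sketch; head_node_ip is the variable set earlier in the batch script, and ICMP may be blocked on some clusters):

# Pre-launch reachability check: each node prints the head-node IP it sees
# and pings it once. head_node_ip has to be exported for the sub-shell.
export head_node_ip
srun --ntasks-per-node=1 bash -c \
  'echo "$(hostname) -> head node at ${head_node_ip}"; ping -c 1 -W 2 "$head_node_ip"'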

I tried to work out what accelerate/examples/complete_nlp_example.py (the training script used in that example) does differently, but could not figure it out!

How would I need to configure the run_mlm.py script so that it can run across multiple nodes via “accelerate launch”? I.e., are there any special requirements for taking a training script from single-node multi-GPU to multiple GPU nodes?

My shell script stays as close as possible to the submit_multinode.sh example; this is my launch command:

srun accelerate launch \
--multi_gpu \
--num_processes 8 \
--num_machines 4 \
--mixed_precision fp16 \
--dynamo_backend cudagraphs \
--main_process_ip $head_node_ip \
--main_process_port 29500 \
--machine_rank 0 \
"./transformers/examples/pytorch/language-modeling/run_mlm.py" \
--model_type bert \
--config_name ./config.json \
--tokenizer_name hyperinfer/legal-tokenizer \
--dataset_name hyperinfer/old_cases_and_laws \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 96 \
--learning_rate 1e-4 \
--dataloader_pin_memory \
--dataloader_num_workers 16 \
--fp16 \
--do_train \
--do_eval \
--output_dir ./output \
--overwrite_output_dir
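
In case it is relevant: as far as I understand the default (static) rendezvous, each machine needs its own --machine_rank (0 on the main node, 1 to 3 on the others). Under srun that per-node value could be filled in roughly like this (a sketch only, assuming one task per node so that SLURM_PROCID equals the node index; the remaining run_mlm.py arguments are omitted):

# Sketch: run the launcher through a per-node shell so that SLURM_PROCID is
# expanded on each node (requires #SBATCH --ntasks-per-node=1).
# head_node_ip must be exported beforehand.
export head_node_ip
srun bash -c 'accelerate launch \
  --multi_gpu \
  --num_processes 8 \
  --num_machines 4 \
  --mixed_precision fp16 \
  --machine_rank "$SLURM_PROCID" \
  --main_process_ip "$head_node_ip" \
  --main_process_port 29500 \
  ./transformers/examples/pytorch/language-modeling/run_mlm.py \
  --model_type bert --config_name ./config.json'
# ...plus the remaining run_mlm.py arguments from the command above.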

The model config.json:

{
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 1024,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.35.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 100000
}