Hi, I’m trying to run a training script with PyTorch distributed on a cluster of single-GPU nodes, but I’m getting the following error:
ERROR:torch.distributed.elastic.agent.server.api:Error waiting on exit barrier. Elapsed: 310.2471537590027 seconds
This happens in the following part of the code:
with training_args.main_process_first(desc="train dataset map pre-processing"):
    dataset_splits = preprocessing(
        dataset=raw_dataset,
        data_training_args=data_training_args,
        tokenizer=tokenizer,
        schema_encoding_fn=spider_schema_encoding,
        additional_preprocessing_fn=spider_preprocessing,
    )
To run the script I use the following command:
python -m torch.distributed.launch \
    --nproc_per_node=1 \
    --nnodes=$NNODES \
    --node_rank=$NODE_RANK \
    --master_addr=$MASTER_ADDR \
    --master_port=$MASTER_PORT \
    notebooks/train.py --config-path configs/config.json
All the environment variables ($NNODES, $NODE_RANK, $MASTER_ADDR, $MASTER_PORT) are set when launching the training.
Am I doing something wrong? It seems the exit barrier is waiting for more ranks than are actually present.
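One thing I did to narrow it down: print the distributed environment each process actually sees, since the exit barrier times out when WORLD_SIZE disagrees with the number of ranks that really show up (it should equal nnodes × nproc_per_node). A small diagnostic I run on each node (the script name check_env.py is just my own):

```python
import os

# Print the distributed environment variables each rank sees at startup.
# If WORLD_SIZE here does not match nnodes * nproc_per_node, the barriers
# will wait for ranks that never join.
def describe_env():
    keys = ["RANK", "LOCAL_RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT"]
    return {k: os.environ.get(k, "<unset>") for k in keys}

if __name__ == "__main__":
    for key, value in describe_env().items():
        print(f"{key}={value}")
```

Running this under the same launch command on every node shows whether each process agrees on WORLD_SIZE and gets a distinct RANK.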