Errors when training on multi-node, single-GPU setup

Hi, I’m trying to run a training script with PyTorch distributed on a cluster of single-GPU nodes; however, I’m getting the following error:

ERROR:torch.distributed.elastic.agent.server.api:Error waiting on exit barrier. Elapsed: 310.2471537590027 seconds

This happens in the following part of the code:

with training_args.main_process_first(desc="train dataset map pre-processing"):
    dataset_splits = preprocessing(
        dataset=raw_dataset,
        data_training_args=data_training_args,
        tokenizer=tokenizer,
        schema_encoding_fn=spider_schema_encoding,
        additional_preprocessing_fn=spider_preprocessing,
    )

To run the script, I use the following command:

python -m torch.distributed.launch \
        --nproc_per_node=1  \
        --nnodes=$NNODES \
        --node_rank=$NODE_RANK \
        --master_addr=$MASTER_ADDR \
        --master_port=$MASTER_PORT \
        notebooks/train.py --config-path configs/config.json

All of these environment variables are set when launching the training.

Am I doing something wrong? It seems that the barrier is waiting for more local processes than each node has.

This may be due to the timeout being reached before your preprocessing is finished: the preprocessing is only run on the local main process and the results are cached for the others, but there is a barrier at the end for all of them to join.
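For context, here is a simplified sketch of the pattern behind a main_process_first-style context manager (not the exact transformers implementation; the real one also handles the non-distributed case and takes a local/desc argument, whereas this sketch just takes a plain is_main_process flag for brevity):

import contextlib
import torch.distributed as dist

@contextlib.contextmanager
def main_process_first(is_main_process: bool):
    # Simplified sketch: non-main processes wait at a barrier until the
    # main process has executed the body (and written its cache), then
    # run the body themselves; the main process releases them at the end.
    try:
        if not is_main_process:
            dist.barrier()  # wait for the main process to finish first
        yield
    finally:
        if is_main_process:
            dist.barrier()  # release the waiting processes

With one process per node (and the default local behaviour), every process counts as its own local main process, so they all run the body and then meet at that final barrier, which is where a node that finishes early ends up waiting.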

Since you are running on a cluster of single-GPU nodes, what’s happening is that one of the GPUs finished way earlier than the others and it seems like it got tired of waiting (unless there is something missing from the traceback you didn’t include). You can remove that line, since the preprocessing will need to be done on all of your nodes anyway.
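Concretely, that means dropping the context manager and calling the preprocessing directly (a sketch reusing the names from your snippet, nothing else changed):

# With one process per node there is no local sibling process to wait for,
# so the context manager only adds a barrier that can time out.
dataset_splits = preprocessing(
    dataset=raw_dataset,
    data_training_args=data_training_args,
    tokenizer=tokenizer,
    schema_encoding_fn=spider_schema_encoding,
    additional_preprocessing_fn=spider_preprocessing,
)

Each node will then preprocess (or reuse its own cache) independently, without waiting for the others at that point.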