Errors when training on multi-node, single-GPU setup

Hi, I’m trying to run a training script with PyTorch distributed on a cluster of single-GPU nodes; however, I’m getting the following error:

ERROR:torch.distributed.elastic.agent.server.api:Error waiting on exit barrier. Elapsed: 310.2471537590027 seconds

This happens in the following part of the code:

with training_args.main_process_first(desc="train dataset map pre-processing"):
    dataset_splits = preprocessing(
        dataset=raw_dataset,
        data_training_args=data_training_args,
        tokenizer=tokenizer,
        schema_encoding_fn=spider_schema_encoding,
        additional_preprocessing_fn=spider_preprocessing,
    )

To run the script, I use the following command:

python -m torch.distributed.launch \
        --nproc_per_node=1  \
        --nnodes=$NNODES \
        --node_rank=$NODE_RANK \
        --master_addr=$MASTER_ADDR \
        --master_port=$MASTER_PORT \
        notebooks/train.py --config-path configs/config.json

All of these environment variables are set when launching the training.

Am I doing something wrong? It seems that the barrier is waiting for more local processes than each node has.

This may be due to the timeout being reached before your preprocessing is finished: the preprocessing is only run on the local main process and the results are cached for the others, but there is a barrier at the end for all of them to join.
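For context, here is a simplified sketch of the pattern behind a main_process_first-style context manager (not the exact transformers implementation; the real one also handles the non-distributed case and takes a local/desc argument, whereas this sketch just takes a plain is_main_process flag for brevity):

import contextlib
import torch.distributed as dist

@contextlib.contextmanager
def main_process_first(is_main_process: bool):
    # Simplified sketch: non-main processes wait at a barrier until the
    # main process has executed the body (and written its cache), then
    # run the body themselves; the main process releases them at the end.
    try:
        if not is_main_process:
            dist.barrier()  # wait for the main process to finish first
        yield
    finally:
        if is_main_process:
            dist.barrier()  # release the waiting processes

With one process per node (and the default local behaviour), every process counts as its own local main process, so they all run the body and then meet at that final barrier, which is where a node that finishes early ends up waiting.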

Since you are running on a cluster of single-GPU nodes, what’s happening is that one of the GPUs finished way earlier than the others and it seems like it got tired of waiting (unless there is something missing from the traceback you didn’t include). You can remove that line, since the preprocessing will need to be done on all of your nodes anyway.
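Concretely, that means dropping the context manager and calling the preprocessing directly (a sketch reusing the names from your snippet, nothing else changed):

# With one process per node there is no local sibling process to wait for,
# so the context manager only adds a barrier that can time out.
dataset_splits = preprocessing(
    dataset=raw_dataset,
    data_training_args=data_training_args,
    tokenizer=tokenizer,
    schema_encoding_fn=spider_schema_encoding,
    additional_preprocessing_fn=spider_preprocessing,
)

Each node will then preprocess (or reuse its own cache) independently, without waiting for the others at that point.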