Hello,
I am using the run_glue_no_trainer.py (LINK) example and have modified it to my needs. I am running the example for bert-base-uncased via accelerate launch.
I added my early stopping class as follows:
class early_stopping_callback:
    def __init__(self, min_delta=0, patience=5):
        self.min_delta = min_delta
        self.patience = patience
        self.counter = 0
        self.lowest_loss = float('inf')

    def check_early_stopping(self, eval_loss):
        delta = self.lowest_loss - eval_loss
        if delta >= self.min_delta:
            # loss improved by at least min_delta: remember it and reset the counter
            self.lowest_loss = eval_loss
            self.counter = 0
        else:
            # no improvement: count this evaluation towards the patience budget
            self.counter += 1
            if self.counter >= self.patience:
                return True
        return False
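Just to illustrate the intended behaviour, here is a quick standalone check of the patience logic (the loss values are made up, not from my run):

es = early_stopping_callback(min_delta=0, patience=2)
for loss in [0.9, 0.7, 0.8, 0.8, 0.8]:
    print(loss, es.check_early_stopping(loss))
# prints False while the loss keeps improving, then True once it has
# failed to improve for `patience` consecutive calls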
In the training script it is initialized as
es_callback = early_stopping_callback()
and the condition is checked at the end of each epoch (added at around line 622 in the original file):
if args.checkpointing_steps == "epoch":
    output_dir = f"epoch_{epoch}"
    if args.output_dir is not None:
        output_dir = os.path.join(args.output_dir, output_dir)
    accelerator.save_state(output_dir)

if es_callback.check_early_stopping(eval_loss.item()):
    print(f"Stopping early after epoch {epoch}")
    break
As you can see in the picture below, the stopping criterion is reached at some point.
The weird thing is what happens next: the progress bar advances by one more step (I think), and after that training hangs completely. This also means that subsequently scheduled runs do not execute properly. Instead, I eventually get a timeout error:
[E ProcessGroupNCCL.cpp:828] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=7462, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800586 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=7463, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800814 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4174882 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 1 (pid: 4174883) of binary: /home/lange/.conda/envs/test/bin/python3.9
Traceback (most recent call last):
File "/home/test/.conda/envs/test/bin/accelerate", line 10, in <module>
sys.exit(main())
File "/home/test/.conda/envs/test/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
args.func(args)
File "/home/test/.conda/envs/test/lib/python3.9/site-packages/accelerate/commands/launch.py", line 970, in launch_command
multi_gpu_launcher(args)
File "/home/test/.conda/envs/test/lib/python3.9/site-packages/accelerate/commands/launch.py", line 646, in multi_gpu_launcher
distrib_run.run(args)
File "/home/test/.conda/envs/test/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/home/test/.conda/envs/test/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/test/.conda/envs/test/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
I suspect it might be an improper use of the break statement in combination with how Accelerate handles the distributed processes. Hopefully somebody can help out.
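In case it helps to see what I mean, here is a minimal sketch of how I imagine the stop decision could be synchronized across ranks, so that every process breaks in the same epoch. It assumes accelerator, es_callback, eval_loss and epoch as in my modified script; should_stop is just a name I made up, and I am not sure this is the idiomatic way to do it with Accelerate.

import torch

# turn the local decision into a tensor so it can be reduced across processes
should_stop = torch.tensor(
    int(es_callback.check_early_stopping(eval_loss.item())),
    device=accelerator.device,
)
# sum the flags over all ranks; every process now sees the same value
should_stop = accelerator.reduce(should_stop, reduction="sum")
if should_stop.item() > 0:
    accelerator.print(f"Stopping early after epoch {epoch}")
    break

Would something like this be the right direction, or is there a recommended pattern for early stopping with the no_trainer scripts?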