Early stopping for eval loss causes timeout?

Hello,

I am using the run_glue_no_trainer.py (LINK) example and modified it to my needs. I am running the example for bert-base-uncased using accelerate launch.

I added my early stopping class as follows:

class early_stopping_callback:
  def __init__(self, min_delta=0, patience=5):
    self.min_delta = min_delta        # minimum improvement needed to reset the counter
    self.patience = patience          # number of non-improving epochs before stopping
    self.counter = 0
    self.lowest_loss = float('inf')

  def check_early_stopping(self, eval_loss):
    delta = self.lowest_loss - eval_loss
    if delta >= self.min_delta:
      # Loss improved by at least min_delta: remember it and reset the counter.
      self.lowest_loss = eval_loss
      self.counter = 0
    else:
      # No sufficient improvement this epoch.
      self.counter += 1
      if self.counter >= self.patience:
        return True
    return False

After being initialized as

es_callback = early_stopping_callback()

the early stopping condition is checked at the end of each epoch (added at around line 622 in the original file):

        if args.checkpointing_steps == "epoch":
            output_dir = f"epoch_{epoch}"
            if args.output_dir is not None:
                output_dir = os.path.join(args.output_dir, output_dir)
            accelerator.save_state(output_dir)

        if es_callback.check_early_stopping(eval_loss.item()):
            print(f"Stopping early after epoch {epoch}")
            break
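
For what it's worth, the callback logic itself behaves as intended when tested in isolation, e.g. with made-up loss values:

es_callback = early_stopping_callback(min_delta=0, patience=5)

# Three improving epochs, then five epochs without improvement.
fake_losses = [0.9, 0.7, 0.65, 0.66, 0.67, 0.68, 0.69, 0.70]
for epoch, loss in enumerate(fake_losses):
    if es_callback.check_early_stopping(loss):
        print(f"Stopping early after epoch {epoch}")  # fires at epoch 7
        break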

As you can see in the picture below, the criterion is reached at a certain point.

The weird thing is what happens next: the progress bar advances one more step (I think), and after that the run stops doing anything. This also means that subsequently scheduled runs do not execute properly. Instead, I get a timeout error:

[E ProcessGroupNCCL.cpp:828] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=7462, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800586 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=7463, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800814 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4174882 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 1 (pid: 4174883) of binary: /home/lange/.conda/envs/test/bin/python3.9
Traceback (most recent call last):
  File "/home/test/.conda/envs/test/bin/accelerate", line 10, in <module>
    sys.exit(main())
  File "/home/test/.conda/envs/test/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/home/test/.conda/envs/test/lib/python3.9/site-packages/accelerate/commands/launch.py", line 970, in launch_command
    multi_gpu_launcher(args)
  File "/home/test/.conda/envs/test/lib/python3.9/site-packages/accelerate/commands/launch.py", line 646, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/test/.conda/envs/test/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/test/.conda/envs/test/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/test/.conda/envs/test/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 

I suspect it might be an improper use of the break statement in combination with how Accelerate works. Hopefully somebody can help out.


Thanks for the report, let me run this today and reproduce to see if I can find what's up. (The key is that there's a distributed reduce() op being run that is timing out.)

You can also try:

pip install git+https://github.com/huggingface/accelerate

And run your code with:

ACCELERATE_DEBUG_MODE="1" accelerate launch ...

This will give us a clearer error, I believe (or at least it should).

Sorry for my late answer, I did not have access to the machine running the scripts over the last few days.

Unfortunately, using ACCELERATE_DEBUG_MODE="1" did not change the error message generated, and neither did updating Accelerate.

Hi, did you manage to reproduce the error? I am still having the same problem.

Thanks for the ping, sorry! Will look at this today (truly today)

@GertDasPferd thanks! This is a DDP quirk: basically, the break gets triggered on process 0 but never on the other processes, which leads to the hang. I’ve introduced a PR here (Introduce breakpoint API by muellerzr · Pull Request #1940 · huggingface/accelerate · GitHub) to add a utility that can help with this, and you can see the basic setup there. It was inspired by this post (How to use "break" in DistributedDataParallel training - #7 by Rakshith_V - distributed-rpc - PyTorch Forums), and it made sense to just include this as part of the API for Accelerate.
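
The general pattern (not necessarily what the PR does internally, just the underlying idea from that forum post) is to turn the local stop decision into a small collective, so that every rank agrees before anyone breaks out of the loop. A rough sketch with plain torch.distributed; the helper name should_stop_all_ranks is made up:

import torch
import torch.distributed as dist

def should_stop_all_ranks(local_stop: bool, device) -> bool:
    # Each rank contributes 1 if its own early stopping criterion fired, else 0.
    flag = torch.tensor(int(local_stop), device=device)
    # All ranks join the same all_reduce, so no process is left waiting on a collective.
    dist.all_reduce(flag, op=dist.ReduceOp.SUM)
    # Stop everywhere as soon as any single rank asked to stop.
    return bool(flag.item())

# Usage inside the training loop (every rank must reach this line):
# if should_stop_all_ranks(es_callback.check_early_stopping(eval_loss.item()), accelerator.device):
#     break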

Thank you for your investigation. I have to admit that I do not understand all the details of the problem.

Anyway, this is how I understand your PR:
I would use accelerator.set_breakpoint() when checking the early stopping criterion. But where would I use the second function, accelerator.check_breakpoint()?

You'd use it immediately after the if es_callback.check_early_stopping(...) check:

E.g.:

if args.checkpointing_steps == "epoch":
    output_dir = f"epoch_{epoch}"
    if args.output_dir is not None:
        output_dir = os.path.join(args.output_dir, output_dir)
    accelerator.save_state(output_dir)

if es_callback.check_early_stopping(eval_loss.item()):
    print(f"Stopping early after epoch {epoch}")
    accelerator.set_breakpoint()
if accelerator.check_breakpoint():
    break
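
Note the placement: accelerator.set_breakpoint() only runs on the processes where the criterion fires, while accelerator.check_breakpoint() and the break sit outside that if, so every process reaches them and they all leave the loop together (which should be what prevents the stuck collective).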

Hi again. Thanks for the response. I have been trying to properly update my environment to accelerate 0.23.0 but have been having some problems. As soon as I manage to fix them, I will try out the solution and check whether it is working properly.

Cheers!

Just to give a final update: the flags introduced with your PR solved the issue! Early stopping is now working as expected.
