Is it possible to implement early stopping while using Accelerate? I know Accelerate handles distributed training for ordinary PyTorch training loops, but I'm not quite sure how to handle early stopping, since one process could meet the early-stopping criterion while another may not. I was thinking of something like this:
for epoch in range(num_epochs):
    for batch in train_dataloader:
        optimizer.zero_grad()
        outputs = my_model(**batch)
        loss = outputs['loss']
        my_accelerator.backward(loss)
        optimizer.step()
    metric = my_eval(my_model, dev_dataloader)  # evaluation on the dev set (i.e., holdout from training)
    if my_early_stop.step(metric):
        break  # early stopping criterion is met, we can stop now
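To keep the stopping decision consistent across processes, I was also wondering whether I could gather the metric with accelerator.gather, so that every process averages the same values and stops on the same epoch. Here is a rough sketch of what I mean (assuming my_eval returns a plain Python float computed on each process's shard of the dev set; my_eval and my_early_stop are just my own helpers):

import torch

for epoch in range(num_epochs):
    for batch in train_dataloader:
        optimizer.zero_grad()
        outputs = my_model(**batch)
        loss = outputs['loss']
        my_accelerator.backward(loss)
        optimizer.step()

    # each process evaluates its own shard of the dev set
    local_metric = my_eval(my_model, dev_dataloader)

    # gather the per-process metrics and average them, so every process
    # sees the same number and reaches the same stop/continue decision
    metric = my_accelerator.gather(
        torch.tensor([local_metric], device=my_accelerator.device)
    ).mean().item()

    if my_early_stop.step(metric):
        break  # all processes break on the same epoch, so nothing hangs

Would this be a reasonable way to keep the processes in sync, or does Accelerate already provide something for this?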