Checkpoint missing optimizer.pt? How to resume?

I trained a model with HF and it has helped me a lot! My only problem is resuming the training. As you can see in the screenshot below, only my first checkpoint contains the data I expect. Is there a flag that turns off saving the full checkpoint contents (I'm asking only so I can make sure it is turned off!)? And can I still continue the training?

I'm using load_best_model_at_end, save_total_limit=3, and overwrite_output_dir.

I didn't change my code, I just updated to the latest HF version:
!pip install -q git+https://github.com/huggingface/transformers

Is there any way to resume from the last checkpoint? Maybe a flag like init_epoch?
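From what I understand, resuming should look roughly like this; I'm not sure it's the intended way, and the checkpoint lookup here is just a sketch:

from transformers.trainer_utils import get_last_checkpoint

# find the newest checkpoint folder in my output dir (returns None if there is none)
last_checkpoint = get_last_checkpoint("/share/datasets/output_run")

# trainer is the Trainer instance I already built for the first run
trainer.train(resume_from_checkpoint=last_checkpoint)

For reference, these are my full TrainingArguments: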

TrainingArguments(output_dir=/share/datasets/output_run, overwrite_output_dir=True, do_train=True, do_eval=True, do_predict=False, evaluation_strategy=IntervalStrategy.STEPS, prediction_loss_only=False, per_device_train_batch_size=20, per_device_eval_batch_size=16, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=0.0001, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=20.0, max_steps=-1, lr_scheduler_type=SchedulerType.LINEAR, warmup_ratio=0.0, warmup_steps=0, logging_dir=runs/May12_05-06-46_a600ce861ff7, logging_strategy=IntervalStrategy.STEPS, logging_first_step=False, logging_steps=1000, save_strategy=IntervalStrategy.STEPS, save_steps=1000, save_total_limit=3, no_cuda=False, seed=42, fp16=True, fp16_opt_level=O1, fp16_backend=auto, fp16_full_eval=False, local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=[], dataloader_drop_last=False, eval_steps=1000, dataloader_num_workers=2, past_index=-1, run_name=cv_sm_1, disable_tqdm=False, remove_unused_columns=True, label_names=None, load_best_model_at_end=True, metric_for_best_model=loss, greater_is_better=False, ignore_data_skip=False, sharded_ddp=[], deepspeed=None, label_smoothing_factor=0.0, adafactor=False, group_by_length=True, length_column_name=length, report_to=['wandb'], ddp_find_unused_parameters=None, dataloader_pin_memory=True, skip_memory_metrics=False, use_legacy_prediction_loop=False, push_to_hub=False, resume_from_checkpoint=None, _n_gpu=1, mp_parameters=)
!find / -name optimizer.pt

It only returned the optimizer.pt from the checkpoint shown in the screenshot.
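To double check, something like this should list what each checkpoint folder actually contains (just a quick sketch, the path is mine):

import os

output_dir = "/share/datasets/output_run"
for name in sorted(os.listdir(output_dir)):
    if name.startswith("checkpoint-"):
        folder = os.path.join(output_dir, name)
        # a complete checkpoint should contain optimizer.pt, scheduler.pt, pytorch_model.bin, ...
        print(name, sorted(os.listdir(folder)))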

@sgugger
Do you by any chance know what this could be?

The strange thing is, I resumed training from the last valid checkpoint and training generates checkpoints as expected for a while, but at some point the checkpoint folder only contains rng_state.pth, as shown in the screenshot.

I somehow can't find the edit button, so here are the training args prettified:

TrainingArguments(
    output_dir=/share/datasets/output_run, 
    overwrite_output_dir=True, 
    do_train=True, 
    do_eval=True, 
    do_predict=False, 
    evaluation_strategy=IntervalStrategy.STEPS, 
    prediction_loss_only=False, 
    per_device_train_batch_size=20, 
    per_device_eval_batch_size=16, 
    gradient_accumulation_steps=1, 
    eval_accumulation_steps=None, 
    learning_rate=0.0001, 
    weight_decay=0.0, 
    adam_beta1=0.9, 
    adam_beta2=0.999, 
    adam_epsilon=1e-08, 
    max_grad_norm=1.0, 
    num_train_epochs=20.0, 
    max_steps=-1, 
    lr_scheduler_type=SchedulerType.LINEAR, 
    warmup_ratio=0.0, 
    warmup_steps=0, 
    logging_dir=runs/May12_05-06-46_a600ce861ff7, 
    logging_strategy=IntervalStrategy.STEPS, 
    logging_first_step=False, 
    logging_steps=1000, 
    save_strategy=IntervalStrategy.STEPS, 
    save_steps=1000, 
    save_total_limit=3, 
    no_cuda=False, 
    seed=42, 
    fp16=True, 
    fp16_opt_level=O1, 
    fp16_backend=auto, 
    fp16_full_eval=False, 
    local_rank=-1, 
    tpu_num_cores=None, 
    tpu_metrics_debug=False, 
    debug=[], 
    dataloader_drop_last=False, 
    eval_steps=1000, 
    dataloader_num_workers=2, 
    past_index=-1, 
    run_name=cv_sm_1, 
    disable_tqdm=False, 
    remove_unused_columns=True, 
    label_names=None, 
    load_best_model_at_end=True, 
    metric_for_best_model=loss, 
    greater_is_better=False, 
    ignore_data_skip=False, 
    sharded_ddp=[], 
    deepspeed=None, 
    label_smoothing_factor=0.0, 
    adafactor=False, 
    group_by_length=True, 
    length_column_name=length, 
    report_to=['wandb'], 
    ddp_find_unused_parameters=None, 
    dataloader_pin_memory=True, 
    skip_memory_metrics=False, 
    use_legacy_prediction_loop=False, 
    push_to_hub=False, 
    resume_from_checkpoint=None, 
    _n_gpu=1, 
    mp_parameters=
)
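
And this is roughly how the arguments get passed on my side (model, collator and datasets are placeholders for my actual objects):

from transformers import Trainer

trainer = Trainer(
    model=my_model,              # placeholder for my model
    args=training_args,          # the TrainingArguments shown above
    data_collator=my_collator,   # placeholder
    train_dataset=train_ds,      # placeholder
    eval_dataset=eval_ds,        # placeholder
)
trainer.train()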

To debug the error, I changed the original code that stores the checkpoints and added some prints. I have marked the changes I made with comments and kept the original code in the comments.

def _save_checkpoint(self, model, trial, metrics=None):
        # In all cases, including ddp/dp/deepspeed, self.model is always a reference to the model we
        # want to save except FullyShardedDDP.
        # assert unwrap_model(model) is self.model, "internal model should be a reference to self.model"

        # Save model checkpoint
        checkpoint_folder = f"{PREFIX_CHECKPOINT_DIR}-{self.state.global_step}"

        if self.hp_search_backend is not None and trial is not None:
            if self.hp_search_backend == HPSearchBackend.OPTUNA:
                run_id = trial.number
            else:
                from ray import tune

                run_id = tune.get_trial_id()
            run_name = self.hp_name(trial) if self.hp_name is not None else f"run-{run_id}"
            run_dir = os.path.join(self.args.output_dir, run_name)
        else:
            run_dir = self.args.output_dir
            self.store_flos()

        output_dir = os.path.join(run_dir, checkpoint_folder)
        self.save_model(output_dir)
        if self.deepspeed:
            print(11, "CP")                            
            # under zero3 model file itself doesn't get saved since it's bogus! Unless deepspeed
            # config `stage3_gather_fp16_weights_on_model_save` is True
            self.deepspeed.save_checkpoint(output_dir)

        # Save optimizer and scheduler
        if self.sharded_ddp == ShardedDDPOption.SIMPLE:
            print(12, "CP")               #<- new                           
            self.optimizer.consolidate_state_dict()

        if is_torch_tpu_available():
            xm.rendezvous("saving_optimizer_states")
            xm.save(self.optimizer.state_dict(), os.path.join(output_dir, "optimizer.pt"))
            with warnings.catch_warnings(record=True) as caught_warnings:
                xm.save(self.lr_scheduler.state_dict(), os.path.join(output_dir, "scheduler.pt"))
                reissue_pt_warnings(caught_warnings)
        elif is_sagemaker_mp_enabled():
            if smp.dp_rank() == 0:
                # Consolidate the state dict on all processes of dp_rank 0
                opt_state_dict = self.optimizer.state_dict()
                # Save it and the scheduler on the main process
                if self.is_world_process_zero():
                    torch.save(opt_state_dict, os.path.join(output_dir, "optimizer.pt"))
                    with warnings.catch_warnings(record=True) as caught_warnings:
                        torch.save(self.lr_scheduler.state_dict(), os.path.join(output_dir, "scheduler.pt"))
                    reissue_pt_warnings(caught_warnings)
                    if self.use_amp:
                        torch.save(self.scaler.state_dict(), os.path.join(output_dir, "scaler.pt"))
        print(f"is_world_process_zero: {self.is_world_process_zero()} | self.deepspeed: {self.deepspeed}")  #<- new

        if True:  #<- new: old: elif self.is_world_process_zero() and not self.deepspeed:
            print(12, "CP")                                                        
            # deepspeed.save_checkpoint above saves model/optim/sched
            torch.save(self.optimizer.state_dict(), os.path.join(output_dir, "optimizer.pt"))
            with warnings.catch_warnings(record=True) as caught_warnings:
                torch.save(self.lr_scheduler.state_dict(), os.path.join(output_dir, "scheduler.pt"))
            reissue_pt_warnings(caught_warnings)
            if self.use_amp:
                torch.save(self.scaler.state_dict(), os.path.join(output_dir, "scaler.pt"))

        # Determine the new best metric / best model checkpoint
        if metrics is not None and self.args.metric_for_best_model is not None:
            metric_to_check = self.args.metric_for_best_model
            if not metric_to_check.startswith("eval_"):
                metric_to_check = f"eval_{metric_to_check}"
            metric_value = metrics[metric_to_check]

            operator = np.greater if self.args.greater_is_better else np.less
            if (
                self.state.best_metric is None
                or self.state.best_model_checkpoint is None
                or operator(metric_value, self.state.best_metric)
            ):
                self.state.best_metric = metric_value
                self.state.best_model_checkpoint = output_dir

        # Save the Trainer state
        if self.is_world_process_zero():
            self.state.save_to_json(os.path.join(output_dir, "trainer_state.json"))

        # Maybe delete some older checkpoints.
        if self.is_world_process_zero():
            self._rotate_checkpoints(use_mtime=True, output_dir=run_dir)

        # Save RNG state in non-distributed training
        rng_states = {
            "python": random.getstate(),
            "numpy": np.random.get_state(),
            "cpu": torch.random.get_rng_state(),
        }
        if torch.cuda.is_available():
            if self.args.local_rank == -1:
                # In non distributed, we save the global CUDA RNG state (will take care of DataParallel)
                rng_states["cuda"] = torch.cuda.random.get_rng_state_all()
            else:
                rng_states["cuda"] = torch.cuda.random.get_rng_state()

        if is_torch_tpu_available():
            rng_states["xla"] = xm.get_rng_state()

        # A process can arrive here before the process 0 has a chance to save the model, in which case output_dir may
        # not yet exist.
        os.makedirs(output_dir, exist_ok=True)
        local_rank = xm.get_local_ordinal() if is_torch_tpu_available() else self.args.local_rank
        if local_rank == -1:
            torch.save(rng_states, os.path.join(output_dir, "rng_state.pth"))
        else:
            torch.save(rng_states, os.path.join(output_dir, f"rng_state_{local_rank}.pth"))
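
On top of these prints, a last line at the end of _save_checkpoint could help tell apart "optimizer.pt never got written" from "it got written and deleted afterwards" (just a debugging sketch, not part of the original code):

        # <- new (sketch): list what is actually on disk right after this save finishes
        print(f"files in {output_dir}: {sorted(os.listdir(output_dir))}")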

To debug further based on this change, I resumed training from the checkpoint at step 10k. The following is the log of the evaluation (and checkpoint saving) at steps 11k and 12k: at 11k everything was saved correctly, at 12k it wasn't.

***** Running Evaluation *****
  Num examples = 3000
  Batch size = 16
                                                                                
                                                                                {'eval_loss': 0.15945090353488922, 'eval_wer': 0.23429500203169443, 'eval_runtime': 233.1867, 'eval_samples_per_second': 12.865, 'epoch': 8.15}
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 188/188 [03:50<00:00,  1.04s/it]
ISWPZ 3 True
Saving model checkpoint to /share/datasets/output_run/checkpoint-11000
Configuration saved in /share/datasets/output_run/checkpoint-11000/config.json
Model weights saved in /share/datasets/output_run/checkpoint-11000/pytorch_model.bin
Configuration saved in /share/datasets/output_run/checkpoint-11000/preprocessor_config.json
ISWPZ 3 True
is_world_process_zero: True | self.deepspeed: None
12 CP
ISWPZ 3 True
ISWPZ 3 True
Deleting older checkpoint [/share/datasets/output_run/checkpoint-10000] due to args.save_total_limit
41%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹                  | 11001/27000 [1:08:02<409:38:17, 92.17s/it]
...
 44%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹                  | 12000/27000 [2:05:16<6:13:01,  1.49s/it]
{'loss': 0.1113, 'learning_rate': 5.5566666666666664e-05, 'epoch': 8.89}


 44%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹                  | 12000/27000 [2:05:16<6:13:01,  1.49s/it]
***** Running Evaluation *****
  Num examples = 3000
  Batch size = 16
                                                                                
{'eval_loss': 0.15602770447731018, 'eval_wer': 0.22799674928890695, 'eval_runtime': 233.1932, 'eval_samples_per_second': 12.865, 'epoch': 8.89}
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 188/188 [03:50<00:00,  1.05s/it]
ISWPZ 3 True
Saving model checkpoint to /share/datasets/output_run/checkpoint-12000
Configuration saved in /share/datasets/output_run/checkpoint-12000/config.json
Model weights saved in /share/datasets/output_run/checkpoint-12000/pytorch_model.bin
Configuration saved in /share/datasets/output_run/checkpoint-12000/preprocessor_config.json
ISWPZ 3 True
is_world_process_zero: True | self.deepspeed: None
12 CP
ISWPZ 3 True
ISWPZ 3 True
Deleting older checkpoint [/share/datasets/output_run/checkpoint-12000] due to args.save_total_limit

The log output of both steps looks almost identical, but 11k got saved correctly and 12k didn't. The only difference I can spot is that at 12k the rotation deleted checkpoint-12000 itself instead of an older checkpoint.

[screenshot of the checkpoint folder contents]

I think your problem is due to a bug in the current checkpoint saving: when you have load_best_model_at_end set to True, it accidentally deletes the newest checkpoint if the best checkpoint is before it, instead of deleting the oldest one (this only happens because you have save_total_limit > 0).
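
In the meantime, a possible workaround (untested sketch) is to avoid the combination that triggers it for the resumed run, since both flags are needed for the bug:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="/share/datasets/output_run",
    save_total_limit=None,           # disable checkpoint rotation...
    load_best_model_at_end=False,    # ...or keep the limit but turn this off
    # ... all other arguments unchanged
)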

Will try to fix this today.

Ty @sgugger! It would be nice if you could reply to this post / mention me, so that I can see whether you were able to fix it :slight_smile: Thanks for your work!

Here is the PR with the fix.

At first glance, it looks like all the runs created their checkpoints overnight as expected. Thanks for the PR! :slight_smile:
