[Solved] Cannot restart training from deepspeed checkpoint

Hello :slight_smile:

I’m using DeepSpeed to train large models that do not fit on a single GPU, split across two GPUs.
I’m using the Hugging Face Trainer with DeepSpeed params like so:


trainer = Trainer(
    model=model,
    train_dataset=CustomDataset(...),
    eval_dataset=CustomDataset(...),
    tokenizer=tokenizer,
    compute_metrics=custom_metric,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
    args=TrainingArguments(
        output_dir=f"output_{backbone.replace('/', '_')}",
        do_train=True,
        do_eval=True,
        do_predict=True,
        evaluation_strategy="steps",
        per_device_train_batch_size=2,
        per_device_eval_batch_size=2,
        num_train_epochs=5,
        fp16=True,
        deepspeed={
            "fp16": {
                "enabled": True,
                "min_loss_scale": 1,
                "opt_level": "O3",
            },
            "wandb": {
                "enabled": True,
                "team": "my_team",
                "project": "my_project"
            },
            "optimizer": {
                "type": "AdamW",
                "params": {
                    "lr": "auto",
                    "betas": "auto",
                    "eps": "auto",
                    "weight_decay": "auto",
                },
            },
            "scheduler": {
                "type": "WarmupLR",
                "params": {
                    "warmup_min_lr": "auto",
                    "warmup_max_lr": "auto",
                    "warmup_num_steps": "auto",
                },
            },
            "zero_optimization": {
                "stage": 3,
                "offload_optimizer": {"device": "cpu"},
                "offload_param": {"device": "cpu"},
                "allgather_partitions": True,
                "allgather_bucket_size": 5e8,
                "contiguous_gradients": True,
            },
            "train_batch_size": 4,
            "train_micro_batch_size_per_gpu": 2,
        },
        report_to="wandb",
    ),
)

The first training run went fine, and each checkpoint folder contains:

global_step29500
|--zero_pp_rank_0_mp_rank_00_model_states.pt
|--zero_pp_rank_0_mp_rank_00_optim_states.pt
special_tokens_map.json
trainer_state.json
zero_to_fp32.py
config.json
merges.txt
tokenizer.json
training_args.bin
latest
rng_state_0.pth
tokenizer_config.json
vocab.json
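
A quick way to sanity-check a folder like this one: the `latest` file is a small text file holding the name of the DeepSpeed tag subfolder (here `global_step29500`), and the ZeRO shards live inside that subfolder. The sketch below only reads the folder; the helper names `deepspeed_tag_dir` and `zero_shards` are made up for illustration and are not part of DeepSpeed or transformers.

```python
from pathlib import Path


def deepspeed_tag_dir(checkpoint_dir):
    """Return the subfolder named by the 'latest' file (e.g. global_step29500)."""
    ckpt = Path(checkpoint_dir)
    tag = (ckpt / "latest").read_text().strip()
    return ckpt / tag


def zero_shards(checkpoint_dir):
    """List the *_states.pt shard files found in the tag subfolder, sorted by name."""
    return sorted(p.name for p in deepspeed_tag_dir(checkpoint_dir).glob("*_states.pt"))
```

If `zero_shards(...)` comes back empty, the ZeRO shards are missing and there is nothing for zero_to_fp32.py to consolidate.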

However, when I pass resume_from_checkpoint to trainer.train(), it fails with:
raise ValueError(f"Can't find a valid checkpoint")

So I tried to use the zero_to_fp32.py script to create a latest.ckpt file, but without success.

The DeepSpeed documentation says something about “automatically loading weights”, but I’m sceptical.

Thanks in advance,
Have a great day.

For posterity:

After reading the Hugging Face code that loads models, I found that you have to run zero_to_fp32.py, and the output file must be named exactly "pytorch_model.bin" and sit in the checkpoint folder (a silly mistake on my side… :sweat:).

It still loads even without the pytorch_model.bin.index.json file.
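
For anyone who wants to script the fix, here is a minimal sketch. It assumes the copy of zero_to_fp32.py saved in the checkpoint folder accepts `<checkpoint_dir> <output_file>` as positional arguments, which was the case for the version I had; newer DeepSpeed releases may expect an output directory instead, so check `python zero_to_fp32.py --help` first. The helper names (`conversion_command`, `ensure_fp32_weights`) and the checkpoint path in the usage comment are hypothetical.

```python
import subprocess
import sys
from pathlib import Path

# File name the Trainer looked for in my case (see the solution above).
WEIGHTS_NAME = "pytorch_model.bin"


def conversion_command(checkpoint_dir):
    """Build the command that consolidates ZeRO shards into one fp32 file.

    Assumption: the script takes <checkpoint_dir> <output_file> positionally.
    """
    ckpt = Path(checkpoint_dir)
    return [
        sys.executable,
        str(ckpt / "zero_to_fp32.py"),
        str(ckpt),
        str(ckpt / WEIGHTS_NAME),
    ]


def ensure_fp32_weights(checkpoint_dir, run=subprocess.run):
    """Run the conversion only if pytorch_model.bin is not there yet."""
    target = Path(checkpoint_dir) / WEIGHTS_NAME
    if not target.exists():
        run(conversion_command(checkpoint_dir), check=True)
    return target


# Usage (hypothetical checkpoint path):
#   ensure_fp32_weights("output_dir/checkpoint-29500")
#   trainer.train(resume_from_checkpoint="output_dir/checkpoint-29500")
```

The `run` parameter is only there so the subprocess call can be swapped out for a dry run.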

Hope this will help someone someday :slight_smile: