Saving checkpoints is too slow with DeepSpeed

My custom model is an nn.Module that wraps self.lm (a large language model trained with a causal LM objective). I'm trying to train it with DeepSpeed ZeRO-3 on 8x A100 GPUs. I made a ds_config.json file and ran a shell script like this:
deepspeed finetune/run.py --num_gpus=8 --deepspeed ds_config.json ...
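
For reference, the ZeRO-3 part of ds_config.json is roughly like this (illustrative values, not my exact file). As far as I understand, stage3_gather_16bit_weights_on_model_save is the setting that makes DeepSpeed gather the full 16-bit weights when the model is saved:

    {
      "bf16": { "enabled": "auto" },
      "zero_optimization": {
        "stage": 3,
        "overlap_comm": true,
        "stage3_gather_16bit_weights_on_model_save": true
      },
      "gradient_accumulation_steps": "auto",
      "train_micro_batch_size_per_gpu": "auto",
      "train_batch_size": "auto"
    }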

I also overrode the Trainer's "_save" method like this.

  def _save(self, output_dir: Optional[str] = None):
      output_dir = output_dir if output_dir is not None else self.args.output_dir
      os.makedirs(output_dir, exist_ok=True)
      logger.info("Saving model checkpoint to %s", output_dir)

      # Consolidate the ZeRO-3 partitioned parameters into one full state dict,
      # then save only the inner language model with it.
      state_dict = self.accelerator.get_state_dict(self.deepspeed)
      self.model.lm.save_pretrained(output_dir, state_dict=state_dict)

      self.tokenizer.save_pretrained(output_dir)

      # Good practice: save your training arguments together with the trained model
      torch.save(self.args, os.path.join(output_dir, "training_args.bin"))

But saving the checkpoint never finishes, and I never get a pytorch_model.bin file. I think gathering all of the model's parameters from the 8 GPUs makes saving the checkpoint too slow.
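
The only alternative I'm aware of is to skip the gather at save time and let DeepSpeed write its sharded ZeRO checkpoint, then merge the shards offline. Roughly like this (just a sketch I haven't verified on my setup; self.deepspeed and output_dir are the same objects as in the _save above):

    # Collective call: every rank writes its own shard, so all ranks must reach it.
    self.deepspeed.save_checkpoint(output_dir)

    # DeepSpeed also drops a zero_to_fp32.py script into the checkpoint directory;
    # run it offline to merge the shards into a single fp32 state dict, e.g.
    #   python zero_to_fp32.py . pytorch_model.bin
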
How can I solve this problem?

Hi, I have the same problem saving codet5-6b with ZeRO-3. The logs say pytorch_model.bin was saved, but it isn't there and the process hangs. Did you find a solution?