Saving checkpoints is too slow with DeepSpeed

My custom model is an nn.Module that wraps self.lm (a large language model trained with a causal LM objective). I'm trying to train it with DeepSpeed ZeRO-3 on 8x A100 GPUs. I made a ds_config.json file and ran a shell script like this:
deepspeed finetune/run.py --num_gpus=8 --deepspeed ds_config.json ...
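
For reference, the ZeRO-3 part of ds_config.json is roughly like this (illustrative values, not my exact file). As far as I understand, stage3_gather_16bit_weights_on_model_save is the setting that makes DeepSpeed gather the full 16-bit weights when the model is saved:

    {
      "bf16": { "enabled": "auto" },
      "zero_optimization": {
        "stage": 3,
        "overlap_comm": true,
        "stage3_gather_16bit_weights_on_model_save": true
      },
      "gradient_accumulation_steps": "auto",
      "train_micro_batch_size_per_gpu": "auto",
      "train_batch_size": "auto"
    }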

I also overrode the Trainer's "_save" method like this.

  def _save(self, output_dir: Optional[str] = None):
      output_dir = output_dir if output_dir is not None else self.args.output_dir
      os.makedirs(output_dir, exist_ok=True)
      logger.info("Saving model checkpoint to %s", output_dir)

      # Consolidate the ZeRO-3 partitioned parameters into one full state dict,
      # then save only the inner language model with it.
      state_dict = self.accelerator.get_state_dict(self.deepspeed)
      self.model.lm.save_pretrained(output_dir, state_dict=state_dict)

      self.tokenizer.save_pretrained(output_dir)

      # Good practice: save your training arguments together with the trained model
      torch.save(self.args, os.path.join(output_dir, "training_args.bin"))

But saving the checkpoint never finishes, and I never get a pytorch_model.bin file. I think gathering all of the model's parameters from the 8 GPUs makes saving the checkpoint too slow.
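
The only alternative I'm aware of is to skip the gather at save time and let DeepSpeed write its sharded ZeRO checkpoint, then merge the shards offline. Roughly like this (just a sketch I haven't verified on my setup; self.deepspeed and output_dir are the same objects as in the _save above):

    # Collective call: every rank writes its own shard, so all ranks must reach it.
    self.deepspeed.save_checkpoint(output_dir)

    # DeepSpeed also drops a zero_to_fp32.py script into the checkpoint directory;
    # run it offline to merge the shards into a single fp32 state dict, e.g.
    #   python zero_to_fp32.py . pytorch_model.bin
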
How can I solve this problem?

Hi, I have the same problem saving codet5-6b with ZeRO-3. The logs say pytorch_model.bin was saved, but it isn't there and the process hangs. Did you find a solution?