I have a custom model: an nn.Module that wraps a causal language model (CLM) as self.lm. I'm trying to train it with DeepSpeed ZeRO-3 on 8x A100 GPUs. I wrote a ds_config.json file and launch training from a shell script like this:
deepspeed finetune/run.py --num_gpus=8 --deepspeed ds_config.json ...
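My ds_config.json is a fairly standard ZeRO stage-3 setup, roughly like this (a simplified sketch, not the exact file; the "auto" values are filled in by the Trainer integration):

{
  "bf16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "train_batch_size": "auto"
}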
I also overrode the Trainer's _save method like this:
def _save(self, output_dir: Optional[str] = None):
    output_dir = output_dir if output_dir is not None else self.args.output_dir
    os.makedirs(output_dir, exist_ok=True)
    logger.info("Saving model checkpoint to %s", output_dir)

    # consolidate the full (unsharded) state dict from the ZeRO-3 engine
    state_dict = self.accelerator.get_state_dict(self.deepspeed)
    # save only the inner causal LM so it can be reloaded with from_pretrained()
    self.model.lm.save_pretrained(output_dir, state_dict=state_dict)
    self.tokenizer.save_pretrained(output_dir)

    # Good practice: save your training arguments together with the trained model
    torch.save(self.args, os.path.join(output_dir, "training_args.bin"))
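For context, the wrapper itself is just a thin nn.Module around the causal LM, roughly like this (class and argument names here are placeholders, not my exact code), which is why _save exports self.model.lm:

import torch.nn as nn
from transformers import AutoModelForCausalLM

class CustomLMWrapper(nn.Module):
    def __init__(self, model_name_or_path):
        super().__init__()
        # the underlying causal language model that _save() exports
        self.lm = AutoModelForCausalLM.from_pretrained(model_name_or_path)

    def forward(self, input_ids, attention_mask=None, labels=None, **kwargs):
        return self.lm(input_ids=input_ids, attention_mask=attention_mask,
                       labels=labels, **kwargs)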
But the checkpoint save never finishes, and I never get a pytorch_model.bin file. I suspect it is simply too slow to gather all of the model's parameters from the 8 GPUs onto one process when saving the checkpoint.
How can I solve this problem?