I rewrote the Trainer's save_model method as shown below:
import os

def save_model(self, output_dir=None, _internal_call=False):
    # fall back to the configured output dir when called with output_dir=None
    output_dir = output_dir if output_dir is not None else self.args.output_dir
    os.makedirs(output_dir, exist_ok=True)
    self.model.save_pretrained(output_dir)
It seems save_pretrained has a default max_shard_size="10GB", so I expected two .bin files, each smaller than 10 GB. However, I get a single 14 GB pytorch_model.bin. Why?
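For reference, this is the sharding behavior I expected from a plain save_pretrained call (a minimal sketch; the gpt2 checkpoint and the 500MB threshold are placeholders for illustration):

from transformers import AutoModelForCausalLM

# placeholder model; any PreTrainedModel shards the same way
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Once the serialized weights exceed max_shard_size (default "10GB"),
# save_pretrained writes pytorch_model-00001-of-0000N.bin shards plus a
# pytorch_model.bin.index.json instead of a single pytorch_model.bin.
model.save_pretrained("out_sharded", max_shard_size="500MB")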
Also, I found that if I don't override save_model, it behaves normally and executes this code at trainer.py line 2784:
elif self.is_deepspeed_enabled:
    # this takes care of everything as long as we aren't under zero3
    if version.parse(accelerate_version) <= version.parse("0.20.3"):
        raise ValueError("Install Accelerate from main branch")
    try:
        state_dict = self.accelerator.get_state_dict(self.deepspeed)
        if self.args.should_save:
            self._save(output_dir, state_dict=state_dict)
    except ValueError:
        logger.warning(
            " stage3_gather_16bit_weights_on_model_save=false. Saving the full checkpoint instead, use"
            " zero_to_fp32.py to recover weights"
        )
        self._save(output_dir, state_dict={})
        # remove the dummy state_dict
        remove_dummy_checkpoint(self.args.should_save, output_dir, [WEIGHTS_NAME, SAFE_WEIGHTS_NAME])
        self.model_wrapped.save_checkpoint(output_dir)
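Side note: in that except branch, the ZeRO-partitioned checkpoint is written by save_checkpoint, and the warning says to use zero_to_fp32.py to recover the weights offline. As far as I understand, that recovery looks something like this sketch (the path is a placeholder, and model stands for the corresponding un-wrapped model):

from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

# consolidate the ZeRO shards written by model_wrapped.save_checkpoint()
# back into a single fp32 state dict on CPU
state_dict = get_fp32_state_dict_from_zero_checkpoint("path/to/output_dir")
model.load_state_dict(state_dict)  # `model` is the un-wrapped model instance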
Finally, in that code, what's the difference between self.accelerator.get_state_dict(self.deepspeed) and self.model.state_dict()?
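To make the comparison concrete, this is the kind of probe I would run inside the training script (a sketch; trainer stands for the Trainer instance in a DeepSpeed ZeRO-3 run, and the comments reflect my understanding of ZeRO-3 partitioning):

# Under ZeRO-3 each rank only holds a shard of every parameter, so the
# tensors returned here can be empty placeholders (shape torch.Size([0])).
plain_sd = trainer.model.state_dict()

# get_state_dict() instead consolidates the partitioned parameters and
# returns the full tensors, gathered onto the main process.
full_sd = trainer.accelerator.get_state_dict(trainer.deepspeed)

if trainer.args.should_save:  # only the main process holds the full dict
    key = next(iter(plain_sd))
    print(key, plain_sd[key].shape)  # placeholder shape under ZeRO-3
    print(key, full_sd[key].shape)   # real, fully gathered shape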