I rewrote the Trainer's save_model method as shown below:
import os

def save_model(self, output_dir=None, _internal_call=False):
    # fall back to the configured output dir when called with output_dir=None
    output_dir = output_dir if output_dir is not None else self.args.output_dir
    os.makedirs(output_dir, exist_ok=True)
    self.model.save_pretrained(output_dir)
It seems save_pretrained has a default max_shard_size="10GB", so I expected two .bin files, each smaller than 10 GB. However, I get a single 14 GB pytorch_model.bin. Why?
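For reference, this is the sharding behavior I expected from a plain save_pretrained call (a minimal sketch; the gpt2 checkpoint and the 500MB threshold are placeholders for illustration):

from transformers import AutoModelForCausalLM

# placeholder model; any PreTrainedModel shards the same way
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Once the serialized weights exceed max_shard_size (default "10GB"),
# save_pretrained writes pytorch_model-00001-of-0000N.bin shards plus a
# pytorch_model.bin.index.json instead of a single pytorch_model.bin.
model.save_pretrained("out_sharded", max_shard_size="500MB")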
Also, I found that if I don't override save_model, it behaves normally and executes this code at trainer.py line 2784:
elif self.is_deepspeed_enabled:
    # this takes care of everything as long as we aren't under zero3
    if version.parse(accelerate_version) <= version.parse("0.20.3"):
        raise ValueError("Install Accelerate from main branch")
    try:
        state_dict = self.accelerator.get_state_dict(self.deepspeed)
        if self.args.should_save:
            self._save(output_dir, state_dict=state_dict)
    except ValueError:
        logger.warning(
            " stage3_gather_16bit_weights_on_model_save=false. Saving the full checkpoint instead, use"
            " zero_to_fp32.py to recover weights"
        )
        self._save(output_dir, state_dict={})
        # remove the dummy state_dict
        remove_dummy_checkpoint(self.args.should_save, output_dir, [WEIGHTS_NAME, SAFE_WEIGHTS_NAME])
        self.model_wrapped.save_checkpoint(output_dir)
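Side note: in that except branch, the ZeRO-partitioned checkpoint is written by save_checkpoint, and the warning says to use zero_to_fp32.py to recover the weights offline. As far as I understand, that recovery looks something like this sketch (the path is a placeholder, and model stands for the corresponding un-wrapped model):

from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

# consolidate the ZeRO shards written by model_wrapped.save_checkpoint()
# back into a single fp32 state dict on CPU
state_dict = get_fp32_state_dict_from_zero_checkpoint("path/to/output_dir")
model.load_state_dict(state_dict)  # `model` is the un-wrapped model instance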
Finally, in that code, what's the difference between self.accelerator.get_state_dict(self.deepspeed) and self.model.state_dict()?
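To make the comparison concrete, this is the kind of probe I would run inside the training script (a sketch; trainer stands for the Trainer instance in a DeepSpeed ZeRO-3 run, and the comments reflect my understanding of ZeRO-3 partitioning):

# Under ZeRO-3 each rank only holds a shard of every parameter, so the
# tensors returned here can be empty placeholders (shape torch.Size([0])).
plain_sd = trainer.model.state_dict()

# get_state_dict() instead consolidates the partitioned parameters and
# returns the full tensors, gathered onto the main process.
full_sd = trainer.accelerator.get_state_dict(trainer.deepspeed)

if trainer.args.should_save:  # only the main process holds the full dict
    key = next(iter(plain_sd))
    print(key, plain_sd[key].shape)  # placeholder shape under ZeRO-3
    print(key, full_sd[key].shape)   # real, fully gathered shape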