### System Info
- `transformers` version: 4.28.0.dev0
- Platform: Linux-4.18.0-372.32.1.el8_6.x86_64-x86_64-with-glibc2.17
- Python version: 3.8.16
- Huggingface_hub version: 0.13.3
- Safetensors version: not installed
- PyTorch version (GPU?): 1.12.1+cu116 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: yes
### Who can help?
@stas00 @sgugger
### Information
- [X] The official example scripts
- [X] My own modified scripts
### Tasks
- [x] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
### Reproduction
I'm using the Trainer with the DeepSpeed integration to fine-tune a LLaMA model.
This is the ZeRO stage 2 config I'm using:
```json
{
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "allgather_partitions": true,
        "allgather_bucket_size": 2e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 2e8,
        "contiguous_gradients": true
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto"
}
```
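The `"auto"` entries are filled in by the HF DeepSpeed integration from the `TrainingArguments`. For reference, this is the equivalent programmatic wiring, a minimal sketch with the values taken from the reproduction command below:

```python
from transformers import TrainingArguments

# Equivalent programmatic setup (output_dir is a placeholder); the "auto" entries
# in ds_config.json are resolved from these TrainingArguments by the integration.
training_args = TrainingArguments(
    output_dir="/tmp/test-plm",
    per_device_train_batch_size=2,
    num_train_epochs=0.01,
    deepspeed="ds_config.json",
)
```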
So I'm using ZeRO-2 with optimizer offload. I found that the model checkpoints saved after `trainer.train()` become much larger than they should be.
Using the official [run_clm.py](https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_clm.py) script as an example:
```bash
deepspeed --num_gpus=1 run_clm.py \
--num_train_epochs 0.01 \
--model_name_or_path decapoda-research/llama-7b-hf \
--dataset_name wikitext \
--dataset_config_name wikitext-2-raw-v1 \
--per_device_train_batch_size 2 \
--do_train \
--output_dir /tmp/test-plm \
--deepspeed ds_config.json
```
I added these two `save_model` calls around `trainer.train()` for testing:
```python
trainer.save_model("test1")
train_result = trainer.train(resume_from_checkpoint=checkpoint)
trainer.save_model("test2")
```
Now check the size:
```bash
du -sh test1
26G test1
du -sh test2
76G test2
```
Note: I deleted the `global_step*` folder in `test2` before measuring the size.
I believe 26G is the correct size for an fp32 LLaMA 7B checkpoint, so after training with the Trainer the saved model size is wrong. Interestingly, the oversized model still seems to load fine with `.from_pretrained`.
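As a quick sanity check on the numbers (the parameter count is approximate, and the factor-of-three reading is just my guess, not something I have confirmed):

```python
# Back-of-the-envelope check on the checkpoint sizes above (LLaMA-7B has ~6.7B params).
n_params = 6.7e9
fp32_weights = n_params * 4                   # one fp32 value per parameter
print(f"{fp32_weights / 2**30:.0f} GiB")      # ~25 GiB -> matches the 26G of test1
print(f"{3 * fp32_weights / 2**30:.0f} GiB")  # ~75 GiB -> close to the 76G of test2, i.e. roughly
                                              # weights plus two extra fp32 tensors per parameter
```

That ratio is what I would expect if something like the optimizer moments were being written into the weights file, but again, that is speculation.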
I have traced the issue to this [line](https://github.com/huggingface/transformers/blob/dacd34568d1a27b91f84610eab526640ed8f94e0/src/transformers/deepspeed.py#L378), which changes the model assignment in the Trainer's `_inner_training_loop` [here](https://github.com/huggingface/transformers/blob/dacd34568d1a27b91f84610eab526640ed8f94e0/src/transformers/trainer.py#L1733). After that point, the model saved by `trainer._save()` has the wrong size.
Does the DeepSpeed engine add extra state to `pytorch_model.bin`? Is this expected?
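To dig into that, here is a small diagnostic sketch I use to compare the two checkpoints; it just sums tensor bytes and dtypes across the `pytorch_model*.bin` shards (paths are the test dirs from above, and each shard is loaded fully into CPU RAM):

```python
import glob
import os
from collections import Counter

import torch


def summarize(ckpt_dir):
    """Report key count, dtype breakdown, and total tensor bytes of a saved checkpoint."""
    dtypes, total_bytes, n_keys = Counter(), 0, 0
    for shard in sorted(glob.glob(os.path.join(ckpt_dir, "pytorch_model*.bin"))):
        state_dict = torch.load(shard, map_location="cpu")
        for value in state_dict.values():
            n_keys += 1
            if torch.is_tensor(value):
                dtypes[str(value.dtype)] += 1
                total_bytes += value.numel() * value.element_size()
            else:
                dtypes[type(value).__name__] += 1
    print(f"{ckpt_dir}: {n_keys} keys, {total_bytes / 2**30:.1f} GiB of tensors, dtypes={dict(dtypes)}")


summarize("test1")  # before trainer.train()
summarize("test2")  # after trainer.train()
```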
My current workaround is to always use `self.deepspeed.save_16bit_model()` in [trainer.save_model()](https://github.com/huggingface/transformers/blob/dacd34568d1a27b91f84610eab526640ed8f94e0/src/transformers/trainer.py#L2771) for ZeRO stage 2 as well:
```python
elif self.deepspeed:
    # this takes care of everything as long as we aren't under zero3
    if self.args.should_save:
        self._save(output_dir)

    if is_deepspeed_zero3_enabled():
        # It's too complicated to try to override different places where the weights dump gets
        # saved, so since under zero3 the file is bogus, simply delete it. The user should
        # either use deepspeed checkpoint to resume or to recover full weights use
        # zero_to_fp32.py stored in the checkpoint.
        if self.args.should_save:
            file = os.path.join(output_dir, WEIGHTS_NAME)
            if os.path.isfile(file):
                # logger.info(f"deepspeed zero3: removing {file}, see zero_to_fp32.py to recover weights")
                os.remove(file)

        # now save the real model if stage3_gather_16bit_weights_on_model_save=True
        # if false it will not be saved.
        # This must be called on all ranks
        if not self.deepspeed.save_16bit_model(output_dir, WEIGHTS_NAME):
            logger.warning(
                "deepspeed.save_16bit_model didn't save the model, since"
                " stage3_gather_16bit_weights_on_model_save=false. Saving the full checkpoint instead, use"
                " zero_to_fp32.py to recover weights"
            )
            self.deepspeed.save_checkpoint(output_dir)
    else:
        if self.args.should_save:
            for filename in os.listdir(output_dir):
                full_filename = os.path.join(output_dir, filename)
                # If we have a shard file that is not going to be replaced, we delete it, but only
                # from the main process in distributed settings to avoid race conditions.
                weights_no_suffix = WEIGHTS_NAME.replace(".bin", "").replace(".safetensors", "")
                # delete everything that starts with weights_no_suffix, usually "pytorch_model"
                if filename.startswith(weights_no_suffix) and os.path.isfile(full_filename):
                    os.remove(full_filename)
        self.deepspeed.save_16bit_model(output_dir, WEIGHTS_NAME)
```
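For checkpoints that were already saved with the inflated size, round-tripping through `from_pretrained` / `save_pretrained` appears to work as a one-off cleanup. This is only a sketch based on the observation above that the oversized checkpoint still loads; I have not verified that nothing useful is dropped:

```python
from transformers import AutoModelForCausalLM

# Reload the oversized checkpoint (the extra content appears to be ignored at load time)
# and write back only the plain model weights. Needs enough CPU RAM for the fp32 model.
model = AutoModelForCausalLM.from_pretrained("test2")
model.save_pretrained("test2-clean")
```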
### Expected behavior
The model checkpoint size should be unchanged after `trainer.train()`.