### System Info
- `transformers` version: 4.28.0.dev0
- Platform: Linux-4.18.0-372.32.1.el8_6.x86_64-x86_64-with-glibc2.17
- Python version: 3.8.16
- Huggingface_hub version: 0.13.3
- Safetensors version: not installed
- PyTorch version (GPU?): 1.12.1+cu116 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: yes
### Who can help?
@stas00 @sgugger
### Information
- [X] The official example scripts
- [X] My own modified scripts
### Tasks
- [x] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
### Reproduction
I'm using the Trainer with the DeepSpeed integration to fine-tune a LLaMA model.
This is the ZeRO stage 2 config I'm using:
```json
{
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "allgather_partitions": true,
        "allgather_bucket_size": 2e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 2e8,
        "contiguous_gradients": true
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto"
}
```
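The `"auto"` entries are filled in by the HF DeepSpeed integration from the `TrainingArguments`. For reference, this is the equivalent programmatic wiring, a minimal sketch with the values taken from the reproduction command below:

```python
from transformers import TrainingArguments

# Equivalent programmatic setup (output_dir is a placeholder); the "auto" entries
# in ds_config.json are resolved from these TrainingArguments by the integration.
training_args = TrainingArguments(
    output_dir="/tmp/test-plm",
    per_device_train_batch_size=2,
    num_train_epochs=0.01,
    deepspeed="ds_config.json",
)
```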
So I'm using ZeRO-2 with optimizer offload. I found that the model checkpoints saved after `trainer.train()` become much larger than they should be.
Using the official [run_clm.py](https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_clm.py) script as an example:
```bash
deepspeed --num_gpus=1 run_clm.py \
--num_train_epochs 0.01 \
--model_name_or_path decapoda-research/llama-7b-hf \
--dataset_name wikitext \
--dataset_config_name wikitext-2-raw-v1 \
--per_device_train_batch_size 2 \
--do_train \
--output_dir /tmp/test-plm \
--deepspeed ds_config.json
```
I added these two `save_model` calls around `trainer.train()` for testing:
```python
trainer.save_model("test1")
train_result = trainer.train(resume_from_checkpoint=checkpoint)
trainer.save_model("test2")
```
Now check the size:
```bash
du -sh test1
26G test1
du -sh test2
76G test2
```
Note: I deleted the `global_step*` folder in `test2` before measuring the size.
I believe 26G is the correct size for an fp32 LLaMA 7B checkpoint, so after training with the Trainer the saved model size is wrong. Interestingly, the oversized model still seems to load fine with `.from_pretrained`.
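As a quick sanity check on the numbers (the parameter count is approximate, and the factor-of-three reading is just my guess, not something I have confirmed):

```python
# Back-of-the-envelope check on the checkpoint sizes above (LLaMA-7B has ~6.7B params).
n_params = 6.7e9
fp32_weights = n_params * 4                   # one fp32 value per parameter
print(f"{fp32_weights / 2**30:.0f} GiB")      # ~25 GiB -> matches the 26G of test1
print(f"{3 * fp32_weights / 2**30:.0f} GiB")  # ~75 GiB -> close to the 76G of test2, i.e. roughly
                                              # weights plus two extra fp32 tensors per parameter
```

That ratio is what I would expect if something like the optimizer moments were being written into the weights file, but again, that is speculation.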
I have traced the issue to this [line](https://github.com/huggingface/transformers/blob/dacd34568d1a27b91f84610eab526640ed8f94e0/src/transformers/deepspeed.py#L378), which changes the model assignment in the Trainer's `_inner_training_loop` [here](https://github.com/huggingface/transformers/blob/dacd34568d1a27b91f84610eab526640ed8f94e0/src/transformers/trainer.py#L1733). After that point, the model saved by `trainer._save()` has the wrong size.
Does the DeepSpeed engine add extra state to `pytorch_model.bin`? Is this expected?
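To dig into that, here is a small diagnostic sketch I use to compare the two checkpoints; it just sums tensor bytes and dtypes across the `pytorch_model*.bin` shards (paths are the test dirs from above, and each shard is loaded fully into CPU RAM):

```python
import glob
import os
from collections import Counter

import torch


def summarize(ckpt_dir):
    """Report key count, dtype breakdown, and total tensor bytes of a saved checkpoint."""
    dtypes, total_bytes, n_keys = Counter(), 0, 0
    for shard in sorted(glob.glob(os.path.join(ckpt_dir, "pytorch_model*.bin"))):
        state_dict = torch.load(shard, map_location="cpu")
        for value in state_dict.values():
            n_keys += 1
            if torch.is_tensor(value):
                dtypes[str(value.dtype)] += 1
                total_bytes += value.numel() * value.element_size()
            else:
                dtypes[type(value).__name__] += 1
    print(f"{ckpt_dir}: {n_keys} keys, {total_bytes / 2**30:.1f} GiB of tensors, dtypes={dict(dtypes)}")


summarize("test1")  # before trainer.train()
summarize("test2")  # after trainer.train()
```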
My current workaround is to always use `self.deepspeed.save_16bit_model()` in [trainer.save_model()](https://github.com/huggingface/transformers/blob/dacd34568d1a27b91f84610eab526640ed8f94e0/src/transformers/trainer.py#L2771) for ZeRO stage 2 as well:
```python
elif self.deepspeed:
    # this takes care of everything as long as we aren't under zero3
    if self.args.should_save:
        self._save(output_dir)

    if is_deepspeed_zero3_enabled():
        # It's too complicated to try to override different places where the weights dump gets
        # saved, so since under zero3 the file is bogus, simply delete it. The user should
        # either use deepspeed checkpoint to resume or to recover full weights use
        # zero_to_fp32.py stored in the checkpoint.
        if self.args.should_save:
            file = os.path.join(output_dir, WEIGHTS_NAME)
            if os.path.isfile(file):
                # logger.info(f"deepspeed zero3: removing {file}, see zero_to_fp32.py to recover weights")
                os.remove(file)

        # now save the real model if stage3_gather_16bit_weights_on_model_save=True
        # if false it will not be saved.
        # This must be called on all ranks
        if not self.deepspeed.save_16bit_model(output_dir, WEIGHTS_NAME):
            logger.warning(
                "deepspeed.save_16bit_model didn't save the model, since"
                " stage3_gather_16bit_weights_on_model_save=false. Saving the full checkpoint instead, use"
                " zero_to_fp32.py to recover weights"
            )
            self.deepspeed.save_checkpoint(output_dir)
    else:
        if self.args.should_save:
            for filename in os.listdir(output_dir):
                full_filename = os.path.join(output_dir, filename)
                # If we have a shard file that is not going to be replaced, we delete it, but only
                # from the main process in distributed settings to avoid race conditions.
                weights_no_suffix = WEIGHTS_NAME.replace(".bin", "").replace(".safetensors", "")
                # delete everything that starts with weights_no_suffix, usually "pytorch_model"
                if filename.startswith(weights_no_suffix) and os.path.isfile(full_filename):
                    os.remove(full_filename)
        self.deepspeed.save_16bit_model(output_dir, WEIGHTS_NAME)
```
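For checkpoints that were already saved with the inflated size, round-tripping through `from_pretrained` / `save_pretrained` appears to work as a one-off cleanup. This is only a sketch based on the observation above that the oversized checkpoint still loads; I have not verified that nothing useful is dropped:

```python
from transformers import AutoModelForCausalLM

# Reload the oversized checkpoint (the extra content appears to be ignored at load time)
# and write back only the plain model weights. Needs enough CPU RAM for the fp32 model.
model = AutoModelForCausalLM.from_pretrained("test2")
model.save_pretrained("test2-clean")
```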
### Expected behavior
The model checkpoint size should be unchanged after `trainer.train()`.