I am trying to run multiple rounds of fine-tuning on llama2-7b-chat. I completed the first round of fine-tuning and got adapter weights (let’s say I named the model name/llama-2-7b-chat-guanaco). I’m getting confused with the second round of fine-tuning, in which I want to fine-tune on top of the already fine-tuned model. I assumed that I’d be able to load the fine-tuned model like so and run the DPO pipeline:
from transformers import AutoModelForCausalLM
from peft import PeftModel

# quant_config is my quantization config, defined elsewhere in the script
model = AutoModelForCausalLM.from_pretrained(
    "name/llama-2-7b-chat-guanaco",
    quantization_config=quant_config,
    device_map={"": 0},
)
model = PeftModel.from_pretrained(model, "name/llama-2-7b-chat-guanaco")
I trained and was able to push this new model to the Hugging Face Hub (let’s call it name/llama-2-7b-chat-guanaco-dpo). Oddly, the safetensors file is twice as large as the first pass, so I would appreciate some clarity on whether this is expected. But now I’m trying to serve that model for inference. Do I need to load the adapters in order so they are applied in the correct sequence? Something like this:
from transformers import AutoModelForCausalLM
from peft import PeftConfig, PeftModel

# (inside my inference class's __init__)
# Path to the first fine-tuned model (SFT adapters)
first_finetune_path = "name/llama-2-7b-chat-guanaco"
config = PeftConfig.from_pretrained(first_finetune_path)

# Load the original base model the first adapter was trained on
base_model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path,
    load_in_8bit=True,
    device_map="auto",
)

# Apply the first fine-tuned adapters (SFT)
self.model = PeftModel.from_pretrained(base_model, first_finetune_path)

# Path to the second fine-tuned model (DPO adapters)
second_finetune_path = "name/llama-2-7b-chat-guanaco-dpo"

# Load the second fine-tuned adapters (DPO) on top
self.model = PeftModel.from_pretrained(self.model, second_finetune_path)
Does it need to be loaded in this sequence, or does the order not matter? I would appreciate any guidance on how to think about adapters. Thanks!
Oddly, the safetensors file is twice as large as the first pass
Was the model saved in fp32 precision? If so, the file would be roughly twice the size of an fp16 checkpoint, since each parameter takes 4 bytes instead of 2.
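If you want to check, you can inspect the dtypes stored in the adapter checkpoint directly. This is just a minimal sketch; it assumes the adapter weights were pushed as adapter_model.safetensors in the name/llama-2-7b-chat-guanaco-dpo repo from your question.

from huggingface_hub import hf_hub_download
from safetensors import safe_open

# Download only the adapter weights file and print each tensor's dtype
path = hf_hub_download("name/llama-2-7b-chat-guanaco-dpo", "adapter_model.safetensors")
with safe_open(path, framework="pt") as f:
    for key in f.keys():
        print(key, f.get_tensor(key).dtype)  # float32 = 4 bytes/param, float16 = 2 bytes/param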
Does it need to be loaded in this sequence, or does the order not matter? I would appreciate any guidance on how to think about adapters.
I can’t say for certain that it’s unrelated to your issue, but in general, changing the order in which you apply adapters rarely causes problems.
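For example, you can load both adapters onto the same base model with explicit names and control which one is active. This is only a sketch: the base model id is whatever config.base_model_name_or_path points to, and the adapter names "sft" and "dpo" are placeholders of mine; the repo names are the ones from your question.

from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the original base model (unquantized here, for simplicity)
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf", device_map="auto")

# First adapter (SFT), given an explicit name
model = PeftModel.from_pretrained(base, "name/llama-2-7b-chat-guanaco", adapter_name="sft")

# Second adapter (DPO), loaded alongside the first
model.load_adapter("name/llama-2-7b-chat-guanaco-dpo", adapter_name="dpo")

# Only the active adapter is used in the forward pass
model.set_adapter("dpo")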
Whether you call it an adapter or a LoRA, that part is handled by the PEFT library in the Hugging Face ecosystem, so reading the PEFT documentation will give you the full picture.
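The basic pattern PEFT uses is to wrap a base model with an adapter config; saving, loading, and stacking adapters all build on that. A minimal sketch, with hyperparameters chosen only for illustration:

import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf", torch_dtype=torch.float16
)

# Wrap the base model with a LoRA adapter; only the adapter weights are trainable
lora_cfg = LoraConfig(
    r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()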
By the way, errors sometimes occur when applying a LoRA to a model that is already quantized. If you run into problems applying a LoRA in the quantized state, try applying it to the de-quantized model first, and then quantizing again.
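One way to do that, sketched below under my own assumptions: load the base model in half precision, merge the adapter into it with PEFT’s merge_and_unload, save the merged weights, and then reload them with a quantization config. The local path and the 4-bit settings are placeholders.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

# 1) Apply and merge the adapter on an unquantized (fp16) copy of the base model
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf", torch_dtype=torch.float16
)
merged = PeftModel.from_pretrained(base, "name/llama-2-7b-chat-guanaco").merge_and_unload()
merged.save_pretrained("llama-2-7b-chat-guanaco-merged")

# 2) Reload the merged weights quantized, for inference or for the next training round
quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
model = AutoModelForCausalLM.from_pretrained(
    "llama-2-7b-chat-guanaco-merged",
    quantization_config=quant_config,
    device_map="auto",
)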