Hello, I'm currently using the HF Trainer + LoRA + DeepSpeed ZeRO-2 for training. The problem is that the F1 score I get during evaluation while training is higher than the F1 I get on the same data after I merge the LoRA adapter and test the merged model. I also set load_best_model_at_end. Can anybody tell me what my merge step is doing wrong?
trainer = Trainer(
    model=model,
    args=train_args,
    data_collator=collator,
    train_dataset=train_dataloader,
    eval_dataset=val_dataloader if training_args.do_eval else None,
    compute_metrics=compute_metrics,  # if training_args.do_eval else None
)

if training_args.do_eval and callback_args.get("patient", None):
    trainer.add_callback(
        EarlyStoppingCallback(early_stopping_patience=callback_args.patient)
    )

# Training
checkpoint = False
if training_args.resume_from_checkpoint is not None:
    checkpoint = training_args.resume_from_checkpoint
trainer.train(resume_from_checkpoint=checkpoint)

# Merge the LoRA weights into the base model on rank 0 and save the merged checkpoint
if trainer.is_local_process_zero() and use_lora:
    model = trainer.model.merge_and_unload()
    model.save_pretrained(
        f"{training_args.output_dir}/merged_model",
        safe_serialization=True,
        max_shard_size="5GB",  # adjust this value based on your needs
    )
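
For what it's worth, this is a sanity check I can run right before saving (only a sketch, not my actual code): it prints which checkpoint load_best_model_at_end restored, then compares logits on one eval batch before and after merging. It assumes the model returns .logits, that I can pull a collated batch from the trainer's eval dataloader, and that I would then save merged directly instead of calling merge_and_unload() a second time:

import torch

# Which checkpoint did load_best_model_at_end restore?
print("best metric:", trainer.state.best_metric)
print("best checkpoint:", trainer.state.best_model_checkpoint)

# Compare logits before and after merging on a single eval batch.
trainer.model.eval()
batch = next(iter(trainer.get_eval_dataloader()))
batch = {k: v.to(trainer.model.device) for k, v in batch.items() if torch.is_tensor(v)}

with torch.no_grad():
    logits_adapter = trainer.model(**batch).logits.float().cpu()

merged = trainer.model.merge_and_unload()  # folds the LoRA deltas into the base weights
merged.eval()
with torch.no_grad():
    logits_merged = merged(**batch).logits.float().cpu()

# If the merge itself is lossless, this difference should be down at bf16 rounding level.
print("max abs logits diff:", (logits_adapter - logits_merged).abs().max().item())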
Here is my training config:
data_args:
  train_data:
    # annotate_path: "data/public_train/ocr_llm_fix.json"
    # image_path: "data/public_train/train-images"
    annotate_path: "data/warn_up/ocr_llm.json"
    image_path: "data/warn_up/warmup-images"
  val_data:
    annotate_path: "data/warn_up/ocr_llm.json"
    image_path: "data/warn_up/warmup-images"
  max_length: 2500
  # min_pixels: 200704
  # max_pixels: 1003520
  # cache_dir: "./data_cache"

training_args:
  freeze_base: False
  bf16: True
  use_lora: True
  learning_rate: 5e-5
  lr_scheduler_type: "cosine"
  warmup_ratio: 0.1
  weight_decay: 0.0
  gradient_checkpointing: true
  gradient_checkpointing_kwargs: {"use_reentrant": False}
  save_safetensors: False
  torch_compile_backend: "cudagraphs"
  use_liger_kernel: False  # BUG if True
  quantization: 0  # 0 = no quantization, 4 = 4-bit, 8 = 8-bit
  output_dir: "model/dump/dummp_1"
  overwrite_output_dir: true
  num_train_epochs: 1
  per_device_train_batch_size: 2
  per_device_eval_batch_size: 2
  gradient_accumulation_steps: 8
  dataloader_num_workers: 36
  do_train: true
  do_eval: true
  eval_strategy: "steps"  # evaluation is done (and logged) every eval_steps
  logging_strategy: "steps"
  save_strategy: "steps"
  eval_steps: 1
  save_steps: 1
  save_total_limit: 5
  # seed: null
  label_names: ["labels"]
  resume_from_checkpoint: False
  load_best_model_at_end: True
  prediction_loss_only: False
  metric_for_best_model: "f1"
  neftune_noise_alpha: 5

model_args:
  base_model: "LLaMA-Factory/models/qwen2_vl_lora_sft_v2"
  extra_layers: 4
  num_class: 4
  model_kwargs:
    token: "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"
    use_cache: False
    attn_implementation: "flash_attention_2"  # flash_attention_2, sdpa, eager

attn_implementation: "flash_attention_2"
tokenizer_args:
  cache_dir: "./tokenizer_data"
  token: "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"

lora_args:
  use_dora: False
  r: 16
  lora_alpha: 32
  lora_dropout: 0.05
  bias: "none"
  target_modules: [
    .*\.down_proj,  # Add $ to match end of string
    .*\.gate_proj,
    .*\.v_proj,
    .*\.up_proj,
    .*\.k_proj,
    .*\.o_proj,
    .*\.q_proj
  ]  # "all-linear"
  modules_to_save: [
    ^encoder_layers.*,  # Add ^ to match start of string
    ^classification_layer
  ]

wandb_args:
  run_name: "qwen2-vl-reasoning-tuned-based_dump"  # "My goodfellas"
  logging_steps: 1
  report_to: "wandb"

callback_args:
  patient: null

huggingface_args:
  push_to_hub: false
  hub_private_repo: true
  hub_model_id: "qwen2vl7b-cls"
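
And this is roughly how I test the merged checkpoint afterwards. It is only a sketch: MyQwen2VLClassifier and build_eval_dataloader are placeholders for my custom model class (Qwen2-VL base + extra_layers + classification head) and my data code, and the F1 averaging has to match whatever compute_metrics uses during training:

import torch
from sklearn.metrics import f1_score

# Load the merged checkpoint with the same custom class used for training.
model = MyQwen2VLClassifier.from_pretrained(
    "model/dump/dummp_1/merged_model",
    torch_dtype=torch.bfloat16,
).cuda().eval()

preds, refs = [], []
for batch in build_eval_dataloader():  # same val data, collator and preprocessing as during training
    labels = batch.pop("labels")
    batch = {k: v.cuda() for k, v in batch.items() if torch.is_tensor(v)}
    with torch.no_grad():
        logits = model(**batch).logits
    preds.extend(logits.argmax(dim=-1).cpu().tolist())
    refs.extend(labels.tolist())

# average= must match what compute_metrics uses (e.g. "macro" vs "weighted").
print("f1:", f1_score(refs, preds, average="macro"))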