Frequent loss spiking when continuing training an LLM

Hi everyone, I am trying to continue training Llama-2-hf (both the chat and non-chat versions) on a custom dataset crawled from the internet, using Transformers together with PEFT and QLoRA. However, across different configs I keep getting a very weird loss curve, like this:

  • Chat version: Chat_version (attached loss curve).
  • Non-chat version: a similar pattern, but I can't attach the image because I'm a new member.

The configs I use for each version (a rough code sketch of how they map onto the Transformers/PEFT objects follows both blocks):

  • Chat:
train_name: baseline
model_source: NousResearch
model_name: Llama-2-7b-chat-hf
bnb_cfg:
  load_in_4bit: True
  bnb_4bit_use_double_quant: True
  bnb_4bit_quant_type: nf4
  bnb_4bit_compute_dtype: torch.bfloat16
lora_cfg:
  peft_type: null
  auto_mapping: null
  base_model_name_or_path: null
  revision: null
  task_type: CAUSAL_LM
  inference_mode: False
  r: 64
  target_modules: null
  lora_alpha: 32
  lora_dropout: 0.05
  fan_in_fan_out: False
  bias: none
  modules_to_save: null
  init_lora_weights: True
  layers_to_transform: null
  layers_pattern: null
train_cfg:
  per_device_train_batch_size: 8
  gradient_accumulation_steps: 16
  warmup_ratio: 0.03
  max_steps: -1
  learning_rate: 8.e-5
  weight_decay: 0.0001
  fp16: True
  logging_steps: 1
  num_train_epochs: 5
  optim: paged_adamw_32bit
  evaluation_strategy: steps
  lr_scheduler_type: constant
  do_train: True
  do_eval: True
  eval_steps: 200
  save_strategy: steps
  save_steps: 100
  group_by_length: True
  dataloader_num_workers: 0
  dataloader_drop_last: True
  ddp_find_unused_parameters: False
max_seq_length: 512
  • Non-chat:
train_name: baseline
model_source: NousResearch
model_name: Llama-2-7b-hf
bnb_cfg:
  load_in_4bit: True
  bnb_4bit_use_double_quant: True
  bnb_4bit_quant_type: nf4
  bnb_4bit_compute_dtype: torch.bfloat16
lora_cfg:
  peft_type: null
  auto_mapping: null
  base_model_name_or_path: null
  revision: null
  task_type: CAUSAL_LM
  inference_mode: False
  r: 16  # experimented with rank = 8 but saw a similar loss pattern
  target_modules: null
  lora_alpha: 32
  lora_dropout: 0.05
  fan_in_fan_out: False
  bias: none
  modules_to_save: null
  init_lora_weights: True
  layers_to_transform: null
  layers_pattern: null
train_cfg:
  per_device_train_batch_size: 8
  gradient_accumulation_steps: 16
  warmup_ratio: 0.03
  max_steps: -1
  learning_rate: 8.e-5
  weight_decay: 0.0001
  fp16: True
  logging_steps: 1
  num_train_epochs: 5
  optim: paged_adamw_32bit
  evaluation_strategy: steps
  lr_scheduler_type: constant
  do_train: True
  do_eval: True
  eval_steps: 200
  save_strategy: steps
  save_steps: 100
  group_by_length: True
  dataloader_num_workers: 0
  dataloader_drop_last: True
  ddp_find_unused_parameters: False
max_seq_length: 512
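
For context, the YAML above roughly corresponds to building the following Transformers/PEFT objects. This is an illustrative sketch rather than my exact training script; the model id, output_dir, and the overall wiring are placeholders.

```python
# Illustrative sketch only: how the YAML values above map onto Transformers/PEFT
# objects. The class/argument names are real transformers/peft APIs; the model id,
# output_dir, and the overall wiring are placeholders, not the exact script.
import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, TrainingArguments)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "NousResearch/Llama-2-7b-chat-hf"   # model_source / model_name

# bnb_cfg block -> 4-bit NF4 quantization with double quantization
bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# lora_cfg block -> LoRA adapter config (r=64 for chat, r=16 for non-chat)
lora_cfg = LoraConfig(
    task_type="CAUSAL_LM",
    inference_mode=False,
    r=64,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    target_modules=None,   # null -> PEFT's built-in default for Llama
)

# train_cfg block -> TrainingArguments (output_dir is a placeholder)
train_args = TrainingArguments(
    output_dir="baseline",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=16,
    warmup_ratio=0.03,
    max_steps=-1,
    learning_rate=8e-5,
    weight_decay=1e-4,
    fp16=True,
    logging_steps=1,
    num_train_epochs=5,
    optim="paged_adamw_32bit",
    evaluation_strategy="steps",
    eval_steps=200,
    save_strategy="steps",
    save_steps=100,
    lr_scheduler_type="constant",
    group_by_length=True,
    dataloader_num_workers=0,
    dataloader_drop_last=True,
    ddp_find_unused_parameters=False,
)

# Load the quantized base model and attach the LoRA adapter.
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_cfg, device_map="auto"
)
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_cfg)
tokenizer = AutoTokenizer.from_pretrained(model_id)
# max_seq_length: 512 is applied at the tokenization / trainer level
# (e.g. a trainer's max_seq_length), not via TrainingArguments.
```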

Can anyone explain why this is happening and suggest ways to improve the continued-training process? Thanks a lot, everyone! I am looking forward to your responses!


@hunggggg were you able to solve this? I am facing a similar issue, can you please help?

It seems odd to me that you aren't targeting any modules with LoRA.
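
As far as I know, with target_modules: null PEFT falls back to its built-in default for Llama (just q_proj and v_proj), so the adapter only touches two projections. If you want LoRA on all the attention and MLP projections you have to list them explicitly. A rough sketch; the module names below are the standard Llama-2 projection layers, and including all of them is my suggestion, not something from your original config:

```python
# Sketch: pinning LoRA to explicit modules instead of relying on PEFT's default.
# Module names are the standard Llama-2 attention/MLP projections; including all
# of them is a suggestion, not something taken from the original config.
from peft import LoraConfig

lora_cfg = LoraConfig(
    task_type="CAUSAL_LM",
    r=64,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention projections
        "gate_proj", "up_proj", "down_proj",      # MLP projections
    ],
)
```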