Hi everyone, I am trying to continue training Llama-2-hf (both the chat and non-chat versions) on a custom dataset crawled from the internet, using Transformers together with PEFT and QLoRA. However, across different configs I keep getting a very weird loss curve, like this:
- Chat version: [loss curve screenshot]
- Non-chat version: a similar pattern, but I can't attach the image because I'm a new member.
The configs I use for each version are below; a rough sketch of how they are wired into the code follows both blocks.
- Chat:
train_name: baseline
model_source: NousResearch
model_name: Llama-2-7b-chat-hf
bnb_cfg:
load_in_4bit: True
bnb_4bit_use_double_quant: True
bnb_4bit_quant_type: nf4
bnb_4bit_compute_dtype: torch.bfloat16
lora_cfg:
peft_type: null
auto_mapping: null
base_model_name_or_path: null
revision: null
task_type: CAUSAL_LM
inference_mode: False
r: 64
target_modules: null
lora_alpha: 32
lora_dropout: 0.05
fan_in_fan_out: False
bias: none
modules_to_save: null
init_lora_weights: True
layers_to_transform: null
layers_pattern: null
train_cfg:
per_device_train_batch_size: 8
gradient_accumulation_steps: 16
warmup_ratio: 0.03
max_steps: -1
learning_rate: 8.e-5
weight_decay: 0.0001
fp16: True
logging_steps: 1
num_train_epochs: 5
optim: paged_adamw_32bit
evaluation_strategy: steps
lr_scheduler_type: constant
do_train: True
do_eval: True
eval_steps: 200
save_strategy: steps
save_steps: 100
group_by_length: True
dataloader_num_workers: 0
dataloader_drop_last: True
ddp_find_unused_parameters: False
max_seq_length: 512
- Non-chat:
train_name: baseline
model_source: NousResearch
model_name: Llama-2-7b-hf
bnb_cfg:
load_in_4bit: True
bnb_4bit_use_double_quant: True
bnb_4bit_quant_type: nf4
bnb_4bit_compute_dtype: torch.bfloat16
lora_cfg:
peft_type: null
auto_mapping: null
base_model_name_or_path: null
revision: null
task_type: CAUSAL_LM
inference_mode: False
r: 16  # also experimented with rank = 8, but saw a similar loss pattern
target_modules: null
lora_alpha: 32
lora_dropout: 0.05
fan_in_fan_out: False
bias: none
modules_to_save: null
init_lora_weights: True
layers_to_transform: null
layers_pattern: null
train_cfg:
per_device_train_batch_size: 8
gradient_accumulation_steps: 16
warmup_ratio: 0.03
max_steps: -1
learning_rate: 8.e-5
weight_decay: 0.0001
fp16: True
logging_steps: 1
num_train_epochs: 5
optim: paged_adamw_32bit
evaluation_strategy: steps
lr_scheduler_type: constant
do_train: True
do_eval: True
eval_steps: 200
save_strategy: steps
save_steps: 100
group_by_length: True
dataloader_num_workers: 0
dataloader_drop_last: True
ddp_find_unused_parameters: False
max_seq_length: 512
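For context, these fields map onto a fairly standard Transformers + PEFT + TRL QLoRA setup, roughly like the sketch below. This is a simplified reconstruction, not my exact script: the tiny in-memory datasets, the `output_dir`, and the `"text"` column name are placeholders, and on newer trl versions `max_seq_length` / `dataset_text_field` move from `SFTTrainer` to `SFTConfig`.

```python
import torch
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer

model_id = "NousResearch/Llama-2-7b-chat-hf"  # "NousResearch/Llama-2-7b-hf" for the non-chat run

# Placeholder data for this sketch; in practice this is the crawled corpus with a "text" column.
train_dataset = Dataset.from_dict({"text": ["example document 1", "example document 2"]})
eval_dataset = Dataset.from_dict({"text": ["held-out document"]})

# bnb_cfg
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# lora_cfg (target_modules left unset so PEFT falls back to its Llama defaults, q_proj/v_proj)
lora_config = LoraConfig(
    r=64,  # 16 (or 8) for the non-chat run
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# train_cfg
training_args = TrainingArguments(
    output_dir="./baseline",  # placeholder, from train_name
    per_device_train_batch_size=8,
    gradient_accumulation_steps=16,
    warmup_ratio=0.03,
    max_steps=-1,
    learning_rate=8e-5,
    weight_decay=1e-4,
    fp16=True,
    logging_steps=1,
    num_train_epochs=5,
    optim="paged_adamw_32bit",
    evaluation_strategy="steps",
    eval_steps=200,
    lr_scheduler_type="constant",
    do_train=True,
    do_eval=True,
    save_strategy="steps",
    save_steps=100,
    group_by_length=True,
    dataloader_num_workers=0,
    dataloader_drop_last=True,
    ddp_find_unused_parameters=False,
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    dataset_text_field="text",  # assuming the dataset exposes a "text" column
    max_seq_length=512,
    packing=False,
)
trainer.train()
```

The non-chat run uses the same code with only `model_id` and `r` swapped, as noted in the comments.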
Can anyone explain why this is happening and suggest how I can improve the continued-training process? Thanks a lot, everyone! I am looking forward to your responses!