I’m trying to fine-tune a model over several days because of time limitations: a few epochs one day, a few epochs the next, and so on. However, every time I load the adapter config file produced by the previous training session, the model that loads behaves like the base model, as if no fine-tuning had occurred! I’m not sure what is happening. Does anyone have advice on how to fix this? Could it be a result of my saving strategy, or of using patience (early stopping)?
My model loading and training setup is as follows:
```python
# quantization_config, data_collator, new_dataset, valid_set, peft_params,
# tokenizer, callback, and huggingface_token are defined earlier in my script.
from transformers import AutoModelForCausalLM, TrainingArguments
from trl import SFTTrainer

# Load the model saved by the previous training session
adapter_path = "./model"
model = AutoModelForCausalLM.from_pretrained(
    adapter_path,
    quantization_config=quantization_config,
    device_map={"": 0},
    token=huggingface_token,
)
model.config.use_cache = False
model.config.pretraining_tp = 1

training_params = TrainingArguments(
    evaluation_strategy="epoch",
    save_strategy="epoch",
    logging_strategy="epoch",
    num_train_epochs=3,
    output_dir="./newresults",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=80,
    optim="adamw_torch",
    learning_rate=1e-5,
    weight_decay=0.002,
    fp16=True,
    bf16=False,
    max_grad_norm=0.3,
    warmup_ratio=0.03,
    group_by_length=True,
    lr_scheduler_type="constant",
    report_to="tensorboard",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

trainer = SFTTrainer(
    model=model,
    data_collator=data_collator,
    train_dataset=new_dataset,
    peft_config=peft_params,
    dataset_text_field="input",
    tokenizer=tokenizer,
    args=training_params,
    eval_dataset=valid_set,
    packing=False,
    callbacks=[callback],
)

trainer.train()
trainer.save_model("./model")
```
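
For reference, `callback` in the trainer above is an early-stopping callback based on patience; roughly like the following (the patience value shown here is just a placeholder, not my exact setting):

```python
from transformers import EarlyStoppingCallback

# Stop training if eval_loss does not improve for N evaluations (placeholder value)
callback = EarlyStoppingCallback(early_stopping_patience=3)
```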
This should save a checkpoint at every epoch in my local ./newresults directory, and the final fine-tuned model in ./model. But when I try to load from either directory for the next round of training, the model that is loaded is the base model, not the fine-tuned one. What might be the reason? Also, is there a way to tell which model has been loaded before training starts? Right now I can only tell because the re-loaded model's loss after an epoch of training is exactly what it was after the very first round of training from the base model.
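
Beyond watching the loss, the only check I could think of is to look for LoRA parameters on the loaded model before calling `trainer.train()`, roughly like this, though I'm not sure it's a reliable test:

```python
# Rough check: a PEFT/LoRA-wrapped model exposes parameters with "lora" in
# their names; a plain base model usually does not.
lora_params = [n for n, _ in model.named_parameters() if "lora" in n.lower()]
print(f"LoRA parameter tensors found: {len(lora_params)}")
print("Adapter appears to be attached" if lora_params else "This looks like the plain base model")
```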