Hi all,
I’m a relative beginner, so I might not know exactly what I’m doing here, but after reading some code and documentation I tried my hand at fine-tuning a Llama-based model for multi-label regression. During training the loss looked really good at approximately 0.05, but at inference time it looks like the loss is a lot higher. For training I used QLoRA, approximately as follows:
import torch
from transformers import (AutoConfig, AutoModelForSequenceClassification,
                          AutoTokenizer, BitsAndBytesConfig)

base_model_id = "GeneZC/MiniChat-3B"

# 4-bit NF4 quantization config for QLoRA
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Regression head with 6 labels
config = AutoConfig.from_pretrained(
    base_model_id,
    num_labels=6,
    problem_type="regression",
    finetuning_task="custom",
)

model = AutoModelForSequenceClassification.from_pretrained(
    base_model_id,
    device_map={"": torch.cuda.current_device()},
    config=config,
    quantization_config=nf4_config,
)
tokenizer = AutoTokenizer.from_pretrained(base_model_id, trust_remote_code=True)
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
        "lm_head",
    ],
    bias="none",
    lora_dropout=0.05,  # Conventional
    task_type="CAUSAL_LM",
)

model.enable_input_require_grads()  # needed when training with gradient checkpointing
model = get_peft_model(model, lora_config)
import transformers
from datetime import datetime

# Train
trainer = transformers.Trainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    args=transformers.TrainingArguments(
        output_dir=output_dir,  # output_dir and run_name are defined elsewhere
        warmup_steps=1,
        per_device_train_batch_size=12,
        per_device_eval_batch_size=24,
        gradient_accumulation_steps=1,
        gradient_checkpointing=True,
        max_steps=500,
        learning_rate=5e-5,
        fp16=True,
        optim="adamw_8bit",
        logging_steps=5,
        logging_dir="./logs",
        save_strategy="steps",
        save_steps=50,
        evaluation_strategy="steps",
        eval_steps=100,
        do_eval=True,
        report_to="wandb",
        run_name=f"{run_name}-{datetime.now().strftime('%Y-%m-%d-%H-%M')}",
    ),
    data_collator=transformers.default_data_collator,
)
trainer.train()

# Merge the LoRA adapters back into the base model for inference
merged_model = trainer.model.merge_and_unload()
I got a loss curve that settled around 0.05, which, as far as I can tell, is computed as MSE (mean squared error) for problem_type="regression" per transformers/src/transformers/models/llama/modeling_llama.py on GitHub.
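As I read that file, the model pushes the hidden state at each sequence's last non-padding token through its score head and applies nn.MSELoss between those pooled logits and the labels. Here is a self-contained sketch of just that loss step, with random tensors standing in for the real score-head outputs and targets (shapes match my batch size of 12 and 6 labels), so I may be misunderstanding something:

import torch
from torch.nn import MSELoss

batch_size, num_labels = 12, 6
pooled_logits = torch.randn(batch_size, num_labels)  # stand-in for the score-head output at the last non-pad token
labels = torch.randn(batch_size, num_labels)         # stand-in for the regression targets
loss = MSELoss()(pooled_logits, labels)              # the model returns this as its loss, which the Trainer logs
print(loss)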
When I run inference (code below), I get predictions of [-1.4639, -0.5625, 1.0566, 0.2532, -1.2383, -0.3762] instead of the actual labels in the training data, [4.347013533533275, 4.345919104895332, 4.3561177652220024, 4.30447411005213, 4.205659945769777, 4.146060915580687]. This complete mismatch between the model's outputs and the labels holds across a large sample of training data I set aside for manual testing (not seen by the Trainer), and also across a bunch of other random samples. I would expect a mismatch like this to produce a much larger MSE loss than 0.05.
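To put a number on that, here is a quick back-of-the-envelope check using just the two vectors quoted above, with the MSE computed the same way I understand the training loss is:

import torch

preds = torch.tensor([-1.4639, -0.5625, 1.0566, 0.2532, -1.2383, -0.3762])
labels = torch.tensor([4.347013533533275, 4.345919104895332, 4.3561177652220024,
                       4.30447411005213, 4.205659945769777, 4.146060915580687])
print(torch.nn.functional.mse_loss(preds, labels))  # ≈ 22.5, nowhere near 0.05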
I performed inference using this code:
merged_model.eval()
with torch.no_grad():
    # Ten training examples held back for manual checking
    batch = dataset["train"][10000:10010]
    input_ids = torch.tensor(batch["input_ids"], dtype=torch.long).to(merged_model.device)
    attention_mask = torch.tensor(batch["attention_mask"], dtype=torch.long).to(merged_model.device)
    outputs = merged_model(input_ids=input_ids, attention_mask=attention_mask)
    logits = outputs.logits
Did I do something wrong, or is my model simply not good enough to give me accurate results?