Fine-tuning an LLM for regression yields low loss during training but not at inference?

Hi all,

I’m a relative beginner, so I might not know exactly what I’m doing here, but after reading some code and documentation I tried my hand at fine-tuning a Llama-based model for multi-label regression. During training the loss looked really good at approximately 0.05, but at inference the loss appears to be a lot higher. For training I used QLoRA, approximately as follows:

import torch
from transformers import (AutoConfig, AutoModelForSequenceClassification,
                          AutoTokenizer, BitsAndBytesConfig)

base_model_id = "GeneZC/MiniChat-3B"

# 4-bit NF4 quantization for QLoRA
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
config = AutoConfig.from_pretrained(
    base_model_id,
    num_labels=6,
    problem_type='regression',
    finetuning_task='custom'
)

model = AutoModelForSequenceClassification.from_pretrained(
    base_model_id,
    config=config,
    quantization_config=nf4_config,
    device_map={"": torch.cuda.current_device()},
)
tokenizer = AutoTokenizer.from_pretrained(base_model_id, trust_remote_code=True)

from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
        "lm_head",
    ],
    bias="none",
    lora_dropout=0.05,  # Conventional
    task_type="CAUSAL_LM",
)
model.enable_input_require_grads()
model = get_peft_model(model, config)
import transformers
from datetime import datetime

# Train

trainer = transformers.Trainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    args=transformers.TrainingArguments(
        output_dir=output_dir,
        warmup_steps=1,
        per_device_train_batch_size=12,
        per_device_eval_batch_size=24,
        gradient_accumulation_steps=1,
        gradient_checkpointing=True,
        max_steps=500,
        learning_rate=5e-5, 
        fp16=True,
        optim='adamw_8bit',
        logging_steps=5,
        logging_dir="./logs",  
        save_strategy="steps",     
        save_steps=50, 
        evaluation_strategy="steps",
        eval_steps=100,               
        do_eval=True,
        report_to="wandb",
        run_name=f"{run_name}-{datetime.now().strftime('%Y-%m-%d-%H-%M')}"
    ),
    data_collator=transformers.default_data_collator,
)

trainer.train()
merged_model = trainer.model.merge_and_unload()

The training loss settled at approximately 0.05, which seems to be computed as mean squared error for regression (per LlamaForSequenceClassification in transformers/src/transformers/models/llama/modeling_llama.py in the huggingface/transformers repo on GitHub).

When I perform inference as shown below, I get predictions of [-1.4639, -0.5625, 1.0566, 0.2532, -1.2383, -0.3762] instead of the actual labels in the training data, [4.347013533533275, 4.345919104895332, 4.3561177652220024, 4.30447411005213, 4.205659945769777, 4.146060915580687]. This complete mismatch between predictions and labels holds for a large sample of data that I set aside for manual testing and never passed to the trainer, and also for a number of random samples the trainer did see. I would expect this to produce a much larger MSE loss than 0.05.
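As a quick sanity check (just reusing the numbers quoted above, with the same MSE criterion that the regression branch of LlamaForSequenceClassification uses):

import torch
import torch.nn.functional as F

# One prediction row and its labels, copied from the example above.
preds = torch.tensor([-1.4639, -0.5625, 1.0566, 0.2532, -1.2383, -0.3762])
labels = torch.tensor([4.347013533533275, 4.345919104895332, 4.3561177652220024,
                       4.30447411005213, 4.205659945769777, 4.146060915580687])

print(F.mse_loss(preds, labels))  # far above the ~0.05 training loss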

I performed inference using this code:

with torch.no_grad():
    preds = dataset['train'][10000:10010]
    # move inputs to the same device as the merged model
    tensor_ids = torch.tensor(preds["input_ids"], dtype=torch.long).to(merged_model.device)
    attention_mask = torch.tensor(preds["attention_mask"], dtype=torch.long).to(merged_model.device)
    outputs = merged_model(input_ids=tensor_ids, attention_mask=attention_mask)
    logits = outputs.logits

Did I do something wrong, or is my model simply not good enough to give me accurate results?

Hi,

I see some issues here:

  • first, for a discriminative task like this it might be better to use a smaller Transformer encoder (like BERT, RoBERTa, etc.) rather than a generative, typically much bigger LLM. Hence I’d recommend starting from a good sequence classifier like roberta-base. Encoder models can attend over the entire text, whereas generative LLMs like Llama use a causal attention mask, so each token can only attend to the tokens before it.
  • you’re loading an AutoModelForSequenceClassification model, which is correct, as it can be used for classification/regression tasks given an input text. However, in your LoraConfig you specify task_type="CAUSAL_LM", which is not correct, since this model does sequence classification rather than causal language modeling. The LoraConfig should use SEQ_CLS as its task type instead of CAUSAL_LM (see the sketch below this list).
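A minimal sketch of that change (assuming MiniChat-3B uses the standard Llama module names; note that a sequence-classification model has a "score" head rather than an "lm_head"):

from peft import LoraConfig, TaskType

lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    # no "lm_head" here: the classification head on Llama-based
    # sequence classifiers is called "score", trained in full precision
    modules_to_save=["score"],
    bias="none",
    lora_dropout=0.05,
    task_type=TaskType.SEQ_CLS,
)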

Hi,

Thanks @nielsr for the guidance. I am also training a regression model, but when I print the labels inside the trainer, I see the input_ids instead of the regression labels. Any idea where it might be interpreting the task as a causal-LM task? I already changed the task type in the LoRA config, and my model is an AutoModelForSequenceClassification.
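To make the symptom concrete, this is roughly the kind of check I mean (a sketch, not my exact code; dataset is the same object I pass to the Trainer):

import transformers

# Collate a few examples the same way the Trainer would and inspect them.
batch = transformers.default_data_collator([dataset["train"][i] for i in range(4)])

print(batch.keys())            # expecting: input_ids, attention_mask, labels
print(batch["labels"].dtype)   # should be a float dtype for regression, not int64
print(batch["labels"][:2])     # should be regression targets, not token ids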