Poor results (val_loss) when fine-tuning NLLB-200-600M with LoRA for French-Wolof translation

I am getting poor validation loss while fine-tuning the NLLB-200-600M model with LoRA for French-Wolof translation. Despite training for several epochs, the validation loss remains significantly higher than the training loss, which suggests the model struggles to generalize to unseen data. The training loss decreases steadily, but the validation loss does not improve as expected.

I am using the following setup:

  • Model: NLLB-200-600M
  • Dataset: French-Wolof translation pairs
  • Fine-tuning method: LoRA (Low-Rank Adaptation)
  • Learning rate: 2e-5
  • Batch size: 16

I would appreciate any advice on why the model might be underperforming on the validation set and how to improve the fine-tuning process.


Could this possibly be related to the following issue? If so, it is a bug in the PEFT library and seems to be fixed in the latest version.

Most likely not. As far as I checked, the bug only affected the prompt-learning PEFT methods such as Prefix Tuning, Prompt Tuning, P-Tuning, and MPT. Also, the problem was only with the generate function, not the forward function (both train and eval loss were decreasing).
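If you still want to rule it out, a quick check of the installed PEFT version costs nothing (and upgrading with pip install -U peft if it predates the fix):

import peft

# Print the installed PEFT version
print(peft.__version__)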

This seems more like an overfitting issue. I would suggest adding weight decay, and also sharing more details, such as what the preprocessed data looks like and which trainer and training arguments you are using.
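For example, a minimal sketch of adding weight decay via the training arguments (the output directory and values below are only placeholders, not tuned recommendations):

from transformers import Seq2SeqTrainingArguments

# Placeholder sketch: weight_decay adds regularization on top of LoRA dropout
args = Seq2SeqTrainingArguments(
    "nllb-200-600M-fr-wo-lora",      # hypothetical output directory
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,               # the suggested weight decay
    evaluation_strategy="steps",
    eval_steps=500,
    load_best_model_at_end=True,
)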


Thank you for the response. Here is my configuration:

model_checkpoint = 'facebook/nllb-200-distilled-600M'

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load model and tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

# Apply LoRA

from peft import get_peft_model, LoraConfig

# LoRA configuration
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    bias="none",
    task_type="SEQ_2_SEQ_LM",
    target_modules=["q_proj", "v_proj"]
)

# Wrap the model with LoRA
model = get_peft_model(model, lora_config)

# Quick sanity check: tokenize a pair of sentences
tokenizer(["This is a sentence!", "This is another sentence."])

# Sequence length limits and the keys used inside each "translation" dict
max_length = 128
max_input_length = 128
max_target_length = 128
source_lang = "src"
target_lang = "tgt"

def preprocess_function(examples):
    inputs = [ex[source_lang] for ex in examples["translation"]]
    targets = [ex[target_lang] for ex in examples["translation"]]
    
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True, padding=True)

    # Tokenize the targets to use as labels
    labels = tokenizer(targets, max_length=max_target_length, truncation=True, padding=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Example usage on the first training example
# (`data` is the DatasetDict of French-Wolof pairs; its loading is not shown here)
preprocess_function(data['train'][:1])

# Tokenize the entire dataset
tokenized_dataset = data.map(preprocess_function, batched=True, batch_size=100)

from transformers import (DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer, EarlyStoppingCallback)

# Training arguments
batch_size = 16
model_name = model_checkpoint.split("/")[-1]
source_lang = 'fr'
target_lang = 'wo'

model_checkpoint = "models1/{}-finetuned-{}-to-{}".format(model_name, source_lang, target_lang)

args = Seq2SeqTrainingArguments(
    output_dir,
    evaluation_strategy="steps",
    eval_steps=500,
    save_steps=1000,
    learning_rate=5e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=5,
    num_train_epochs=3,
    predict_with_generate=True,
    report_to='all',
    lr_scheduler_type="linear",
    warmup_steps=500,
    load_best_model_at_end=True
)

# Data collator for sequence-to-sequence models
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

import numpy as np
import evaluate

# Assumed metric: sacreBLEU (compute_metrics below reads result["score"])
metric = evaluate.load("sacrebleu")

# Post-process text by removing unnecessary spaces
def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]
    return preds, labels

# Compute evaluation metrics such as BLEU score
def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    # Replace the -100 ignore index in labels with the pad token id so they can be decoded
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)
    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    
    result = {"bleu": result["score"]}
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    result = {k: round(v, 4) for k, v in result.items()}
    
    return result

# Initialize Seq2SeqTrainer
trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)]
)

# Start training
trainer.train()