Fine-tuning flan-t5-small

Fine-Tuning Flan-T5-Small: Challenges and Unexpected Results

Problem Description

I am fine-tuning the Flan-T5-Small model on a custom dataset using Hugging Face’s transformers library. Despite following recommended practices, the results are not as expected. Here are the key issues and the steps I have taken:

Fine-Tuning Setup

Training Arguments

from datetime import datetime

from transformers import TrainingArguments

output_dir = "./flan-t5-small-finetuned"  # placeholder; my actual run directory differs

training_args = TrainingArguments(
    output_dir=output_dir,
    run_name=f"flan-t5-finetuning-{datetime.now().strftime('%Y%m%d_%H%M%S')}",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,   # effective batch size of 8
    num_train_epochs=3,
    save_steps=500,
    logging_steps=50,
    save_total_limit=1,
    fp16=False,
    dataloader_num_workers=0,
    gradient_checkpointing=True,     # trades extra compute for lower memory use
    report_to=[],
    resume_from_checkpoint=False,
    max_grad_norm=1.0,
    optim='adamw_torch',
    torch_compile=False,
    learning_rate=2e-5,
    weight_decay=0.05,
    lr_scheduler_type="cosine",
    warmup_steps=100
)
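
For completeness, this is roughly how I pass these arguments to the Trainer (a simplified sketch; train_dataset stands in for my tokenized dataset, which is built later):

from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Trainer,
)

model_name = "google/flan-t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Pads inputs and labels per batch; padded label positions are set to -100
# so they are ignored by the loss.
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, label_pad_token_id=-100)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # tokenized dataset with input_ids and labels
    data_collator=data_collator,
)
trainer.train()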

Dataset Format

The dataset consists of a single column (Combined), where each row contains:

  • An instruction, extracted topic, and the expected output.
  • Example (see the assembly sketch after this list):
    Input: Write a detailed summary of Autism in 23 sentences, each with about 22 words.
    Topic: Autism
    Output: [Expected output text]
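
To make the format concrete, here is a rough sketch of how each Combined row is put together (the source column names are illustrative, not my actual ones):

import pandas as pd

# Illustrative source fields; my real CSV uses different column names.
df = pd.DataFrame({
    "instruction": ["Write a detailed summary of Autism in 23 sentences, each with about 22 words."],
    "topic": ["Autism"],
    "target": ["[Expected output text]"],
})

# Each row of the single Combined column concatenates all three fields.
df["Combined"] = (
    "Input: " + df["instruction"]
    + "\nTopic: " + df["topic"]
    + "\nOutput: " + df["target"]
)
print(df["Combined"].iloc[0])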
    

Observations

  1. Training Loss Behavior:

    • Loss starts high (e.g., 25) and drops rapidly to 1.5 within the first 450 iterations.
    • Suggests the model learns quickly but may not generalize well.
  2. Unstable Results:

    • Outputs are repetitive or lack coherence despite low training loss.

Troubleshooting Steps

  • Reduced learning rate to 2e-5 with a cosine decay scheduler.
  • Increased weight decay to 0.05 for better regularization.
  • Introduced gradient checkpointing to manage memory constraints and enable larger models.

Example Training Progress

[ 12/1356 00:09 < 20:15, 1.11 it/s, Epoch 0.02/3]
Step    Training Loss
50      23.682300
100     26.042400
150     12.555300
200     10.276300

Actual Output:

Input Text: 
Instructions: Write a detailed summary on Topic in 23 sentences, each with about 22 words.

Topic: Autism
Output:

Generated Output: Instructions: Write a detailed summary on Topic in 23 sentences, each with about 22 words. Topic: Autism
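
For reference, the generation call that produced this looks roughly like the following (simplified; the checkpoint path is a placeholder):

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

ckpt = "./flan-t5-small-finetuned/checkpoint-500"  # placeholder path
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForSeq2SeqLM.from_pretrained(ckpt)

prompt = (
    "Instructions: Write a detailed summary on Topic in 23 sentences, "
    "each with about 22 words.\nTopic: Autism\nOutput:"
)
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=512)
# Currently this just echoes the instruction and topic back.
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))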

Please help me understand where I am going wrong?

  • Problem: The dataset is an unstructured concatenation of “Instruction,” “Topic,” and “Output” without clear demarcation. This can confuse the model during training, leading it to replicate the input text or fail to generalize.
  • Solution:
  • Add Clear Separators: Use explicit separators like <instruction>, <topic>, and <output> in the dataset. For example:


<instruction> Write a detailed summary on <topic> Autism in 23 sentences, each with about 22 words. <output> [Expected Output]
  • Ensure your tokenized input-output pairs are clear and well-structured (a rough sketch follows).
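
A rough sketch of what that preprocessing could look like (the separator strings and column names are only examples; adapt them to your data):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")

def preprocess(example):
    # Any consistent, unambiguous markers work; these are just examples.
    source = f"<instruction> {example['instruction']} <topic> {example['topic']}"
    target = example["output"]
    model_inputs = tokenizer(source, max_length=512, truncation=True)
    # text_target tokenizes the expected output as decoder labels.
    labels = tokenizer(text_target=target, max_length=512, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# tokenized = dataset.map(preprocess, remove_columns=dataset.column_names)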

Update on Dataset and Tokenization Process

Issue Identified:

The problem was traced to the format of the dataset and the tokenization process.

Changes Made:

  1. Dataset Format:

    • Previously: Single-column dataset with Instruction, Topic, and Output combined.
    • Now: Segregated into two separate columns:
      • Instruction/Topic
      • Output
  2. Tokenization Process:

    • Implemented proper tokenization to convert the Output column into tokenized labels (see the sketch after this list).
    • A YouTube video helped me understand how to structure the tokenization function.
  3. Controlled Model Output:

    • Adjusted the temperature during model inference for better control over the generated outputs.
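
Roughly, the new tokenization function and generation settings look like this (a simplified sketch; the column names and the temperature value are illustrative):

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "google/flan-t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def tokenize_fn(example):
    # The two new dataset columns: "Instruction/Topic" and "Output".
    model_inputs = tokenizer(example["Instruction/Topic"], max_length=512, truncation=True)
    labels = tokenizer(text_target=example["Output"], max_length=512, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Inference with sampling and a lowered temperature for more controlled output.
prompt = "Write a detailed summary on Autism in 23 sentences, each with about 22 words."
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,         # illustrative value
    no_repeat_ngram_size=3,  # also helps reduce repetition
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))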

Results:

  • While the results are not yet ideal, they are showing promise and moving in the desired direction.

Acknowledgment:

Thank you for your help in guiding me through this process!

New Doubt:

I want to fine-tune the model to generate sentences with coherence based on instructions, such as:

Generate 3 sentences of 10 words each on topic X.

  • How can I ensure the fine-tuned model adheres to these specific constraints while maintaining coherence and relevance?
  • Given how I have structured my dataset (as described above), do you think this is a sensible approach, or should I try a different one?