Fine-tuning flan-t5-small

Fine-Tuning Flan-T5-Small: Challenges and Unexpected Results

Problem Description

I am fine-tuning the Flan-T5-Small model on a custom dataset using Hugging Face’s transformers library. Despite following recommended practices, the results are not as expected. Here are the key issues and the steps I have taken:

Fine-Tuning Setup

Training Arguments

from datetime import datetime

from transformers import TrainingArguments

output_dir = "./flan-t5-small-finetuned"  # placeholder; my actual run directory differs

training_args = TrainingArguments(
    output_dir=output_dir,
    run_name=f"flan-t5-finetuning-{datetime.now().strftime('%Y%m%d_%H%M%S')}",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,   # effective batch size of 8
    num_train_epochs=3,
    save_steps=500,
    logging_steps=50,
    save_total_limit=1,
    fp16=False,
    dataloader_num_workers=0,
    gradient_checkpointing=True,     # trades extra compute for lower memory use
    report_to=[],
    resume_from_checkpoint=False,
    max_grad_norm=1.0,
    optim='adamw_torch',
    torch_compile=False,
    learning_rate=2e-5,
    weight_decay=0.05,
    lr_scheduler_type="cosine",
    warmup_steps=100
)
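
For completeness, this is roughly how I pass these arguments to the Trainer (a simplified sketch; train_dataset stands in for my tokenized dataset, which is built later):

from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Trainer,
)

model_name = "google/flan-t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Pads inputs and labels per batch; padded label positions are set to -100
# so they are ignored by the loss.
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, label_pad_token_id=-100)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # tokenized dataset with input_ids and labels
    data_collator=data_collator,
)
trainer.train()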

Dataset Format

The dataset consists of a single column (Combined), where each row contains:

  • An instruction, extracted topic, and the expected output.
  • Example (see the assembly sketch after this list):
    Input: Write a detailed summary of Autism in 23 sentences, each with about 22 words.
    Topic: Autism
    Output: [Expected output text]
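
To make the format concrete, here is a rough sketch of how each Combined row is put together (the source column names are illustrative, not my actual ones):

import pandas as pd

# Illustrative source fields; my real CSV uses different column names.
df = pd.DataFrame({
    "instruction": ["Write a detailed summary of Autism in 23 sentences, each with about 22 words."],
    "topic": ["Autism"],
    "target": ["[Expected output text]"],
})

# Each row of the single Combined column concatenates all three fields.
df["Combined"] = (
    "Input: " + df["instruction"]
    + "\nTopic: " + df["topic"]
    + "\nOutput: " + df["target"]
)
print(df["Combined"].iloc[0])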
    

Observations

  1. Training Loss Behavior:

    • Loss starts high (e.g., 25) and drops rapidly to 1.5 within the first 450 iterations.
    • Suggests the model learns quickly but may not generalize well.
  2. Unstable Results:

    • Outputs are repetitive or lack coherence despite low training loss.

Troubleshooting Steps

  • Reduced learning rate to 2e-5 with a cosine decay scheduler.
  • Increased weight decay to 0.05 for better regularization.
  • Introduced gradient checkpointing to manage memory constraints and enable larger models.

Example Training Progress

[ 12/1356 00:09 < 20:15, 1.11 it/s, Epoch 0.02/3]
Step    Training Loss
50      23.682300
100     26.042400
150     12.555300
200     10.276300

Actual Output:

Input Text: 
Instructions: Write a detailed summary on Topic in 23 sentences, each with about 22 words.

Topic: Autism
Output:

Generated Output: Instructions: Write a detailed summary on Topic in 23 sentences, each with about 22 words. Topic: Autism
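
For reference, the generation call that produced this looks roughly like the following (simplified; the checkpoint path is a placeholder):

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

ckpt = "./flan-t5-small-finetuned/checkpoint-500"  # placeholder path
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForSeq2SeqLM.from_pretrained(ckpt)

prompt = (
    "Instructions: Write a detailed summary on Topic in 23 sentences, "
    "each with about 22 words.\nTopic: Autism\nOutput:"
)
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=512)
# Currently this just echoes the instruction and topic back.
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))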

Please help me understand where I am going wrong?

  • Problem: The dataset is an unstructured concatenation of “Instruction,” “Topic,” and “Output” without clear demarcation. This can confuse the model during training, leading it to replicate the input text or fail to generalize.
  • Solution:
  • Add Clear Separators: Use explicit separators like <instruction>, <topic>, and <output> in the dataset. For example:


<instruction> Write a detailed summary on <topic> Autism in 23 sentences, each with about 22 words. <output> [Expected Output]
  • Ensure your tokenized input-output pairs are clear and well-structured (a rough sketch follows).
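
A rough sketch of what that preprocessing could look like (the separator strings and column names are only examples; adapt them to your data):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")

def preprocess(example):
    # Any consistent, unambiguous markers work; these are just examples.
    source = f"<instruction> {example['instruction']} <topic> {example['topic']}"
    target = example["output"]
    model_inputs = tokenizer(source, max_length=512, truncation=True)
    # text_target tokenizes the expected output as decoder labels.
    labels = tokenizer(text_target=target, max_length=512, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# tokenized = dataset.map(preprocess, remove_columns=dataset.column_names)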

Update on Dataset and Tokenization Process

Issue Identified:

The problem was traced to the format of the dataset and the tokenization process.

Changes Made:

  1. Dataset Format:

    • Previously: Single-column dataset with Instruction, Topic, and Output combined.
    • Now: Segregated into two separate columns:
      • Instruction/Topic
      • Output
  2. Tokenization Process:

    • Implemented proper tokenization to convert the Output column into tokenized labels (see the sketch after this list).
    • A YouTube video helped me understand how to structure the tokenization function.
  3. Controlled Model Output:

    • Adjusted the temperature during model inference for better control over the generated outputs.
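
Roughly, the new tokenization function and generation settings look like this (a simplified sketch; the column names and the temperature value are illustrative):

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "google/flan-t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def tokenize_fn(example):
    # The two new dataset columns: "Instruction/Topic" and "Output".
    model_inputs = tokenizer(example["Instruction/Topic"], max_length=512, truncation=True)
    labels = tokenizer(text_target=example["Output"], max_length=512, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Inference with sampling and a lowered temperature for more controlled output.
prompt = "Write a detailed summary on Autism in 23 sentences, each with about 22 words."
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,         # illustrative value
    no_repeat_ngram_size=3,  # also helps reduce repetition
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))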

Results:

  • While the results are not yet ideal, they are showing promise and moving in the desired direction.

Acknowledgment:

Thank you for your help in guiding me through this process!

New Doubt:

I want to fine-tune the model to generate sentences with coherence based on instructions, such as:

Generate 3 sentences of 10 words each on topic X.

  • How can I ensure the fine-tuned model adheres to these specific constraints while maintaining coherence and relevance?
  • Given how I have structured my dataset (as described above), do you think this is a sensible approach, or should I try a different one?