Hey, thank you again for your detailed answer. Things have become a bit clearer for me now. I realized that the actual token classification case will come later, when I have a different dataset, and that I will first focus on the seq2seq case for my current dataset. Sorry for the confusion. So I am preprocessing my dataset, putting the task specification token in front, and using the whole T5 model:
prefix_s2t = "<fold2AA>"

def preprocess(ex):
    """
    Preprocess examples for seq2seq training.
    Adds the <fold2AA> prefix to source sequences.
    """
    # Add prefix to source sequences (3Di)
    inputs = [f"{prefix_s2t} {src}" for src in ex["src"]]
    targets = ex["tgt"]

    # Tokenize inputs (3Di sequences)
    model_inputs = tokenizer(
        inputs,
        max_length=src_max,
        truncation=True,
        padding=False,  # DataCollator will handle padding
    )

    # Tokenize targets (AA sequences)
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(
            targets,
            max_length=tgt_max,
            truncation=True,
            padding=False,
        )

    # Add labels to model inputs
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
train_processed = train.map(preprocess, remove_columns=train.column_names, batched=True, batch_size=1)
val_processed = val.map(preprocess, remove_columns=val.column_names, batched=True, batch_size=1)
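For reference, a quick sanity check on one preprocessed example (just a rough inspection sketch using the variables above, not part of the training pipeline) can confirm that the <fold2AA> prefix and the labels come out as expected:

# Inspect one preprocessed example (sanity check only)
sample = train_processed[0]
print(tokenizer.convert_ids_to_tokens(sample["input_ids"])[:10])
print(tokenizer.decode(sample["input_ids"], skip_special_tokens=False))
# Labels are still plain token ids here; the -100 masking happens in the collator
print(tokenizer.decode(sample["labels"], skip_special_tokens=False))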
Then I set up the DataCollatorForSeq2Seq and the Seq2SeqTrainingArguments:
data_collator = DataCollatorForSeq2Seq(
    tokenizer=tokenizer,
    model=model,
    padding="max_length",
    max_length=src_max,
    label_pad_token_id=-100,
)
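To double-check the collator, I can also collate a couple of examples by hand (again only a rough inspection sketch) and verify that padded label positions are set to -100, so they are ignored by the loss:

# Collate two preprocessed examples and look at the padded labels
batch = data_collator([train_processed[i] for i in range(2)])
print(batch["input_ids"].shape, batch["labels"].shape)
print(batch["labels"][0])  # padded positions should show -100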
# Training arguments (following the safe code pattern)
training_args = Seq2SeqTrainingArguments(
    output_dir="finetuning_prostt5_safecode",
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    num_train_epochs=100,
    learning_rate=5e-5,
    max_grad_norm=1.0,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    eval_strategy="steps",
    eval_steps=100,  # Adjust based on your dataset size
    save_strategy="steps",
    save_steps=100,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    predict_with_generate=True,
    generation_max_length=tgt_max,
    group_by_length=True,
    fp16=False,
    logging_strategy="steps",
    logging_steps=10,
    logging_first_step=True,
    report_to="none",
    remove_unused_columns=False,  # added
    save_safetensors=False,
)
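The compute_metrics passed to the trainer below is where I measure sequence recovery. To be clear about what I mean by that, here is a minimal sketch of a per-residue recovery metric (not necessarily identical to my actual implementation; the whitespace handling assumes the tokenizer decodes to space-separated residues, and predict_with_generate=True so predictions are generated token ids):

import numpy as np

def compute_metrics(eval_preds):
    # Sketch: fraction of identical residues between prediction and reference
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    # Replace -100 (ignored positions) with the pad token id before decoding
    preds = np.where(preds != -100, preds, tokenizer.pad_token_id)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    recoveries = []
    for pred, ref in zip(decoded_preds, decoded_labels):
        pred_seq = pred.replace(" ", "")  # drop spaces between residue tokens
        ref_seq = ref.replace(" ", "")
        if not ref_seq:
            continue
        matches = sum(p == r for p, r in zip(pred_seq, ref_seq))
        recoveries.append(matches / len(ref_seq))
    return {"seq_recovery": float(np.mean(recoveries)) if recoveries else 0.0}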
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_processed,
    eval_dataset=val_processed,
    data_collator=data_collator,
    processing_class=tokenizer,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
I ran it on a small set of sequences (10 train, 1 val), where the corresponding sequences are relatively short (<50 residues). Over 100 epochs the train_loss went from ~5 to ~1, and the eval_loss from ~2.55 to ~0.4 at the end. However, I am not sure whether there might still be problems, because the sequence recovery is very low during evaluation.
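To narrow down where the low recovery comes from, a spot check like the following (a rough sketch reusing the variables above, run after trainer.train()) shows what the model actually generates for the validation example compared to the reference:

import torch

# Generate for the single validation example and compare to the reference
sample = val_processed[0]
input_ids = torch.tensor([sample["input_ids"]]).to(model.device)
attention_mask = torch.tensor([sample["attention_mask"]]).to(model.device)
with torch.no_grad():
    gen_ids = model.generate(
        input_ids=input_ids,
        attention_mask=attention_mask,
        max_length=tgt_max,
    )
print("pred:", tokenizer.decode(gen_ids[0], skip_special_tokens=True))
# Labels in val_processed are still raw token ids (no -100 masking yet)
print("ref: ", tokenizer.decode(sample["labels"], skip_special_tokens=True))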
Thank you again very much for your patience and help