Finetuning T5 problems

Hey, thank you again for your detailed answer. Things have become a bit clearer for me now. I realized that the actual token classification case will come later, when I have a different dataset, and that for now I will focus on the seq2seq case with my current dataset. Sorry for the confusion. So I am preprocessing my dataset, putting the task specification token in front, and loading the whole T5 model:

prefix_s2t = "<fold2AA>"

def preprocess(ex):
    """
    Preprocess examples for seq2seq training.
    Adds the <fold2AA> prefix to source sequences.
    """
    # Add prefix to source sequences (3Di)
    inputs = [f"{prefix_s2t} {src}" for src in ex["src"]]
    targets = ex["tgt"]

    # Tokenize inputs (3Di sequences)
    model_inputs = tokenizer(
        inputs,
        max_length=src_max,
        truncation=True,
        padding=False,  # DataCollator will handle padding
    )

    # Tokenize targets (AA sequences)
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(
            targets,
            max_length=tgt_max,
            truncation=True,
            padding=False,
        )

    # Add labels to model inputs
    model_inputs["labels"] = labels["input_ids"]

    return model_inputs

train_processed = train.map(preprocess, remove_columns=train.column_names, batched=True, batch_size=1)
val_processed = val.map(preprocess, remove_columns=val.column_names, batched=True, batch_size=1)
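
As a small side note, as_target_tokenizer() still works but is deprecated in recent transformers releases; the targets can also be passed via text_target. Below is a minimal sketch of an equivalent preprocessing function, assuming the same tokenizer, prefix_s2t, src_max and tgt_max as above (preprocess_v2 is just a placeholder name, not something from my script):

def preprocess_v2(ex):
    """Same preprocessing as above, using text_target instead of as_target_tokenizer()."""
    inputs = [f"{prefix_s2t} {src}" for src in ex["src"]]

    # Tokenize inputs (3Di sequences)
    model_inputs = tokenizer(
        inputs,
        max_length=src_max,
        truncation=True,
        padding=False,
    )

    # Tokenize targets (AA sequences) with the tokenizer's target-side settings
    labels = tokenizer(
        text_target=ex["tgt"],
        max_length=tgt_max,
        truncation=True,
        padding=False,
    )

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs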

Then I am using the DataCollatorForSeq2Seq together with the Seq2SeqTrainingArguments:

data_collator = DataCollatorForSeq2Seq(
    tokenizer=tokenizer,
    model=model,
    padding='max_length',
    max_length=src_max,
    label_pad_token_id=-100,
)

# Training arguments (following safe code pattern)
training_args = Seq2SeqTrainingArguments(
    output_dir="finetuning_prostt5_safecode",
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    num_train_epochs=100,
    learning_rate=5e-5,
    max_grad_norm=1.0,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    eval_strategy="steps",
    eval_steps=100,  # Adjust based on your dataset size
    save_strategy="steps",
    save_steps=100,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    predict_with_generate=True,
    generation_max_length=tgt_max,
    group_by_length=True,
    fp16=False,
    logging_strategy="steps",
    logging_steps=10,
    logging_first_step=True,
    report_to="none",
    remove_unused_columns=False,  # added
    save_safetensors=False,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_processed,
    eval_dataset=val_processed,
    data_collator=data_collator,
    processing_class=tokenizer,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
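
For reference, since compute_metrics is referenced above but not shown, here is a minimal sketch of how sequence recovery could be computed from the generated predictions. It assumes predict_with_generate=True (as in the arguments above) and that residues come back whitespace-separated from the tokenizer, as with ProstT5-style input formatting; the function and metric names are my own, not necessarily what my actual script does:

import numpy as np

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    # Predictions can come as a tuple when the model returns extra outputs
    if isinstance(preds, tuple):
        preds = preds[0]

    # Replace the -100 loss padding before decoding, otherwise decoding breaks
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    preds = np.where(preds != -100, preds, tokenizer.pad_token_id)

    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Sequence recovery: fraction of positions where prediction and target agree
    recoveries = []
    for pred, label in zip(decoded_preds, decoded_labels):
        pred_res = pred.split()
        label_res = label.split()
        matches = sum(p == t for p, t in zip(pred_res, label_res))
        recoveries.append(matches / max(len(label_res), 1))

    return {"seq_recovery": float(np.mean(recoveries))}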

I ran it on a small set of sequences (10 train, 1 val), where the corresponding sequences are relatively short (<50). Over 100 epochs the train_loss went from ~5 to ~1 and the eval_loss from ~2.55 to ~0.4. However, I am not sure whether there are still problems somewhere, because the sequence recovery is very low during the evaluation step.
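
One thing that might help to diagnose the low recovery is to generate a couple of sequences by hand and look at them directly. A rough sketch, assuming the model, tokenizer, prefix_s2t and tgt_max from above and that val still contains the raw 3Di strings in its "src" column (the beam setting is only illustrative):

import torch

# Take one raw 3Di source from the (unprocessed) validation set
src_3di = val[0]["src"]

inputs = tokenizer(f"{prefix_s2t} {src_3di}", return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model.generate(
        **inputs,
        max_length=tgt_max,
        num_beams=4,  # illustrative choice
    )

print(tokenizer.decode(out[0], skip_special_tokens=True))
print(val[0]["tgt"])  # compare against the target AA sequence

If the generated sequences look degenerate (e.g. long repeats of the same residue), that would explain a low recovery even though the loss keeps decreasing.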

Thank you again very much for your patience and help
