Hey, thank you again for your detailed answer. Things have become a bit clearer for me now. I realized that the actual token classification case will come later, when I have a different dataset, and that I will first focus on the seq2seq case for my current dataset. Sorry for the confusion. So I am preprocessing my dataset, putting the task specification token in front, and using the whole T5 model:
prefix_s2t = "<fold2AA>"

def preprocess(ex):
    """
    Preprocess examples for seq2seq training.
    Adds the <fold2AA> prefix to source sequences.
    """
    # Add prefix to source sequences (3Di)
    inputs = [f"{prefix_s2t} {src}" for src in ex["src"]]
    targets = ex["tgt"]

    # Tokenize inputs (3Di sequences)
    model_inputs = tokenizer(
        inputs,
        max_length=src_max,
        truncation=True,
        padding=False,  # DataCollator will handle padding
    )

    # Tokenize targets (AA sequences)
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(
            targets,
            max_length=tgt_max,
            truncation=True,
            padding=False,
        )

    # Add labels to model inputs
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
train_processed = train.map(preprocess, remove_columns=train.column_names, batched=True, batch_size=1)
val_processed = val.map(preprocess, remove_columns=val.column_names, batched=True, batch_size=1)
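For reference, a quick sanity check on one preprocessed example (just a rough inspection sketch using the variables above, not part of the training pipeline) can confirm that the <fold2AA> prefix and the labels come out as expected:

# Inspect one preprocessed example (sanity check only)
sample = train_processed[0]
print(tokenizer.convert_ids_to_tokens(sample["input_ids"])[:10])
print(tokenizer.decode(sample["input_ids"], skip_special_tokens=False))
# Labels are still plain token ids here; the -100 masking happens in the collator
print(tokenizer.decode(sample["labels"], skip_special_tokens=False))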
Then I set up the DataCollatorForSeq2Seq and the Seq2SeqTrainingArguments:
data_collator = DataCollatorForSeq2Seq(
    tokenizer=tokenizer,
    model=model,
    padding="max_length",
    max_length=src_max,
    label_pad_token_id=-100,
)
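To double-check the collator, I can also collate a couple of examples by hand (again only a rough inspection sketch) and verify that padded label positions are set to -100, so they are ignored by the loss:

# Collate two preprocessed examples and look at the padded labels
batch = data_collator([train_processed[i] for i in range(2)])
print(batch["input_ids"].shape, batch["labels"].shape)
print(batch["labels"][0])  # padded positions should show -100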
# Training arguments (following the safe code pattern)
training_args = Seq2SeqTrainingArguments(
    output_dir="finetuning_prostt5_safecode",
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    num_train_epochs=100,
    learning_rate=5e-5,
    max_grad_norm=1.0,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    eval_strategy="steps",
    eval_steps=100,  # Adjust based on your dataset size
    save_strategy="steps",
    save_steps=100,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    predict_with_generate=True,
    generation_max_length=tgt_max,
    group_by_length=True,
    fp16=False,
    logging_strategy="steps",
    logging_steps=10,
    logging_first_step=True,
    report_to="none",
    remove_unused_columns=False,  # added
    save_safetensors=False,
)
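The compute_metrics passed to the trainer below is where I measure sequence recovery. To be clear about what I mean by that, here is a minimal sketch of a per-residue recovery metric (not necessarily identical to my actual implementation; the whitespace handling assumes the tokenizer decodes to space-separated residues, and predict_with_generate=True so predictions are generated token ids):

import numpy as np

def compute_metrics(eval_preds):
    # Sketch: fraction of identical residues between prediction and reference
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    # Replace -100 (ignored positions) with the pad token id before decoding
    preds = np.where(preds != -100, preds, tokenizer.pad_token_id)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    recoveries = []
    for pred, ref in zip(decoded_preds, decoded_labels):
        pred_seq = pred.replace(" ", "")  # drop spaces between residue tokens
        ref_seq = ref.replace(" ", "")
        if not ref_seq:
            continue
        matches = sum(p == r for p, r in zip(pred_seq, ref_seq))
        recoveries.append(matches / len(ref_seq))
    return {"seq_recovery": float(np.mean(recoveries)) if recoveries else 0.0}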
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_processed,
    eval_dataset=val_processed,
    data_collator=data_collator,
    processing_class=tokenizer,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
I ran it on a small set of sequences (10 train, 1 val), where the corresponding sequences are relatively short (<50 residues). Over 100 epochs the train_loss went from ~5 to ~1, and the eval_loss from ~2.55 to ~0.4 at the end. However, I am not sure whether there might still be problems, because the sequence recovery is very low during evaluation.
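To narrow down where the low recovery comes from, a spot check like the following (a rough sketch reusing the variables above, run after trainer.train()) shows what the model actually generates for the validation example compared to the reference:

import torch

# Generate for the single validation example and compare to the reference
sample = val_processed[0]
input_ids = torch.tensor([sample["input_ids"]]).to(model.device)
attention_mask = torch.tensor([sample["attention_mask"]]).to(model.device)
with torch.no_grad():
    gen_ids = model.generate(
        input_ids=input_ids,
        attention_mask=attention_mask,
        max_length=tgt_max,
    )
print("pred:", tokenizer.decode(gen_ids[0], skip_special_tokens=True))
# Labels in val_processed are still raw token ids (no -100 masking yet)
print("ref: ", tokenizer.decode(sample["labels"], skip_special_tokens=True))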
Thank you again very much for your patience and help