Did you solve your issue? I think passing the optimizer is enough; you don't need to pass it again as optim="adafactor" in the TrainingArguments.
Sharing my results from transfer learning with flan-t5-small for translation.
Experiment 1:
optimizer = Adafactor(model.parameters(), scale_parameter=False, relative_step=False, warmup_init=False, lr=1e-3)
# no scheduler
Experiment 2:
optimizer = Adafactor(model.parameters(), scale_parameter=True, relative_step=True, warmup_init=True, lr=None)
lr_scheduler = AdafactorSchedule(optimizer)
Got faster convergence with experiment 1, but better end performance with experiment 2.
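For reference, this is roughly how the optimizer and scheduler are wired into the Trainer (a minimal sketch assuming the flan-t5-small setup above; train_dataset / eval_dataset are placeholders for whatever tokenized splits you already have, and the TrainingArguments are illustrative only):
from transformers import AutoModelForSeq2SeqLM, Trainer, TrainingArguments
from transformers.optimization import Adafactor, AdafactorSchedule

model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")

# Experiment 2: Adafactor driving its own relative-step schedule
optimizer = Adafactor(
    model.parameters(),
    scale_parameter=True,
    relative_step=True,
    warmup_init=True,
    lr=None,
)
lr_scheduler = AdafactorSchedule(optimizer)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="adafactor-test"),  # illustrative args only
    train_dataset=train_dataset,  # placeholder: your tokenized train split
    eval_dataset=eval_dataset,    # placeholder: your tokenized eval split
    optimizers=(optimizer, lr_scheduler),  # passing the pair here is enough; no optim="adafactor" needed
)
trainer.train()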
Hi! @pierreguillou
I tried the method you mentioned, i.e. using Adafactor in the Hugging Face transformers Trainer to fine-tune the original version of T5.
The version of transformers I am using is 4.28.1.
I used the run_translation.py script like you did. The script defaults to AdamW. Following the latest transformers documentation, I used --optim adafactor to select Adafactor and --learning_rate 1e-3 to set the learning rate to 1e-3.
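For context, the full command was along these lines (only --optim adafactor and --learning_rate 1e-3 are the point here; the model, dataset and language arguments below are placeholders standing in for my own setup):
python run_translation.py \
    --model_name_or_path t5-small \
    --do_train --do_eval \
    --source_lang en --target_lang de \
    --dataset_name wmt16 --dataset_config_name de-en \
    --source_prefix "translate English to German: " \
    --output_dir ./t5-adafactor \
    --per_device_train_batch_size 8 \
    --predict_with_generate \
    --optim adafactor \
    --learning_rate 1e-3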
Basically, this is the same way you are using it. Ultimately, I did not observe the issue you mentioned regarding eval_loss and learning_rate. However, the results I got show that using Adafactor is not as good as AdamW, all other parameters being equal (on my own dataset, of course).
I would like to ask: do you have any new findings or tips about using Adafactor in the Trainer?
Some people have mentioned multi-task fine-tuning of T5, and I am wondering if anyone has been successful in fine-tuning T5 on different task types, for example Q&A as well as summarization in the same model. I can train them separately using their corresponding models (e.g. AutoModelForQuestionAnswering or AutoModelForSeq2SeqLM). However, when I try to merge the datasets by interleaving or concatenating them and use T5ForConditionalGeneration to cover both tasks, only the summarizer works. Does anyone have examples that show multiple task types (e.g. Q&A + summarization) trained in the same model?
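For concreteness, this is roughly what I mean by merging the two tasks into one text-to-text dataset (a sketch only; the prefixes, dataset names and column mappings are my own illustrative choices, and this is the kind of setup where only the summarizer seems to learn):
from datasets import load_dataset, interleave_datasets

# Map both tasks to plain (source, target) text pairs with a task prefix,
# so a single T5ForConditionalGeneration can be trained on the mixture.
def to_qa(example):
    return {
        "source": "question: " + example["question"] + " context: " + example["context"],
        "target": example["answers"]["text"][0],
    }

def to_summarization(example):
    return {
        "source": "summarize: " + example["article"],
        "target": example["highlights"],
    }

qa = load_dataset("squad", split="train")
qa = qa.map(to_qa, remove_columns=qa.column_names)

summ = load_dataset("cnn_dailymail", "3.0.0", split="train")
summ = summ.map(to_summarization, remove_columns=summ.column_names)

# Interleave so every batch mixes both tasks instead of one task dominating
mixed = interleave_datasets([qa, summ], probabilities=[0.5, 0.5], seed=42)
# "mixed" is then tokenized and passed to Seq2SeqTrainer as usual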
would like to know as well
I'm facing something similar: passing optim="adafactor", with or without setting a learning_rate (or letting the default be set), each training phase shows a learning rate of "0.0", so my model is never updating.
Hi everyone, I'm giving mt5-small a go locally with 32k samples of varying length, up to 512 tokens. Not super familiar with training this model.
Got 2x 12GB RTX cards.
Previously I managed to successfully fine-tune MarianMT models on this hardware; the training was quite stable with 32 gradient accumulation steps, AdamW and a 5e-7 learning rate for 24 epochs, and the result performs well on the task it was fine-tuned on.
I've tried the same settings with mt5-small, except adjusting the learning rate to the advised 3e-4, and I have some concerns that perhaps some of you could help address:
- training appears to be going much slower
- why can't I fit the base-size model on 2x 12GB with samples of up to 512 tokens?
- the gradient norm has ridiculously high values
- the eval metrics have really low values
I’m using the same dataset, and the same training procedure as for MarianMT (except for learning rate). Why are the differences so large?
Here are some code snippets that might be relevant from my tuner class
@valhalla @sshleifer should I instead append the eos token to the end of each label and return the str labels instead of tokens? Where do you get the task prefix?
Is there a list of task prefixes used by Google for mT5? I guess it would be easier if I plugged into an existing prefix.
def _preprocess_function(self, examples):
    """
    Tokenize and preprocess a batch of examples for model training.
    """
    # Add task-specific prefix
    prefix = f"<{self.src_key}2{self.mt_key}>"
    inputs = [prefix + text for text in examples[self.src_key]]
    targets = examples[self.mt_key]
    model_inputs = self.tokenizer(inputs, max_length=self.samples_filter_max_tokens, padding='max_length', truncation=True)
    with self.tokenizer.as_target_tokenizer():
        labels = self.tokenizer(targets, max_length=self.samples_filter_max_tokens, padding='max_length', truncation=True)
    # Replace pad token ID with -100 to ignore in loss computation
    labels["input_ids"] = [
        [(label if label != self.tokenizer.pad_token_id else -100) for label in labels_seq]
        for labels_seq in labels["input_ids"]
    ]
    model_inputs['labels'] = labels['input_ids']
    return model_inputs
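Side note on the snippet above: as_target_tokenizer() is deprecated in newer transformers releases. If I understand correctly, the same thing can be done with the text_target argument, and the manual -100 replacement is only needed because of padding='max_length'; with dynamic padding the collator handles it. A sketch, using the same assumed class attributes as above:
def _preprocess_function(self, examples):
    """Tokenize inputs and targets in a single call via text_target."""
    prefix = f"<{self.src_key}2{self.mt_key}>"
    inputs = [prefix + text for text in examples[self.src_key]]
    model_inputs = self.tokenizer(
        inputs,
        text_target=examples[self.mt_key],
        max_length=self.samples_filter_max_tokens,
        truncation=True,  # no padding here; DataCollatorForSeq2Seq pads per batch
    )
    # DataCollatorForSeq2Seq(label_pad_token_id=-100) masks label padding,
    # so no manual replacement of pad_token_id with -100 is needed.
    return model_inputs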
@moscow25 @valhalla @sshleifer @mrm8488 should I just set adafactor=True in my Seq2SeqTrainingArguments?
What about fp16, has anyone had success?
What about multiply_by_parameter_scale=True? Is this something I have to additionally configure, or is it a default value?
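For what it's worth, my understanding (happy to be corrected) is that adafactor=True still works but is deprecated in favour of the optim argument, roughly:
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="out",
    # adafactor=True,     # older flag, still accepted but deprecated
    optim="adafactor",    # current way to select Adafactor in the Trainer
    learning_rate=1e-3,
    predict_with_generate=True,
)
As for multiply_by_parameter_scale, I believe that name comes from the original Mesh TensorFlow implementation and corresponds to scale_parameter in transformers' Adafactor class, so it only comes into play if you construct the optimizer yourself rather than via optim="adafactor".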
def run(self):
    """
    Execute the main training loop for fine-tuning the model.
    """
    self.setup()  # Ensure setup is complete, loads model, datasets

    # Preprocess datasets
    self.log.info("Tokenizing training dataset...")
    tokenized_train_dataset = self.train_dataset.map(self._preprocess_function, batched=True)
    self.log.info("Tokenizing validation dataset...")
    tokenized_eval_dataset = self.eval_dataset.map(self._preprocess_function, batched=True)

    data_collator = DataCollatorForSeq2Seq(
        tokenizer=self.tokenizer,
        model=self.model,
        label_pad_token_id=-100,
    )

    training_args = Seq2SeqTrainingArguments(
        output_dir=self.output_dir,
        logging_steps=self.logging_steps,
        save_steps=self.save_steps,
        eval_steps=self.eval_steps,
        eval_strategy="steps",
        predict_with_generate=True,
        report_to="wandb" if not self.is_test_run else [],
        metric_for_best_model=self.metric_for_best_model,
        greater_is_better=True,
        load_best_model_at_end=True,
        save_total_limit=self.save_total_limit,
        learning_rate=self.learning_rate,
        weight_decay=self.weight_decay,
        per_device_train_batch_size=self.per_device_train_batch_size,
        per_device_eval_batch_size=self.per_device_eval_batch_size,
        gradient_accumulation_steps=self.gradient_accumulation_steps,
        auto_find_batch_size=self.auto_find_batch_size,
        num_train_epochs=self.num_train_epochs,
        run_name=f"{self.run_name}",
        seed=self.seed,
        fp16=True,  # Enable mixed precision if supported
    )

    trainer = Seq2SeqTrainer(
        model=self.model,
        args=training_args,
        compute_metrics=self._compute_mt_metrics,
        train_dataset=tokenized_train_dataset,
        eval_dataset=tokenized_eval_dataset,
        tokenizer=self.tokenizer,
        data_collator=data_collator,
    )
    trainer.train()
Kind thanks!
Since this is a super old thread, I opened a follow-up here: Finetuning mT5 for specific language pair
Hello, I am trying to fine-tune T5 for translating a certain dialect. Is there a notebook built with all these tips in mind that I can use? And what is the state of training with fp16 today? Any help would be greatly appreciated, especially regarding which prefix to use and what to end the labels with. Basically, if there is starter code, that would be wonderful.