Hello everyone, I am trying to fine-tune whisper-large-v3 on my own dataset.
I followed Bofeng Huang's tutorial (on Medium here) with some tweaks.
The problem appears at validation: when compute_metrics is called, the function cannot find the tokenizer.
Is this normal? Has the Seq2SeqTrainer changed since the tutorial was written?
def compute_metrics(pred, do_normalize_eval=False):
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # replace -100 with the pad_token_id
    # label_ids[label_ids == -100] = tokenizer.pad_token_id

    # we do not want to group tokens when computing the metrics
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    if do_normalize_eval:
        pred_str = [normalizer(pred) for pred in pred_str]
        # perhaps already normalised
        label_str = [normalizer(label) for label in label_str]

    # filtering step to only evaluate the samples that correspond to non-empty references
    pred_str = [pred_str[i] for i in range(len(pred_str)) if len(label_str[i]) > 0]
    label_str = [label_str[i] for i in range(len(label_str)) if len(label_str[i]) > 0]

    wer = metric.compute(predictions=pred_str, references=label_str)
    return {"wer": wer}
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=vectorized_datasets["train"],
    eval_dataset=vectorized_datasets["test"],
    tokenizer=processor,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
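One workaround I am considering, in case the issue is that compute_metrics only sees the tokenizer as a module-level global: pass the tokenizer (and the metric) explicitly with functools.partial, so the function no longer depends on globals. This is just a sketch of the idea, not the tutorial's code; binding processor.tokenizer is my assumption about where the tokenizer lives.

```python
from functools import partial

def compute_metrics(pred, tokenizer, metric):
    # decode with the tokenizer that was passed in explicitly,
    # instead of relying on a module-level `tokenizer` variable
    pred_str = tokenizer.batch_decode(pred.predictions, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(pred.label_ids, skip_special_tokens=True)
    wer = metric.compute(predictions=pred_str, references=label_str)
    return {"wer": wer}

# then hand the bound function to the trainer (assuming `processor` and
# `metric` are the objects from the tutorial):
# compute_metrics=partial(compute_metrics,
#                         tokenizer=processor.tokenizer,
#                         metric=metric)
```

Seq2SeqTrainer only requires that compute_metrics accept a single EvalPrediction argument, so a partial with the other arguments pre-bound should be a drop-in replacement.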