Hello everyone, I am trying to fine-tune whisper-large-v3 on my own dataset.
I followed Bofeng Huang's tutorial (on Medium here) with some tweaks to get this working.
The problem happens at validation: when the trainer calls compute_metrics, the function cannot find the tokenizer. Is this expected? Has Seq2SeqTrainer changed since that tutorial was written? Here is my compute_metrics function:
def compute_metrics(pred, do_normalize_eval=False):
    pred_ids = pred.predictions
    label_ids = pred.label_ids
    # replace -100 with the pad_token_id so the labels can be decoded
    label_ids[label_ids == -100] = tokenizer.pad_token_id
    # we do not want to group tokens when computing the metrics
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)
    if do_normalize_eval:
        pred_str = [normalizer(pred) for pred in pred_str]
        # perhaps already normalised
        label_str = [normalizer(label) for label in label_str]
        # filtering step to only evaluate the samples that correspond to non-zero references
        pred_str = [pred_str[i] for i in range(len(pred_str)) if len(label_str[i]) > 0]
        label_str = [label_str[i] for i in range(len(label_str)) if len(label_str[i]) > 0]
    wer = metric.compute(predictions=pred_str, references=label_str)
    # Trainer expects a dict of metric name -> value; WER reported as a percentage
    return {"wer": 100 * wer}
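
For context, compute_metrics relies on tokenizer, metric, and normalizer existing as module-level globals. In my script I set them up roughly like in the tutorial (the checkpoint name here is just what I use, and BasicTextNormalizer is my assumption for the normalizer):

import evaluate
from transformers import WhisperProcessor
from transformers.models.whisper.english_normalizer import BasicTextNormalizer

# Assumed setup, following the tutorial: compute_metrics looks these up as globals.
processor = WhisperProcessor.from_pretrained("openai/whisper-large-v3")
tokenizer = processor.tokenizer      # used by batch_decode in compute_metrics
metric = evaluate.load("wer")        # word error rate metric from the evaluate library
normalizer = BasicTextNormalizer()   # applied when do_normalize_eval=True

And this is how I instantiate the trainer: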
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=vectorized_datasets["train"],
    eval_dataset=vectorized_datasets["test"],
    tokenizer=processor,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
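
Training itself launches fine the usual way; the failure only shows up once evaluation reaches compute_metrics:

trainer.train()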