Fine-tune Whisper: tensor size mismatch

Hello! I’m trying to follow this blog in order to fine-tune Whisper on my dataset. While training, I’m getting a tensor size mismatch error.

Although, while preparing my data, I filtered the label lengths to be less than the model’s max_length (448), as @sanchit-gandhi suggested, I’m still getting the same error :face_with_diagonal_mouth:
Here is the link to my Colab notebook.
What can I do?

Hey @RetaSy! Sorry for the delay in getting back to you! Unfortunately I can’t access your notebook (I need permissions!). Feel free to update the sharing settings and ping me here; I can then take a more detailed look!

In the meantime, could you double-check that the extra filter step is implemented before you instantiate the Trainer:

max_label_length = model.config.max_length

def filter_labels(labels):
    """Filter label sequences longer than max length"""
    return len(labels) < max_label_length

vectorized_datasets = vectorized_datasets.filter(filter_labels, input_columns=["labels"])

trainer = Seq2SeqTrainer(train_dataset=vectorized_datasets["train"], ...)
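
For completeness, the full trainer instantiation would look roughly like the sketch below. It assumes the usual objects from the fine-tuning blog (training_args, model, data_collator, compute_metrics and processor) are already defined in your notebook, and that your dataset has train/test splits:

from transformers import Seq2SeqTrainer

# Instantiate the trainer only *after* the filter step above,
# so it never sees label sequences longer than the model's max length
trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=vectorized_datasets["train"],
    eval_dataset=vectorized_datasets["test"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,
)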

Thanks!

Hi @sanchit-gandhi, I got the same problem when trying to fine-tune CamemBERT. I used the filter as you suggested and it works. However, the filter has removed too much of my dataset, and the model’s accuracy is now really bad. Is there another way to deal with this? (As far as I can tell, my error comes from /transformers/models/camembert/modeling_camembert.py, line 871, in forward.) Thanks in advance for your help.

Hey @maitrang!

Welcome to the forum and thanks for opening up your first question post :hugs: Awesome to have you here!

What you can do is first increase the generation max length to some suitably large value (e.g. 1024):

model.config.max_length = 1024

And then perform the filtering stage. By increasing the max length, we raise the filter threshold for our dataset and thus filter out less of it, which gives us more data to train on. However, it also increases the memory requirements for training, since we may now have longer sequences in our training data.
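
Putting the two steps together, here’s a minimal sketch (reusing the model, filter_labels and vectorized_datasets names from above):

# Raise the generation max length *before* filtering
model.config.max_length = 1024
max_label_length = model.config.max_length

def filter_labels(labels):
    """Keep only label sequences shorter than the raised max length"""
    return len(labels) < max_label_length

# With the higher threshold, far fewer examples are filtered out
vectorized_datasets = vectorized_datasets.filter(filter_labels, input_columns=["labels"])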

Hope that answers your question!

Hello! I’m trying to fine-tune an already fine-tuned Whisper model on my dataset. Previously, I fine-tuned it on the Common Voice dataset, and as you suggested in a previous post, I filtered it to only include data with up to 448 tokens. This gave me decent results with a Word Error Rate (WER) of 19%.

Now, I’m trying to fine-tune this model on another dataset that I created. This dataset is just a JSON file containing audio file paths and corresponding text. However, the WER got worse, reaching 60%. Can you explain why this might have happened?

Additionally, I noticed that after filtering to 448 tokens, I lost a significant amount of data in my dataset. Is there a way to increase this limit so that I don’t have to cut off most of the data? What should I do now—should I start fine-tuning Whisper from scratch by combining both datasets, or did I make a mistake during the second round of fine-tuning?