Trainer RuntimeError: The size of tensor a (462) must match the size of tensor b (448) at non-singleton dimension 1

So I checked the Whisper feature extractor and the Whisper tokenizer.
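For context, `feature_extractor` and `tokenizer` in the snippets below are the usual Whisper processing classes from Transformers, loaded roughly like this (the checkpoint name and language are just placeholders for whatever model you are fine-tuning):

    from transformers import WhisperFeatureExtractor, WhisperTokenizer

    # placeholder checkpoint/language: substitute the model you are fine-tuning
    feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")
    tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-small",
                                                 language="English",
                                                 task="transcribe")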

I assume the problem here is that there are 7 samples whose audio exceeds 30 s and 10 samples whose label length exceeds the model's maximum of 448 tokens.
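Counting the offending samples can be done with something along these lines (just a sketch, assuming `dataset` is a Datasets `Dataset` with a 16 kHz `audio` column and a `raw_transcription` column):

    # rough check: how many clips are longer than 30 s,
    # and how many transcriptions exceed 448 label tokens?
    too_long_audio = dataset.filter(
        lambda x: len(x["audio"]["array"]) > 30 * 16000
    )
    too_long_labels = dataset.filter(
        lambda x: len(tokenizer(x["raw_transcription"]).input_ids) > 448
    )
    print(len(too_long_audio), len(too_long_labels))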
So I tried truncating both the audio and the labels:

MAX_DURATION_IN_SECONDS = 30.0
max_input_length = int(MAX_DURATION_IN_SECONDS * 16000)  # 480,000 samples at 16 kHz

    # inside the dataset-preparation function:
    # compute log-Mel input features from the input audio array,
    # truncating audio longer than 30 s
    batch["input_features"] = feature_extractor(audio["array"],
                                                sampling_rate=audio["sampling_rate"],
                                                max_length=max_input_length,
                                                truncation=True).input_features[0]

    # encode target text to label ids, truncating to the model's
    # maximum label length of 448 tokens
    batch["labels"] = tokenizer(batch["raw_transcription"],
                                truncation=True,
                                max_length=448).input_ids

I guess the truncation setting in the feature extractor doesn’t really matter (?), since the audio is always padded or truncated to 30 s and the log-Mel output has a fixed shape (80 mel bins × 3000 frames) anyway. Either way, this works for me and lets training proceed. Please correct me if my understanding is wrong!
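A quick way to confirm that fixed shape (just a sketch, feeding synthetic audio of different lengths to the same `feature_extractor`):

    import numpy as np

    # 10 s vs. 40 s of silence at 16 kHz: same output shape either way
    for seconds in (10, 40):
        audio = np.zeros(seconds * 16000, dtype=np.float32)
        feats = feature_extractor(audio, sampling_rate=16000).input_features[0]
        print(seconds, feats.shape)  # (80, 3000) in both cases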
