[Open-to-the-community] Whisper fine-tuning event

Hey @steja! This is pretty unlucky :sweat_smile: It means there's a sample in the training set whose label sequence is 504 tokens long, but the model has a max length of 448. Could you add an extra filter step to your dataset before you instantiate the Trainer:

# The model's maximum target length (448 for Whisper)
max_label_length = model.config.max_length

def filter_labels(labels):
    """Keep only samples whose label sequence fits within the model's max length."""
    return len(labels) < max_label_length

# Drop the over-length samples from every split before training
vectorized_datasets = vectorized_datasets.filter(filter_labels, input_columns=["labels"])

This should fix the issue!
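If you want to double-check how many samples get dropped, you can run the same filter while logging the row counts before and after. This is a minimal sketch, assuming vectorized_datasets is a 🤗 Datasets DatasetDict and that filter_labels and max_label_length are defined as above (num_rows_before is just a hypothetical helper name):

# Hypothetical sanity check: count how many rows each split loses to the filter
num_rows_before = {split: len(ds) for split, ds in vectorized_datasets.items()}

vectorized_datasets = vectorized_datasets.filter(filter_labels, input_columns=["labels"])

for split, ds in vectorized_datasets.items():
    removed = num_rows_before[split] - len(ds)
    print(f"{split}: removed {removed} sample(s) with {max_label_length}+ label tokens")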
