Hi, could somebody please guide me on the correct way to assess the performance of a fine-tuned model such as Whisper? Should I calculate the WER that Whisper gives before fine-tuning on my entire dataset (train+validation+test), and then compare this with the performance on the test set after fine-tuning? I’m asking because the fine-tuned model (Whisper large) still gives a high WER, so I would like to compare the two and check how much fine-tuning is actually improving things. I would appreciate your input!
https://pubs.aip.org/asa/jel/article/4/2/025206/3267247/Evaluating-OpenAI-s-Whisper-ASR-Performance
There seem to be several GitHub repositories and papers on evaluating Whisper this way, using the Hugging Face evaluate library.
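For reference, the WER metric itself comes from the evaluate library (it needs the jiwer package installed) and can be sanity-checked on a pair of plain strings:
import evaluate
wer = evaluate.load("wer")
# one word deleted out of four reference words -> WER = 0.25
print(wer.compute(predictions=["the cat sat"], references=["the cat sat down"]))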
If you use the Hugging Face library, it would look something like this (I think there are some typos because it’s from Hugging Chat…)
To evaluate the performance of a fine-tuned Whisper model, follow these steps to calculate the Word Error Rate (WER) before and after fine-tuning:
Step-by-Step Explanation
1. Install Necessary Libraries
- Install the required libraries for handling datasets, models, and evaluation (jiwer is needed by the WER metric).
pip install datasets evaluate jiwer transformers accelerate torch torchaudio
2. Import Libraries
- Import the necessary Python libraries.
import torch
import evaluate
from datasets import load_dataset, Audio
from transformers import WhisperForConditionalGeneration, WhisperProcessor, TrainingArguments, Trainer
3. Load and Prepare Your Dataset
- Load your custom dataset with training, validation, and test splits. Ensure each example has 'audio' and 'text' fields, and cast the audio to the 16 kHz sampling rate Whisper expects (a sketch for building such a dataset from local files follows this step).
dataset = load_dataset('my_custom_dataset', split='train')
val_dataset = load_dataset('my_custom_dataset', split='validation')
test_dataset = load_dataset('my_custom_dataset', split='test')
dataset = dataset.cast_column('audio', Audio(sampling_rate=16000))
val_dataset = val_dataset.cast_column('audio', Audio(sampling_rate=16000))
test_dataset = test_dataset.cast_column('audio', Audio(sampling_rate=16000))
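If your data lives as local audio files plus transcripts rather than as a dataset on the Hub, a minimal sketch for getting it into this shape (the CSV file names and the audio_path/text column names are just placeholders) could be:
from datasets import load_dataset, Audio
# Hypothetical CSVs with an 'audio_path' column (file paths) and a 'text' column (transcripts)
files = {"train": "train.csv", "validation": "validation.csv", "test": "test.csv"}
raw = load_dataset("csv", data_files=files)
raw = raw.rename_column("audio_path", "audio")
# Casting the path column to Audio decodes each file and resamples it to 16 kHz
raw = raw.cast_column("audio", Audio(sampling_rate=16000))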
4. Define the Whisper Model and Processor
- Initialize the base Whisper model and its processor (the processor bundles the feature extractor and the tokenizer).
device = "cuda" if torch.cuda.is_available() else "cpu"
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large").to(device)
processor = WhisperProcessor.from_pretrained("openai/whisper-large")
5. Fine-Tune the Model
- Set up training arguments and use the Trainer class to fine-tune the model. The Trainer also needs the examples converted to log-mel input_features and tokenized labels, plus a collator that pads them; both are sketched right after this step.
training_args = TrainingArguments(
    output_dir='whisper-finetuned',
    num_train_epochs=30,
    learning_rate=1e-4,
    per_device_train_batch_size=8,
    gradient_checkpointing=True,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    eval_dataset=val_dataset,
    data_collator=data_collator,
)
trainer.train()
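The Trainer call above assumes the datasets have already been mapped to model inputs and that data_collator exists. A minimal sketch of those two pieces, loosely following the Hugging Face Whisper fine-tuning guide (the helper names prepare_example and DataCollatorSpeechSeq2SeqWithPadding are illustrative), to be run before trainer.train():
from dataclasses import dataclass
# Convert audio to log-mel input features and text to label ids
def prepare_example(example):
    audio = example['audio']
    example['input_features'] = processor(audio['array'], sampling_rate=audio['sampling_rate']).input_features[0]
    example['labels'] = processor.tokenizer(example['text']).input_ids
    return example
dataset = dataset.map(prepare_example)
val_dataset = val_dataset.map(prepare_example)
# Collator that pads input features and label ids separately within each batch
@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: WhisperProcessor
    def __call__(self, features):
        input_features = [{"input_features": f["input_features"]} for f in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")
        label_features = [{"input_ids": f["labels"]} for f in features]
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")
        # Replace padding with -100 so it is ignored by the loss
        batch["labels"] = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)
        return batch
data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor)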
6. Calculate WER for Base and Fine-Tuned Models
- Define a function to compute WER using the evaluate library.
def calculate_wer(current_model, dataset):
    wer = evaluate.load("wer")
    references = []
    hypotheses = []
    for example in dataset:
        audio = example['audio']
        # Convert the waveform to log-mel input features
        inputs = processor(audio['array'], sampling_rate=audio['sampling_rate'], return_tensors="pt")
        input_features = inputs.input_features.to(current_model.device)
        # Transcribe
        with torch.no_grad():
            predicted_ids = current_model.generate(input_features)
        transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
        # Get reference
        reference = example['text']
        # Append to lists
        hypotheses.append(transcription)
        references.append(reference)
    # Calculate WER (returned as a fraction, e.g. 0.25 means 25%)
    wer_score = wer.compute(predictions=hypotheses, references=references)
    return wer_score
7. Evaluate the Base Model
- Calculate WER on the test set with the base model. Do this before fine-tuning (or on a freshly loaded copy of "openai/whisper-large"), since trainer.train() updates model in place.
base_wer = calculate_wer(model, test_dataset)
print(f"Base model WER: {100 * base_wer:.2f}%")
8. Evaluate the Fine-Tuned Model
- Save the fine-tuned model (e.g. with trainer.save_model('whisper-finetuned')), reload it, and calculate WER on the same test set.
fine_tuned_model = WhisperForConditionalGeneration.from_pretrained("whisper-finetuned").to(device)
fine_tuned_wer = calculate_wer(fine_tuned_model, test_dataset)
print(f"Fine-tuned model WER: {100 * fine_tuned_wer:.2f}%")
9. Compare the Results
- Compare the two WER scores on the same test set to see how much fine-tuning actually helps; a small snippet for reporting the improvement follows.
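For example, assuming base_wer and fine_tuned_wer from the previous steps (both fractions returned by wer.compute), the comparison could be reported like this:
absolute_reduction = base_wer - fine_tuned_wer
relative_reduction = absolute_reduction / base_wer if base_wer > 0 else 0.0
print(f"WER: {100 * base_wer:.2f}% -> {100 * fine_tuned_wer:.2f}%")
print(f"Absolute reduction: {100 * absolute_reduction:.2f} points")
print(f"Relative reduction: {100 * relative_reduction:.1f}%")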
Final Answer
Here is the complete code to assess the performance of a fine-tuned Whisper model by calculating the WER before and after fine-tuning:
import torch
import evaluate
from dataclasses import dataclass
from datasets import load_dataset, Audio
from transformers import WhisperForConditionalGeneration, WhisperProcessor, TrainingArguments, Trainer
device = "cuda" if torch.cuda.is_available() else "cpu"
# Load datasets (each example needs an 'audio' and a 'text' field)
train_dataset = load_dataset('my_custom_dataset', split='train')
val_dataset = load_dataset('my_custom_dataset', split='validation')
test_dataset = load_dataset('my_custom_dataset', split='test')
# Whisper expects 16 kHz audio
train_dataset = train_dataset.cast_column('audio', Audio(sampling_rate=16000))
val_dataset = val_dataset.cast_column('audio', Audio(sampling_rate=16000))
test_dataset = test_dataset.cast_column('audio', Audio(sampling_rate=16000))
# Initialize model and processor (feature extractor + tokenizer)
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large").to(device)
processor = WhisperProcessor.from_pretrained("openai/whisper-large")
# Function to calculate WER (returned as a fraction, e.g. 0.25 means 25%)
def calculate_wer(current_model, dataset):
    wer = evaluate.load("wer")
    references = []
    hypotheses = []
    for example in dataset:
        audio = example['audio']
        inputs = processor(audio['array'], sampling_rate=audio['sampling_rate'], return_tensors="pt")
        input_features = inputs.input_features.to(current_model.device)
        # Transcribe
        with torch.no_grad():
            predicted_ids = current_model.generate(input_features)
        transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
        # Get reference
        reference = example['text']
        # Append to lists
        hypotheses.append(transcription)
        references.append(reference)
    # Calculate WER
    wer_score = wer.compute(predictions=hypotheses, references=references)
    return wer_score
# Evaluate base model on the test set *before* fine-tuning (trainer.train() updates `model` in place)
base_wer = calculate_wer(model, test_dataset)
print(f"Base model WER: {100 * base_wer:.2f}%")
# Convert audio to log-mel input features and text to label ids
def prepare_example(example):
    audio = example['audio']
    example['input_features'] = processor(audio['array'], sampling_rate=audio['sampling_rate']).input_features[0]
    example['labels'] = processor.tokenizer(example['text']).input_ids
    return example
train_dataset = train_dataset.map(prepare_example)
val_dataset = val_dataset.map(prepare_example)
# Collator that pads input features and label ids separately within each batch
@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: WhisperProcessor
    def __call__(self, features):
        input_features = [{"input_features": f["input_features"]} for f in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")
        label_features = [{"input_ids": f["labels"]} for f in features]
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")
        # Replace padding with -100 so it is ignored by the loss
        batch["labels"] = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)
        return batch
# Set up training arguments
training_args = TrainingArguments(
    output_dir='whisper-finetuned',
    num_train_epochs=30,
    learning_rate=1e-4,
    per_device_train_batch_size=8,
    gradient_checkpointing=True,
)
# Initialize and train the model
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=DataCollatorSpeechSeq2SeqWithPadding(processor),
)
trainer.train()
trainer.save_model('whisper-finetuned')
# Evaluate fine-tuned model on the same test set
fine_tuned_model = WhisperForConditionalGeneration.from_pretrained("whisper-finetuned").to(device)
fine_tuned_wer = calculate_wer(fine_tuned_model, test_dataset)
print(f"Fine-tuned model WER: {100 * fine_tuned_wer:.2f}%")
This code will guide you through fine-tuning the Whisper model and evaluating its performance using WER, allowing you to compare the improvement achieved through fine-tuning.
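One more thing worth checking, since the question mentions a surprisingly high WER: Whisper outputs cased, punctuated text, so if the reference transcripts are formatted differently the WER will be inflated. A common mitigation (a sketch using the basic text normalizer that ships with transformers, applied inside calculate_wer just before wer.compute) is to normalize both sides:
from transformers.models.whisper.english_normalizer import BasicTextNormalizer
normalizer = BasicTextNormalizer()  # lowercases and strips punctuation
hypotheses = [normalizer(h) for h in hypotheses]
references = [normalizer(r) for r in references]
wer_score = wer.compute(predictions=hypotheses, references=references)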