Evaluating performance before and after fine-tuning

Hi, could somebody please guide me on how to correctly assess the performance of a fine-tuned model such as Whisper? Should I calculate the WER that Whisper gives before fine-tuning on my entire dataset (train + validation + test), and then compare this with the performance on the test set after fine-tuning? I’m asking because the fine-tuned model (Whisper large) still gives a high WER, so I would like to compare the two and check how much fine-tuning is improving it. I would appreciate your input!


https://pubs.aip.org/asa/jel/article/4/2/025206/3267247/Evaluating-OpenAI-s-Whisper-ASR-Performance
There seem to be several GitHub repositories and papers on evaluating Whisper with the Evaluate library.

If you use the Hugging Face library, it would look something like this (I think there are some typos because it’s from Hugging Chat…)


To evaluate the performance of a fine-tuned Whisper model, follow these steps to calculate the Word Error Rate (WER) before and after fine-tuning:

Step-by-Step Explanation

  1. Install Necessary Libraries

    • Install the required libraries for handling datasets, models, and evaluation (jiwer is the backend that the evaluate WER metric needs).
    pip install datasets evaluate jiwer transformers torch torchaudio accelerate
    
  2. Import Libraries

    • Import the necessary Python libraries.
    import torch
    import evaluate
    from datasets import load_dataset, Audio
    from transformers import (
        WhisperForConditionalGeneration,
        WhisperProcessor,
        Seq2SeqTrainingArguments,
        Seq2SeqTrainer,
    )
    
  3. Load and Prepare Your Dataset

    • Load your custom dataset with training, validation, and test splits. Ensure each example has ‘audio’ and ‘text’ fields, and resample the audio to the 16 kHz rate Whisper expects.
    dataset = load_dataset('my_custom_dataset')
    dataset = dataset.cast_column('audio', Audio(sampling_rate=16000))
    train_dataset, val_dataset, test_dataset = dataset['train'], dataset['validation'], dataset['test']
    
  4. Define the Whisper Model and Tokenizer

    • Initialize the base Whisper model and processor (the processor bundles the feature extractor and the tokenizer).
    model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large")
    processor = WhisperProcessor.from_pretrained("openai/whisper-large")
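    • The Trainer cannot consume raw audio and text directly, so map the training and validation splits to model inputs first. A minimal sketch, following the usual Hugging Face recipe (prepare_example is an illustrative helper name):
    def prepare_example(example):
        audio = example['audio']
        # Log-Mel input features from the 16 kHz waveform
        example['input_features'] = processor(
            audio['array'], sampling_rate=audio['sampling_rate']
        ).input_features[0]
        # Tokenized reference transcription as label ids
        example['labels'] = processor.tokenizer(example['text']).input_ids
        return example
    
    train_dataset = train_dataset.map(prepare_example, remove_columns=train_dataset.column_names)
    val_dataset = val_dataset.map(prepare_example, remove_columns=val_dataset.column_names)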
    
  5. Fine-Tune the Model

    • Set up training arguments and use the Seq2SeqTrainer class (the seq2seq variant suits speech-to-text models) to fine-tune the model on the preprocessed splits. It needs a data collator that pads the audio features and label ids separately; a minimal sketch follows this step.
    training_args = Seq2SeqTrainingArguments(
        output_dir='whisper-finetuned',
        num_train_epochs=30,
        learning_rate=1e-4,  # a smaller rate such as 1e-5 is common for whisper-large
        per_device_train_batch_size=8,
        gradient_checkpointing=True,
    )
    
    trainer = Seq2SeqTrainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=val_dataset,
        data_collator=data_collator,  # see the sketch after this step
    )
    
    trainer.train()
    trainer.save_model()  # writes the fine-tuned model to output_dir for later loading
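    • The data collator used above is not provided out of the box, so define something like the following before constructing the trainer. This is a minimal sketch along the lines of the Hugging Face Whisper fine-tuning tutorial (the class name is illustrative):
    from dataclasses import dataclass
    from typing import Any
    
    @dataclass
    class DataCollatorSpeechSeq2SeqWithPadding:
        processor: Any
        decoder_start_token_id: int
    
        def __call__(self, features):
            # Pad the log-Mel features and the tokenized labels separately
            input_features = [{"input_features": f["input_features"]} for f in features]
            batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")
            label_features = [{"input_ids": f["labels"]} for f in features]
            labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")
            # Replace padding with -100 so it is ignored by the loss
            labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)
            # Drop a leading start token if the tokenizer already added one,
            # since the model prepends it again internally
            if (labels[:, 0] == self.decoder_start_token_id).all().cpu().item():
                labels = labels[:, 1:]
            batch["labels"] = labels
            return batch
    
    data_collator = DataCollatorSpeechSeq2SeqWithPadding(
        processor=processor, decoder_start_token_id=model.config.decoder_start_token_id
    )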
    
  6. Calculate WER for Base and Fine-Tuned Models

    • Define a function that transcribes every example with a given model and computes the WER with the evaluate library. Note that the metric expects the keyword arguments predictions and references, and returns a fraction (e.g. 0.15), not a percentage.
    def calculate_wer(model, processor, dataset):
        wer = evaluate.load("wer")
        references = []
        hypotheses = []
        model.eval()
        for example in dataset:
            audio = example['audio']
            inputs = processor(audio['array'], sampling_rate=audio['sampling_rate'], return_tensors="pt")
            input_features = inputs.input_features.to(model.device)
            # Transcribe
            with torch.no_grad():
                predicted_ids = model.generate(input_features)
            transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
            # Collect hypothesis and reference
            hypotheses.append(transcription)
            references.append(example['text'])
        # Calculate WER (a fraction, not a percentage)
        wer_score = wer.compute(predictions=hypotheses, references=references)
        return wer_score
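    • One caveat: Whisper produces cased, punctuated text, and WER is sensitive to both, so raw scores often look inflated. A common option (assuming a reasonably recent transformers version) is to normalize both sides inside calculate_wer just before computing the score:
    from transformers.models.whisper.english_normalizer import BasicTextNormalizer
    
    normalizer = BasicTextNormalizer()
    # Strip casing and punctuation from hypotheses and references alike
    hypotheses = [normalizer(h) for h in hypotheses]
    references = [normalizer(r) for r in references]
    wer_score = wer.compute(predictions=hypotheses, references=references)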
    
  7. Evaluate the Base Model

    • Calculate WER with the base model. Because the Trainer updated the model object in place, reload a fresh copy of the pretrained checkpoint for this measurement (or simply run this step once before training).
    base_model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large").to("cuda")
    base_wer = calculate_wer(base_model, processor, test_dataset)
    print(f"Base model WER: {100 * base_wer:.2f}%")
    
  8. Evaluate the Fine-Tuned Model

    • Load the fine-tuned model and calculate WER on the same test set.
    fine_tuned_model = WhisperForConditionalGeneration.from_pretrained("whisper-finetuned").to("cuda")
    fine_tuned_wer = calculate_wer(fine_tuned_model, processor, test_dataset)
    print(f"Fine-tuned model WER: {100 * fine_tuned_wer:.2f}%")
    
  9. Compare the Results

    • Compare the two WER scores to see how much fine-tuning helps. For a fair comparison, evaluate both models on the same held-out test set; measuring the base model on the full dataset (including training data) and the fine-tuned model on the test set alone would not be comparable.
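    • For example, a simple way to report the comparison (assuming both scores are fractions from the function above):
    absolute_gain = base_wer - fine_tuned_wer
    relative_gain = 100 * absolute_gain / base_wer
    print(f"WER improved by {100 * absolute_gain:.2f} points ({relative_gain:.1f}% relative)")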

Final Answer

Here is the complete code to assess the performance of a fine-tuned Whisper model by calculating the WER before and after fine-tuning:

import torch
import evaluate
from dataclasses import dataclass
from typing import Any
from datasets import load_dataset, Audio
from transformers import (
    WhisperForConditionalGeneration,
    WhisperProcessor,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
)

# Load datasets and resample the audio to the 16 kHz rate Whisper expects
dataset = load_dataset('my_custom_dataset')
dataset = dataset.cast_column('audio', Audio(sampling_rate=16000))
train_dataset = dataset['train']
val_dataset = dataset['validation']
test_dataset = dataset['test']

# Initialize model and processor (feature extractor + tokenizer)
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large").to("cuda")
processor = WhisperProcessor.from_pretrained("openai/whisper-large")

# Function to calculate WER (returned as a fraction, e.g. 0.15)
def calculate_wer(current_model, processor, dataset):
    wer = evaluate.load("wer")
    references = []
    hypotheses = []
    current_model.eval()
    for example in dataset:
        audio = example['audio']
        inputs = processor(audio['array'], sampling_rate=audio['sampling_rate'], return_tensors="pt")
        input_features = inputs.input_features.to(current_model.device)
        # Transcribe
        with torch.no_grad():
            predicted_ids = current_model.generate(input_features)
        transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
        # Collect hypothesis and reference
        hypotheses.append(transcription)
        references.append(example['text'])
    return wer.compute(predictions=hypotheses, references=references)

# Evaluate the base model BEFORE fine-tuning (the Trainer updates the model in place)
base_wer = calculate_wer(model, processor, test_dataset)
print(f"Base model WER: {100 * base_wer:.2f}%")

# Convert raw audio/text into model inputs for training
def prepare_example(example):
    audio = example['audio']
    example['input_features'] = processor(
        audio['array'], sampling_rate=audio['sampling_rate']
    ).input_features[0]
    example['labels'] = processor.tokenizer(example['text']).input_ids
    return example

train_dataset = train_dataset.map(prepare_example, remove_columns=train_dataset.column_names)
val_dataset = val_dataset.map(prepare_example, remove_columns=val_dataset.column_names)

# Data collator that pads audio features and label ids separately
@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any
    decoder_start_token_id: int

    def __call__(self, features):
        input_features = [{"input_features": f["input_features"]} for f in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")
        label_features = [{"input_ids": f["labels"]} for f in features]
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")
        # Mask padding with -100 so it is ignored by the loss
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)
        # Drop a leading start token; the model prepends it again internally
        if (labels[:, 0] == self.decoder_start_token_id).all().cpu().item():
            labels = labels[:, 1:]
        batch["labels"] = labels
        return batch

# Set up training arguments
training_args = Seq2SeqTrainingArguments(
    output_dir='whisper-finetuned',
    num_train_epochs=30,
    learning_rate=1e-4,  # a smaller rate such as 1e-5 is common for whisper-large
    per_device_train_batch_size=8,
    gradient_checkpointing=True,
)

# Initialize and train the model
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=DataCollatorSpeechSeq2SeqWithPadding(
        processor=processor, decoder_start_token_id=model.config.decoder_start_token_id
    ),
)
trainer.train()
trainer.save_model()  # writes the fine-tuned model to output_dir

# Evaluate the fine-tuned model on the same test set
fine_tuned_model = WhisperForConditionalGeneration.from_pretrained("whisper-finetuned").to("cuda")
fine_tuned_wer = calculate_wer(fine_tuned_model, processor, test_dataset)
print(f"Fine-tuned model WER: {100 * fine_tuned_wer:.2f}%")

This walks you through fine-tuning the Whisper model and evaluating it with WER, so you can measure how much fine-tuning improves performance on your test set.